[00:38:55] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1003861 [00:39:01] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1003861 (owner: 10TrainBranchBot) [00:57:34] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1003861 (owner: 10TrainBranchBot) [01:24:13] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004344 (https://phabricator.wikimedia.org/T349774) [02:08:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:38:36] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:13:36] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:53:25] (SystemdUnitFailed) firing: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:55:32] (03PS6) 10KartikMistry: WIP: Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 
(https://phabricator.wikimedia.org/T298235) [03:57:52] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:53:25] (SystemdUnitFailed) resolved: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:57:52] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:14:19] (03CR) 10ArielGlenn: [C: 03+2] Remove incorrect comments from the dumptextpass backfill script [dumps] - 10https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882) (owner: 10ArielGlenn) [05:14:43] (03Merged) 10jenkins-bot: Remove incorrect comments from the dumptextpass backfill script [dumps] - 10https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882) (owner: 10ArielGlenn) [05:15:48] (03CR) 10ArielGlenn: "recheck" [dumps] - 10https://gerrit.wikimedia.org/r/1004332 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [05:29:58] (03CR) 10Hashar: "To summarize that removes configuration settings which got broken years ago and our usage of them has been superseded by the Checks tab. 
T" [puppet] - 10https://gerrit.wikimedia.org/r/1002938 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [05:40:44] jouncebot: next [05:40:44] In 2 hour(s) and 19 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T0800) [05:49:24] (03PS1) 10Marostegui: Revert "mariadb: Promote pc2014 to pc1 master" [puppet] - 10https://gerrit.wikimedia.org/r/1004180 [05:49:29] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004181 [05:52:14] (03PS2) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004181 [05:52:18] (03PS2) 10Marostegui: Revert "mariadb: Promote pc2014 to pc1 master" [puppet] - 10https://gerrit.wikimedia.org/r/1004180 [05:53:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Primary switchover pc1 T356371 [05:53:36] T356371: Switchover pc1 master (pc2014 -> pc2011) - https://phabricator.wikimedia.org/T356371 [05:53:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Primary switchover pc1 T356371 [05:54:21] ACKNOWLEDGEMENT - MariaDB Replica IO: s2 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:54:21] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:54:21] ACKNOWLEDGEMENT - MariaDB Replica IO: x1 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui T357878 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:54:21] ACKNOWLEDGEMENT - MariaDB Replica Lag: s2 on db2097 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:54:21] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db2097 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:54:22] ACKNOWLEDGEMENT - MariaDB Replica Lag: x1 on db2097 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:54:22] ACKNOWLEDGEMENT - MariaDB Replica SQL: s2 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:54:23] ACKNOWLEDGEMENT - MariaDB Replica SQL: s6 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:54:23] ACKNOWLEDGEMENT - MariaDB Replica SQL: x1 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:54:24] ACKNOWLEDGEMENT - MariaDB read only s2 on db2097 is CRITICAL: Could not connect to localhost:3312 Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:54:24] ACKNOWLEDGEMENT - MariaDB read only s6 on db2097 is CRITICAL: Could not connect to localhost:3316 Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:54:25] ACKNOWLEDGEMENT - MariaDB read only x1 on db2097 is CRITICAL: Could not connect to localhost:3320 Marostegui T357878 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:54:25] ACKNOWLEDGEMENT - mysqld processes on db2097 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Marostegui T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:57:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004181 (owner: 10Marostegui) [05:57:17] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Promote pc2014 to pc1 master" [puppet] - 10https://gerrit.wikimedia.org/r/1004180 (owner: 10Marostegui) [05:57:53] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004181 (owner: 10Marostegui) [05:58:31] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:1004181|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] [06:03:54] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9554047 (10Marostegui) [06:04:38] (03PS3) 10ArielGlenn: Make backfill script safer for automated use [dumps] - 10https://gerrit.wikimedia.org/r/1004332 (https://phabricator.wikimedia.org/T252396) [06:06:44] (03PS1) 10DLynch: Launch the Visual Editor edit check a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004351 (https://phabricator.wikimedia.org/T342930) [06:08:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:08:51] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1004181|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [06:08:56] !log marostegui@deploy2002 marostegui: Continuing with sync [06:10:52] (03PS1) 10Marostegui: mariadb: Move db1170 to s2 [puppet] - 10https://gerrit.wikimedia.org/r/1004352 (https://phabricator.wikimedia.org/T354826) [06:11:04] (03PS4) 10ArielGlenn: Make backfill script safer for automated use [dumps] - 10https://gerrit.wikimedia.org/r/1004332 (https://phabricator.wikimedia.org/T252396) [06:11:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1170 T354826', diff saved to https://phabricator.wikimedia.org/P56951 and previous config saved to /var/cache/conftool/dbconfig/20240219-061121-root.json [06:11:27] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [06:13:26] (03PS2) 10Marostegui: mariadb: Move db1170 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/1004352 (https://phabricator.wikimedia.org/T354826) [06:14:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1170 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/1004352 (https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [06:15:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db1170 in s7 T354826', diff saved to https://phabricator.wikimedia.org/P56952 and previous config saved to /var/cache/conftool/dbconfig/20240219-061548-marostegui.json [06:17:33] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:1004181|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] (duration: 19m 02s) [06:19:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db1170 in s7 T354826', diff saved to https://phabricator.wikimedia.org/P56953 and previous config saved to /var/cache/conftool/dbconfig/20240219-061919-marostegui.json [06:19:34] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [06:19:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 5%: After rearraging sections T354826', diff saved to 
https://phabricator.wikimedia.org/P56954 and previous config saved to /var/cache/conftool/dbconfig/20240219-061957-root.json [06:34:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1244 T354826', diff saved to https://phabricator.wikimedia.org/P56955 and previous config saved to /var/cache/conftool/dbconfig/20240219-063457-root.json [06:35:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56956 and previous config saved to /var/cache/conftool/dbconfig/20240219-063502-root.json [06:35:05] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [06:38:07] (03PS1) 10Marostegui: mariadb: Move db1244 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/1004502 (https://phabricator.wikimedia.org/T354826) [06:40:22] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1244 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/1004502 (https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [06:41:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db1244 in s4 T354826', diff saved to https://phabricator.wikimedia.org/P56957 and previous config saved to /var/cache/conftool/dbconfig/20240219-064157-marostegui.json [06:42:03] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [06:43:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db1244 in s4 T354826', diff saved to https://phabricator.wikimedia.org/P56958 and previous config saved to /var/cache/conftool/dbconfig/20240219-064350-marostegui.json [06:50:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56959 and previous config saved to /var/cache/conftool/dbconfig/20240219-065007-root.json [06:50:12] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [06:50:48] 
!log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56960 and previous config saved to /var/cache/conftool/dbconfig/20240219-065048-root.json [06:54:42] (03PS1) 10Marostegui: mariadb: Place db1246 in s2 [puppet] - 10https://gerrit.wikimedia.org/r/1004503 (https://phabricator.wikimedia.org/T354826) [06:54:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1246 T354826', diff saved to https://phabricator.wikimedia.org/P56961 and previous config saved to /var/cache/conftool/dbconfig/20240219-065456-root.json [06:57:37] (03CR) 10Marostegui: [C: 03+2] mariadb: Place db1246 in s2 [puppet] - 10https://gerrit.wikimedia.org/r/1004503 (https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [06:58:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1246 multiinstance', diff saved to https://phabricator.wikimedia.org/P56962 and previous config saved to /var/cache/conftool/dbconfig/20240219-065848-marostegui.json [07:02:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db1246 in s2 T354826', diff saved to https://phabricator.wikimedia.org/P56963 and previous config saved to /var/cache/conftool/dbconfig/20240219-070212-marostegui.json [07:02:29] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:05:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56964 and previous config saved to /var/cache/conftool/dbconfig/20240219-070511-root.json [07:05:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56965 and previous config saved to /var/cache/conftool/dbconfig/20240219-070552-root.json [07:05:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 
5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56966 and previous config saved to /var/cache/conftool/dbconfig/20240219-070556-root.json [07:08:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1213 T354826', diff saved to https://phabricator.wikimedia.org/P56967 and previous config saved to /var/cache/conftool/dbconfig/20240219-070815-root.json [07:08:21] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:11:19] (03PS1) 10Marostegui: mariadb: Move db1213 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/1004504 (https://phabricator.wikimedia.org/T354826) [07:14:35] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1213 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/1004504 (https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [07:16:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1213 multiinstance', diff saved to https://phabricator.wikimedia.org/P56968 and previous config saved to /var/cache/conftool/dbconfig/20240219-071604-marostegui.json [07:16:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db1213 in s5 T354826', diff saved to https://phabricator.wikimedia.org/P56969 and previous config saved to /var/cache/conftool/dbconfig/20240219-071658-marostegui.json [07:17:04] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:20:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56970 and previous config saved to /var/cache/conftool/dbconfig/20240219-072016-root.json [07:20:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56971 and previous config saved to /var/cache/conftool/dbconfig/20240219-072057-root.json [07:21:01] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db1246 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56972 and previous config saved to /var/cache/conftool/dbconfig/20240219-072101-root.json [07:23:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56973 and previous config saved to /var/cache/conftool/dbconfig/20240219-072315-root.json [07:23:21] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:35:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56974 and previous config saved to /var/cache/conftool/dbconfig/20240219-073521-root.json [07:35:28] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:36:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56975 and previous config saved to /var/cache/conftool/dbconfig/20240219-073602-root.json [07:36:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56976 and previous config saved to /var/cache/conftool/dbconfig/20240219-073606-root.json [07:38:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56977 and previous config saved to /var/cache/conftool/dbconfig/20240219-073820-root.json [07:41:37] (03PS1) 10Marostegui: mariadb: Place db2168 in s7 [puppet] - 10https://gerrit.wikimedia.org/r/1004602 (https://phabricator.wikimedia.org/T354826) [07:41:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2168 T354826', diff saved to 
https://phabricator.wikimedia.org/P56978 and previous config saved to /var/cache/conftool/dbconfig/20240219-074148-root.json [07:41:54] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:43:39] (03CR) 10Marostegui: [C: 03+2] mariadb: Place db2168 in s7 [puppet] - 10https://gerrit.wikimedia.org/r/1004602 (https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [07:44:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2168 multiinstance', diff saved to https://phabricator.wikimedia.org/P56979 and previous config saved to /var/cache/conftool/dbconfig/20240219-074450-marostegui.json [07:46:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db2168 in s7 T354826', diff saved to https://phabricator.wikimedia.org/P56980 and previous config saved to /var/cache/conftool/dbconfig/20240219-074609-marostegui.json [07:50:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56981 and previous config saved to /var/cache/conftool/dbconfig/20240219-075035-root.json [07:50:41] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:51:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56982 and previous config saved to /var/cache/conftool/dbconfig/20240219-075107-root.json [07:51:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56983 and previous config saved to /var/cache/conftool/dbconfig/20240219-075111-root.json [07:51:59] (03CR) 10Awight: "Oof, thanks for this quick revert!" 
[extensions/Cite] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1003843 (https://phabricator.wikimedia.org/T357745) (owner: 10Bartosz Dziewoński) [07:53:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56984 and previous config saved to /var/cache/conftool/dbconfig/20240219-075325-root.json [07:59:48] 10sre-alert-triage, 10cloud-services-team: Alert in need of triage: Wikitech-static MW version up to date (instance wikitech-static.wikimedia.org) - https://phabricator.wikimedia.org/T357880#9554174 (10LSobanski) [08:00:05] Amir1 and Urbanecm: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:03:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2167 T354826', diff saved to https://phabricator.wikimedia.org/P56985 and previous config saved to /var/cache/conftool/dbconfig/20240219-080322-root.json [08:03:38] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:05:04] (03PS1) 10Marostegui: mariadb: Move db2167 to s8 [puppet] - 10https://gerrit.wikimedia.org/r/1004611 (https://phabricator.wikimedia.org/T354826) [08:05:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56986 and previous config saved to /var/cache/conftool/dbconfig/20240219-080540-root.json [08:06:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56987 and previous config saved to /var/cache/conftool/dbconfig/20240219-080612-root.json [08:06:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: After rearraging 
sections T354826', diff saved to https://phabricator.wikimedia.org/P56988 and previous config saved to /var/cache/conftool/dbconfig/20240219-080616-root.json [08:06:29] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db2167 to s8 [puppet] - 10https://gerrit.wikimedia.org/r/1004611 (https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [08:07:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2167 multiinstance', diff saved to https://phabricator.wikimedia.org/P56989 and previous config saved to /var/cache/conftool/dbconfig/20240219-080744-marostegui.json [08:08:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56990 and previous config saved to /var/cache/conftool/dbconfig/20240219-080831-root.json [08:11:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db2167 in s8 T354826', diff saved to https://phabricator.wikimedia.org/P56991 and previous config saved to /var/cache/conftool/dbconfig/20240219-081132-marostegui.json [08:11:38] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:14:13] (03CR) 10Brouberol: [C: 03+2] superset: set puppet service states to production [puppet] - 10https://gerrit.wikimedia.org/r/1002362 (https://phabricator.wikimedia.org/T356483) (owner: 10Brouberol) [08:15:02] (03PS1) 10Marostegui: db2167: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1004612 [08:16:12] !log installing runc security updates on buster [08:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:30] (03CR) 10Marostegui: [C: 03+2] db2167: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1004612 (owner: 10Marostegui) [08:19:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [08:19:04] (03PS1) 10KartikMistry: WIP: Enable SectionTranslation for 
Wikipedias where ContentTranslation is in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004613 (https://phabricator.wikimedia.org/T353734) [08:19:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [08:19:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2119 (T352010)', diff saved to https://phabricator.wikimedia.org/P56992 and previous config saved to /var/cache/conftool/dbconfig/20240219-081920-ladsgroup.json [08:19:25] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [08:20:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56993 and previous config saved to /var/cache/conftool/dbconfig/20240219-082045-root.json [08:20:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:toolforge: mailrelay: reject outbound emails without a sender [puppet] - 10https://gerrit.wikimedia.org/r/935093 (https://phabricator.wikimedia.org/T337259) (owner: 10Majavah) [08:20:51] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:20:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/935093 (https://phabricator.wikimedia.org/T337259) (owner: 10Majavah) [08:21:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56994 and previous config saved to /var/cache/conftool/dbconfig/20240219-082121-root.json [08:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:22:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [08:22:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [08:22:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [08:23:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [08:23:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2166 T354826', diff saved to https://phabricator.wikimedia.org/P56995 and previous config saved to /var/cache/conftool/dbconfig/20240219-082321-root.json [08:23:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56996 and previous config saved to /var/cache/conftool/dbconfig/20240219-082336-root.json [08:25:19] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2166.codfw.wmnet onto db2167.codfw.wmnet [08:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - 
https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:33:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [08:33:58] (03PS1) 10Ladsgroup: Set fawiki to read new in pagelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004614 (https://phabricator.wikimedia.org/T351237) [08:34:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [08:35:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56997 and previous config saved to /var/cache/conftool/dbconfig/20240219-083550-root.json [08:35:56] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:37:48] 10sre-alert-triage, 10cloud-services-team, 10wikitech.wikimedia.org: Alert in need of triage: Wikitech-static MW version up to date (instance wikitech-static.wikimedia.org) - https://phabricator.wikimedia.org/T357880#9554249 (10Peachey88) [08:38:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56998 and previous config saved to /var/cache/conftool/dbconfig/20240219-083840-root.json [08:40:01] (03CR) 10Fabfur: [C: 03+1] "looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1004126 (https://phabricator.wikimedia.org/T352744) (owner: 10Ssingh) [08:50:33] jouncebot: nowandnext [08:50:33] For the next 0 hour(s) and 9 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T0800) [08:50:34] In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T1100) [08:50:41] (03CR) 10Ladsgroup: [C: 03+2] Set fawiki to read new in pagelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004614 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [08:50:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P56999 and previous config saved to /var/cache/conftool/dbconfig/20240219-085055-root.json [08:51:01] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:51:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004614 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [08:51:22] (03Merged) 10jenkins-bot: Set fawiki to read new in pagelinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004614 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [08:51:42] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:1004614|Set fawiki to read new in pagelinks (T351237)]] [08:51:47] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [08:53:06] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1004614|Set fawiki to read new in pagelinks (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:53:59] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [09:01:26] !log 
ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:1004614|Set fawiki to read new in pagelinks (T351237)]] (duration: 09m 43s) [09:01:30] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [09:03:01] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9554270 (10MoritzMuehlenhoff) [09:06:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57000 and previous config saved to /var/cache/conftool/dbconfig/20240219-090600-root.json [09:06:15] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [09:06:51] !log taavi@cumin1002 conftool action : set/pooled=inactive; selector: name=cloudweb1004.wikimedia.org [09:10:30] !log installing gnutls28 security updates on bookworm [09:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:21] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudweb1004.wikimedia.org with OS bullseye [09:18:36] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:23:36] (JobUnavailable) resolved: Reduced availability for job nutcracker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:24:52] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb1004.wikimedia.org with reason: host reimage [09:27:22] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudweb1004.wikimedia.org with reason: host reimage [09:41:54] gmodena: is there anything specific I need to know to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1004156 ? Or would you rather I let y'all handle it? [09:42:19] It seems like it being a flink app means it needs a bit of care according to wikitech? [09:49:03] !log Draining mw2442 - failed RAID - T357380 [09:49:06] (MediaWikiEditFailures) firing: (2) Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:16] T357380: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380 [09:53:15] Session loss seems to be recovering on its own, maybe due to nutcracker failure earlier? [09:53:49] No, too late afterwards [09:54:06] (MediaWikiEditFailures) resolved: (2) Elevated MediaWiki edit failures (session_loss) for cluster appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [09:55:09] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb1004.wikimedia.org with OS bullseye [09:59:10] !log taavi@cumin1002 conftool action : set/pooled=yes; selector: name=cloudweb1004.wikimedia.org [10:00:25] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:01:17] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but 
pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:02:04] !log taavi@cumin1002 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::eqiad1::cloudweb [10:02:57] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:03:12] interesting [10:03:27] 👀 [10:03:33] grafana down [10:03:36] at least to me [10:03:39] the thanos-query service looks up [10:03:55] Feb 19 10:03:02 titan1001 thanos-query[306233]: level=warn ts=2024-02-19T10:01:46.819105691Z caller=endpointset.go:446 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from prometheus3003:29900: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prometheus3003:29900 [10:03:57] Grafana loaded fine for me (read-only) [10:04:06] I cannot query it [10:04:14] (03CR) 10Majavah: [C: 03+2] hieradata: convert cloudweb1003/4 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1004620 (owner: 10Majavah) [10:04:23] yeah grafana has no data for me [10:04:32] as in, grafana loads but cannot query data [10:04:32] Yes, that [10:04:43] !log restarting thanos-query.service - titan1001 [10:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:58] yes, that was the thing I was going to suggest based on ticket [10:05:17] it's starting... 
[10:05:20] T356788 [10:05:21] T356788: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 [10:05:30] thanos.wikimedia.org returns 502 currently [10:05:34] if it comes back, the issue clearly hasn't been fixed [10:05:38] !log restarting thanos-query.service - titan1001 - T356788 [10:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:19] systemd failed to kill it [10:06:48] if it is unresponsive we can restart the server [10:07:04] I will check 1002 meanwhile [10:07:07] I'll try a manual sigkill [10:07:27] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:07:40] titan1002 is at 100% load [10:07:46] but otherwise fine [10:07:53] Manual sigkill worked, thanos-query started [10:07:56] Grafana shows data again in my browser [10:07:57] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:08:09] only 1 server cannot take the full load [10:08:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:08:36] (JobUnavailable) firing: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:08:36] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:08:51] I'll restart thanos-query on 1002 as well yeah? [10:08:53] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:09:20] yeah, it hasn't moved back from 100% [10:09:56] It is probably a load issue + knockdown effect [10:09:59] !log restarting thanos-query.service - titan1002 - T356788 [10:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:05] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [10:10:22] PROBLEM - ensure kvm processes are running on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [10:10:30] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::eqiad1::cloudweb [10:10:57] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:11:19] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:11:20] ^ acked the alert [10:11:34] should be up on both
now [10:11:44] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cloudvirt1032.eqiad.wmnet with reason: reimage [10:11:51] metrics are back now [10:11:57] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cloudvirt1032.eqiad.wmnet with reason: reimage [10:12:06] grafana and thanos.wikimedia.org work again [10:12:51] load is "normal" [10:12:59] I had to SIGKILL both manually btw [10:13:07] :-/ [10:13:36] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:13:51] I'm not sure we're out of the woods, I don't have working queries on my end [10:14:21] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:14:22] yeah, something is not right [10:14:23] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9554373 (10taavi) [10:14:29] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:14:49] some metrics are still failing to load [10:14:51] yeah dashboards are empty now again [10:15:17] I see queries go through but there's probably a deeper issue [10:15:21] godog you around ? 
[10:15:26] titan1002 overloaded again [10:15:40] back to 100% [10:16:21] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:16:29] and now it's ok [10:16:29] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:16:35] jynus: did you do something to 1002? [10:16:37] it is flapping [10:16:45] no, I am only monitoring [10:16:54] would ask first the active responder [10:17:05] which would be me X) [10:17:08] yep [10:17:23] Ah you meant you would ask me before doing something [10:17:24] unless htop overloads the server, ofc [10:17:25] gotcha [10:17:49] so I think there are 2 bad things here [10:18:00] 1 is the original issue (oom, overload, traffic?) [10:18:15] I didn't see any recent oom in dmesg [10:18:27] and 2 is that on overload, one server takes over and makes the other overload [10:18:36] (JobUnavailable) firing: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:18:36] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:18:41] I still don't have working grafana explore queries [10:18:50] it is flapping on and off [10:19:23] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:19:30] can we depool both, restart and pool both again when they are restarted properly?
[10:20:16] jelto: we can try [10:20:18] 10Puppet, 10Infrastructure-Foundations: os-reports: KeyError: 'apt2001.wikimedia.org' - https://phabricator.wikimedia.org/T357884#9554395 (10taavi) [10:20:23] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:20:29] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:20:36] I think we also need someone with a deeper understanding of that stack [10:20:57] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:21:57] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:22:03] !log depooling thanos-query eqiad - T356788 [10:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:08] T356788: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 [10:22:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:22:26] !log cgoubert@cumin2002 conftool action : set/pooled=false; 
selector: dnsdisc=thanos-query,name=eqiad [10:22:29] ^ acked pa.ge [10:22:46] !log restarting thanos-query.service - titan1002 - T356788 [10:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:36] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:24:31] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:26:12] !log restarting thanos-query.service - titan1001 - T356788 [10:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:53] even depooled it gets pinned at 100% [10:26:54] I'm a bit surprised load is still very high on titan hosts also with the depool [10:26:58] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:28:36] (JobUnavailable) firing: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:28:36] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:57] PROBLEM - PyBal backends health check on lvs2013 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan2002.codfw.wmnet are marked down but pooled: thanos-web_443: Servers titan2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:30:48] thanos in codfw is now overloaded as well? [10:30:53] (PuppetZeroResources) firing: Puppet has failed generate resources on wdqs1017:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:30:57] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on kubernetes2026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:31:48] (PuppetZeroResources) firing: Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:32:11] claime what is the restart saying on titan eqiad hosts? 
[10:32:43] they're restarted and load looks ok [10:32:46] repooling [10:32:48] (PuppetZeroResources) firing: Puppet has failed generate resources on ml-serve-ctrl2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:32:53] ack [10:32:58] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:33:02] !log repooling thanos-query eqiad - T356788 [10:33:05] !log cgoubert@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=thanos-query,name=eqiad [10:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:07] T356788: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 [10:33:34] I can confirm, load on titan eqiad hosts looks fine again, acking the pa.ge ^ [10:33:36] (JobUnavailable) firing: (9) Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:33:48] (PuppetZeroResources) firing: Puppet has failed generate resources on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:33:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan2002.codfw.wmnet are marked down but pooled: thanos-web_443: Servers titan2002.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [10:34:29] and now it's thanos in codfw that fails and is pinned [10:34:38] we're just pinballing the load between the two [10:34:49] (PuppetZeroResources) firing: Puppet has failed generate resources on restbase2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:34:49] (PuppetZeroResources) firing: Puppet has failed generate resources on ganeti2030:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:34:53] I can't connect to titan2002.codfw.wmnet [10:35:10] claime: sigh, I'll take a look too [10:35:49] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2060:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:35:53] (PuppetZeroResources) firing: (11) Puppet has failed generate resources on kubemaster1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:36:42] hm probably we can try the same, with depool, restart, repool? 
[10:36:52] I got a connection to titan2002 now [10:36:57] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:36:58] same [10:37:15] load is high but ok-ish [10:37:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57001 and previous config saved to /var/cache/conftool/dbconfig/20240219-103741-root.json [10:37:46] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [10:37:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2166.codfw.wmnet onto db2167.codfw.wmnet [10:37:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on ml-serve-ctrl2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:37:49] still no data in grafana though [10:37:53] (PuppetZeroResources) firing: Puppet has failed generate resources on ml-staging-etcd2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:38:02] (PuppetZeroResources) firing: Puppet has failed generate resources on kafka-main2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:38:06] (PuppetZeroResources) firing: Puppet has failed generate resources on kubestagemaster2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:38:33] ah, I got data now [10:38:35] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on lists2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:38:48] (PuppetZeroResources) firing: Puppet has failed generate resources on ml-etcd2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:38:49] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on wdqs2018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:38:51] me too (but quite slow) [10:38:54] <_joe_> something broke puppet everywhere I guess [10:39:23] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:39:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2137 T354826', diff saved to https://phabricator.wikimedia.org/P57002 and previous config saved to /var/cache/conftool/dbconfig/20240219-103939-root.json [10:39:49] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic2096:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:39:49] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on restbase2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:39:58] 
(PuppetZeroResources) firing: (3) Puppet has failed generate resources on ganeti2018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:40:29] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Exception while executing '/srv/puppet_code/environments/production/utils/get_config7.sh': Cannot run program "/srv/puppet_code/environments/production/utils/get_config7.sh" (in directory "."): error=0, Failed to exec spawn helper: pid: 3479235, signal: 11 on node sretest1001.eqiad.wmnet [10:40:49] (PuppetZeroResources) firing: Puppet has failed generate resources on mc2047:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:40:53] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on elastic2060:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:41:00] that doesn't exist [10:41:01] started around 9:00 UTC today roughly [10:41:03] awesome [10:41:06] (PuppetZeroResources) firing: (18) Puppet has failed generate resources on kubemaster1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:41:57] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:42:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on ml-serve-ctrl2001:9100 -
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:42:49] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on ml-staging-etcd2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:42:53] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on kafka-main2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:42:58] (03PS1) 10Marostegui: mariadb: Move db2137 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/1004625 (https://phabricator.wikimedia.org/T354826) [10:43:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on wdqs2018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:43:51] jelto claime I'll poke thanos stuff on titan [10:43:55] titan2* that is [10:44:40] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db2137 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/1004625 (https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [10:44:43] And now puppet works again [10:44:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on elastic2096:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:44:48] (PuppetZeroResources) firing: Puppet has failed generate resources on cloudelastic1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:44:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on restbase2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:44:53] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on ganeti2018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:45:53] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:45:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db2137 in s4 T354826', diff saved to https://phabricator.wikimedia.org/P57004 and previous config saved to /var/cache/conftool/dbconfig/20240219-104556-marostegui.json [10:45:57] (PuppetZeroResources) resolved: Puppet has failed generate resources on wdqs1017:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:46:02] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [10:46:06] claime: I can not confirm that, which host did you try? I tried phab2002 for example. Did you do anything?. 
Same error with get_config7.sh [10:46:06] (PuppetZeroResources) firing: (24) Puppet has failed generate resources on kubemaster1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:46:17] jelto: sretest1001 [10:46:44] And no, didn't do anything [10:46:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:46:55] (03PS1) 10Marostegui: db2137: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1004627 [10:46:59] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:47:05] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:47:05] ah, puppet works on phab2002 again now [10:47:53] (PuppetZeroResources) firing: Puppet has failed generate resources on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:48:11] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on ml-serve-ctrl2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:48:13] clouddb2002-dev is still failing [10:48:13] (03CR) 10Marostegui: [C: 03+2] db2137: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1004627 (owner: 10Marostegui) [10:48:17] !log bounce thanos-query on titan2* - T356788 [10:48:20] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on kafka-main2001:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:22] T356788: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 [10:48:36] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:48:45] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on lists2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:48:50] (JobUnavailable) firing: (9) Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:49:03] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on wdqs2010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:49:25] claime: did I get it right from SAL that both codfw and eqiad are pooled for titan/thanos-query ATM? 
[10:49:48] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on elastic2090:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:49:49] (PuppetZeroResources) firing: Puppet has failed generate resources on rdb2007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:49:51] godog: I didn't touch the pooling status of codfw [10:49:53] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on restbase2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:49:58] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on ganeti2018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:49:58] godog: I depooled and repooled eqiad [10:50:05] claime: ack, thank you [10:50:12] so yeah, both pooled now [10:50:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mc-wf2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:50:49] (PuppetZeroResources) firing: Puppet has failed generate resources on mc-gp2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:50:57] (PuppetZeroResources) firing: (12) Puppet has failed generate resources on elastic1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:51:06] (PuppetZeroResources) firing: (34) Puppet has failed generate resources on kubemaster1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:51:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:52:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 into s4, depooled', diff saved to https://phabricator.wikimedia.org/P57005 and previous config saved to /var/cache/conftool/dbconfig/20240219-105211-marostegui.json [10:52:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57006 and previous config saved to /var/cache/conftool/dbconfig/20240219-105246-root.json [10:52:48] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on ml-serve-ctrl2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:52:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on kafka-main2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:52:51] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [10:53:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57007 
and previous config saved to /var/cache/conftool/dbconfig/20240219-105302-root.json [10:53:23] puppet fails still quite high, but slowly dropping: https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=2 [10:53:28] hmm, my puppet run on clouddb2002-dev seems to be getting stuck on `Debug: Starting connection for https://puppetserver2003.codfw.wmnet:8140` [10:53:40] oh, it timed out eventually [10:53:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on ml-etcd2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:53:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on wdqs2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:54:03] and still failing with the same error in the end [10:54:05] (03PS6) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) [10:54:07] https://www.irccloud.com/pastebin/I489gzkT/ [10:54:40] !log bounce thanos-query on titan1* - T356788 [10:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:44] T356788: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 [10:54:46] puppet is being troubleshooted on -sre [10:54:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on cloudelastic1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:54:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on ganeti2018:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:54:55] (03CR) 10Ayounsi: "Updated :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [10:54:59] taavi: thanks [10:55:20] a thanks taavi, I missed this [10:55:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mc-wf2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:55:53] (PuppetZeroResources) firing: (12) Puppet has failed generate resources on elastic1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:56:11] (PuppetZeroResources) firing: (45) Puppet has failed generate resources on kubemaster1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:56:16] !log restarting puppetserver on puppetserver1001 [10:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:53] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:57:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:58:02] (PuppetZeroResources) firing: (6) Puppet has failed generate 
resources on ml-serve-ctrl2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:58:11] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on ml-staging-etcd2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:58:36] (JobUnavailable) resolved: (9) Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:58:48] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on wdqs2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:59:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:59:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:59:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:59:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:59:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on rdb2007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:59:49] 
(PuppetZeroResources) firing: (6) Puppet has failed generate resources on elastic2089:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:59:49] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on restbase2021:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:59:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T357189)', diff saved to https://phabricator.wikimedia.org/P57008 and previous config saved to /var/cache/conftool/dbconfig/20240219-105949-arnaudb.json [10:59:54] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:59:58] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on ganeti2014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T1100) [11:00:27] !log roll-restarting puppetserver [11:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:30] !log sudo cumin -s 10 -b 1 A:puppetserver 'systemctl restart puppetserver.service' [11:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:48] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on mc-wf2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:00:57] (PuppetZeroResources) firing: (13) Puppet has failed generate resources on elastic1056:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:01:02] (PuppetZeroResources) firing: (52) Puppet has failed generate resources on kubemaster1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:01:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:03:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T357189)', diff saved to https://phabricator.wikimedia.org/P57009 and previous config saved to /var/cache/conftool/dbconfig/20240219-110312-arnaudb.json [11:03:23] (03PS1) 10Filippo Giunchedi: thanos: set only MemoryMax limit [puppet] - 10https://gerrit.wikimedia.org/r/1004630 (https://phabricator.wikimedia.org/T356788) [11:04:10] claime jelto can I trouble you for a quick review of ^ [11:04:49] (PuppetZeroResources) firing: (9) Puppet has failed generate resources on ganeti1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:05:25] (03CR) 10Clément Goubert: [C: 03+1] thanos: set only MemoryMax limit [puppet] - 10https://gerrit.wikimedia.org/r/1004630 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [11:05:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57010 and previous config saved to /var/cache/conftool/dbconfig/20240219-110525-root.json [11:05:31] T354826: Re-arrange core multi-instance hosts 
- https://phabricator.wikimedia.org/T354826 [11:05:49] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mc-wf2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:05:50] godog: iiuc memoryhigh makes it so it never actually releases the memory but throttles the process? [11:05:53] (PuppetZeroResources) firing: (14) Puppet has failed generate resources on elastic1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:05:54] So it never oomkills [11:06:02] (PuppetZeroResources) firing: (48) Puppet has failed generate resources on kubemaster2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:06:31] (+1'd anyways) [11:06:47] claime: exactly, hence why I'm removing memoryhigh [11:06:49] (PuppetZeroResources) firing: Puppet has failed generate resources on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:06:53] godog: yep, good call [11:07:02] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:07:07] claime: thank you! 
deploying now [11:07:11] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: set only MemoryMax limit [puppet] - 10https://gerrit.wikimedia.org/r/1004630 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [11:07:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on ml-staging-etcd2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:07:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on ml-serve-ctrl2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:07:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57011 and previous config saved to /var/cache/conftool/dbconfig/20240219-110751-root.json [11:08:06] !log puppetserver roll-restart done [11:08:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57012 and previous config saved to /var/cache/conftool/dbconfig/20240219-110806-root.json [11:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on wdqs2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:09:48] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on elastic2089:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:09:49] 
(PuppetZeroResources) firing: (8) Puppet has failed generate resources on ganeti1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:09:49] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on restbase2021:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:09:53] !log Running puppet on failed nodes [11:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:03] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on elastic2089:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:10:48] (PuppetZeroResources) firing: (9) Puppet has failed generate resources on mc-wf2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:10:49] (PuppetZeroResources) firing: (16) Puppet has failed generate resources on elastic1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:10:49] !log sudo cumin -b 20 -p 95 '*' 'run-puppet-agent -q --failed-only' [11:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:06] (PuppetZeroResources) firing: (43) Puppet has failed generate resources on kubemaster2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:11:11] (PuppetZeroResources) firing: Puppet has failed 
generate resources on ml-cache2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:11:14] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1032.eqiad.wmnet with OS bookworm [11:11:17] (PuppetZeroResources) firing: (43) Puppet has failed generate resources on kubemaster2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:11:27] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9554542 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1032.eqiad.wmnet with OS... [11:11:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:12:19] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on rdb2007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:13:49] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on wdqs2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:14:49] (PuppetZeroResources) firing: (9) Puppet has failed generate resources on restbase1028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:14:49] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on ganeti1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:15:58] (PuppetZeroResources) firing: (15) Puppet has failed generate resources on elastic1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:16:02] (PuppetZeroResources) firing: (10) Puppet has failed generate resources on mc-wf2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:16:11] (PuppetZeroResources) firing: (44) Puppet has failed generate resources on kubemaster2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:16:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:17:03] (03PS5) 10ArielGlenn: Make backfill script safer for automated use [dumps] - 10https://gerrit.wikimedia.org/r/1004332 (https://phabricator.wikimedia.org/T252396) [11:17:49] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on kubestagemaster2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:17:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on 
ml-serve-ctrl2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:17:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on ml-staging-etcd2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:18:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P57013 and previous config saved to /var/cache/conftool/dbconfig/20240219-111819-arnaudb.json [11:18:36] (JobUnavailable) firing: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:18:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on wdqs2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:19:49] (PuppetZeroResources) firing: (9) Puppet has failed generate resources on restbase1028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:19:49] (PuppetZeroResources) firing: (9) Puppet has failed generate resources on ganeti1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:20:20] I'm just going to silence PuppetZeroResources for like an hour if that's ok claime jelto ? 
[11:20:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57014 and previous config saved to /var/cache/conftool/dbconfig/20240219-112030-root.json [11:20:35] godog: yeah x) [11:20:36] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [11:20:46] yep works for me, I have the dashboard open anyways [11:20:49] (PuppetZeroResources) firing: (10) Puppet has failed generate resources on mc-wf2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:20:53] (PuppetZeroResources) firing: (15) Puppet has failed generate resources on elastic1076:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:20:55] godog: I'll tell you when the cumin run's done, we'll de-silence then [11:20:57] (PuppetZeroResources) firing: (43) Puppet has failed generate resources on kubemaster2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:21:04] (03CR) 10ArielGlenn: "This has been tested in deployment prep to make sure the following work:" [dumps] - 10https://gerrit.wikimedia.org/r/1004332 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [11:21:12] claime: cheers [11:22:31] (03CR) 10ArielGlenn: [C: 03+2] Make backfill script safer for automated use [dumps] - 10https://gerrit.wikimedia.org/r/1004332 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [11:22:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57015 and 
previous config saved to /var/cache/conftool/dbconfig/20240219-112256-root.json [11:22:59] (03Merged) 10jenkins-bot: Make backfill script safer for automated use [dumps] - 10https://gerrit.wikimedia.org/r/1004332 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [11:23:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57016 and previous config saved to /var/cache/conftool/dbconfig/20240219-112311-root.json [11:23:19] !log update cr*-codfw firewall policy for puppetmaster2003 -> puppetserver2003 rename [11:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:36] (JobUnavailable) resolved: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:23:39] (03PS1) 10Marostegui: mariadb: Move db2138 to s2 [puppet] - 10https://gerrit.wikimedia.org/r/1004633 (https://phabricator.wikimedia.org/T354826) [11:24:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2138 T354826', diff saved to https://phabricator.wikimedia.org/P57017 and previous config saved to /var/cache/conftool/dbconfig/20240219-112405-root.json [11:25:40] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db2138 to s2 [puppet] - 10https://gerrit.wikimedia.org/r/1004633 (https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [11:28:13] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1032.eqiad.wmnet with reason: host reimage [11:29:21] 10SRE, 10User-aborrero: ACPI kernel failure on debian installer last step - https://phabricator.wikimedia.org/T357896#9554605 (10aborrero) [11:34:10] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudvirt1032.eqiad.wmnet with reason: host reimage [11:36:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'place db2138 in s2', diff saved to https://phabricator.wikimedia.org/P57018 and previous config saved to /var/cache/conftool/dbconfig/20240219-113622-marostegui.json [11:36:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57019 and previous config saved to /var/cache/conftool/dbconfig/20240219-113627-root.json [11:36:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P57020 and previous config saved to /var/cache/conftool/dbconfig/20240219-113632-arnaudb.json [11:36:41] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [11:37:30] !log ariel@deploy2002 Started deploy [dumps/dumps@0d1f9be]: improvements to page content history backfill script [11:37:35] !log ariel@deploy2002 Finished deploy [dumps/dumps@0d1f9be]: improvements to page content history backfill script (duration: 00m 04s) [11:38:42] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:39:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db2138 in s2 T354826', diff saved to https://phabricator.wikimedia.org/P57021 and previous config saved to /var/cache/conftool/dbconfig/20240219-113926-marostegui.json [11:39:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57022 and previous config saved to /var/cache/conftool/dbconfig/20240219-113931-root.json [11:39:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 5%: After 
rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57023 and previous config saved to /var/cache/conftool/dbconfig/20240219-113931-root.json [11:39:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57024 and previous config saved to /var/cache/conftool/dbconfig/20240219-113934-root.json [11:41:17] 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: db2097 rebooted itself - https://phabricator.wikimedia.org/T357878#9554672 (10jcrespo) a:03Papaul Hey, @Papaul I saw a login at 02/17/2024 05:24:45, barely a minute after a flash reset. Maybe some dc op saw something else when logging in, or there was some... [11:43:42] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:47:18] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9554681 (10AndrewTavis_WMDE) Thank you @hnowlan for the check in here. Final word on this is coming from @Manuel who will be back in the office on... 
[11:51:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57025 and previous config saved to /var/cache/conftool/dbconfig/20240219-115132-root.json [11:51:38] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [11:51:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T357189)', diff saved to https://phabricator.wikimedia.org/P57026 and previous config saved to /var/cache/conftool/dbconfig/20240219-115138-arnaudb.json [11:51:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:51:43] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [11:52:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:52:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T357189)', diff saved to https://phabricator.wikimedia.org/P57027 and previous config saved to /var/cache/conftool/dbconfig/20240219-115210-arnaudb.json [11:54:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57028 and previous config saved to /var/cache/conftool/dbconfig/20240219-115435-root.json [11:54:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57029 and previous config saved to /var/cache/conftool/dbconfig/20240219-115436-root.json [11:54:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57030 and previous config saved to 
/var/cache/conftool/dbconfig/20240219-115439-root.json [11:55:18] (PuppetZeroResources) resolved: Puppet has failed generate resources on kafka-main1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:55:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T357189)', diff saved to https://phabricator.wikimedia.org/P57031 and previous config saved to /var/cache/conftool/dbconfig/20240219-115534-arnaudb.json [11:55:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on kubernetes2015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:58:09] (03PS1) 10Hnowlan: conftool-data: remove thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1004637 [12:01:39] (03CR) 10Clément Goubert: [C: 03+1] conftool-data: remove thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1004637 (owner: 10Hnowlan) [12:01:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on testvm2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:03:47] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1032.eqiad.wmnet with OS bookworm [12:03:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on wdqs2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:03:58] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9554726 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1032.eqiad.wmnet with OS book... [12:04:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on restbase2030:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:06:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57032 and previous config saved to /var/cache/conftool/dbconfig/20240219-120637-root.json [12:06:43] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [12:09:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57033 and previous config saved to /var/cache/conftool/dbconfig/20240219-120941-root.json [12:09:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57034 and previous config saved to /var/cache/conftool/dbconfig/20240219-120951-root.json [12:10:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P57035 and previous config saved to /var/cache/conftool/dbconfig/20240219-121040-arnaudb.json [12:11:18] (03PS5) 10Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) [12:13:38] (03PS6) 10Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) [12:14:12] (03CR) 10Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions (031 comment) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: 10Samtar) [12:14:35] jouncebot: nowandnext [12:14:35] No deployments scheduled for the next 1 hour(s) and 45 minute(s) [12:14:35] In 1 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T1400) [12:18:30] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1004645 (owner: 10L10n-bot) [12:18:50] !log samtar@deploy2002 backport Cancelled [12:19:23] !log samtar@deploy2002 backport Cancelled [12:19:35] (03PS7) 10Samtar: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) [12:20:18] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:20:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: 10Samtar) [12:21:28] (03Merged) 10jenkins-bot: IS/CS: Add wmgEditRecoveryDefaultUserOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: 10Samtar) [12:21:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57037 and previous config saved to /var/cache/conftool/dbconfig/20240219-122142-root.json [12:21:45] !log samtar@deploy2002 Started scap: Backport for [[gerrit:992632|IS/CS: Add wmgEditRecoveryDefaultUserOptions (T350653)]] [12:21:51] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [12:21:59] T350653: Add user preference to 
enable Edit Recovery - https://phabricator.wikimedia.org/T350653 [12:23:06] !log samtar@deploy2002 samtar: Backport for [[gerrit:992632|IS/CS: Add wmgEditRecoveryDefaultUserOptions (T350653)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:23:10] * TheresNoTime testing [12:23:37] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003863 [12:24:31] !log samtar@deploy2002 samtar: Continuing with sync [12:24:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57038 and previous config saved to /var/cache/conftool/dbconfig/20240219-122446-root.json [12:25:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P57039 and previous config saved to /var/cache/conftool/dbconfig/20240219-122547-arnaudb.json [12:30:35] (03CR) 10Clément Goubert: [C: 03+1] mw-jobrunner: bump replicas in order to migrate refreshLinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004062 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:31:13] (03CR) 10Clément Goubert: [C: 03+1] changeprop-jobqueue: migrate refreshLinks to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004063 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:31:17] (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: bump replicas in order to migrate refreshLinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004062 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:32:07] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:992632|IS/CS: Add wmgEditRecoveryDefaultUserOptions (T350653)]] (duration: 10m 21s) [12:32:11] T350653: Add user preference to enable Edit Recovery - https://phabricator.wikimedia.org/T350653 [12:32:26] (03Merged) 10jenkins-bot: mw-jobrunner: bump 
replicas in order to migrate refreshLinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004062 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:35:26] !log hnowlan@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [12:35:26] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [12:35:42] !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [12:35:47] !log hnowlan@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [12:35:49] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1032 [12:35:54] !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1032 [12:36:10] !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [12:36:10] !log hnowlan@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [12:36:25] !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [12:36:30] !log hnowlan@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [12:37:34] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1032 [12:37:59] !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1032 [12:38:02] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: migrate refreshLinks to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004063 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:39:45] (03Merged) 10jenkins-bot: changeprop-jobqueue: migrate refreshLinks to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004063 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:39:52] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db2138 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57040 and previous config saved to /var/cache/conftool/dbconfig/20240219-123951-root.json [12:39:56] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [12:40:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T357189)', diff saved to https://phabricator.wikimedia.org/P57041 and previous config saved to /var/cache/conftool/dbconfig/20240219-124054-arnaudb.json [12:40:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:41:01] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:41:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:41:10] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1004645 (owner: 10L10n-bot) [12:41:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T357189)', diff saved to https://phabricator.wikimedia.org/P57042 and previous config saved to /var/cache/conftool/dbconfig/20240219-124115-arnaudb.json [12:42:04] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:42:38] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:43:46] !log migrating refreshLinks to k8s jobrunners [12:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:51] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:44:23] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [12:44:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T357189)', diff saved to https://phabricator.wikimedia.org/P57043 and previous config saved to /var/cache/conftool/dbconfig/20240219-124439-arnaudb.json [12:48:21] (03PS1) 10Arnaudb: mariadb: toggle notifications on db2169 [puppet] - 10https://gerrit.wikimedia.org/r/1003864 (https://phabricator.wikimedia.org/T343674) [12:53:51] (03PS1) 10Hnowlan: mw-jobrunner: begin to scale down replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004662 (https://phabricator.wikimedia.org/T349796) [12:54:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57044 and previous config saved to /var/cache/conftool/dbconfig/20240219-125456-root.json [12:55:02] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [12:55:09] * claime lunch [12:56:41] (03PS1) 10Ladsgroup: mysql: Sleep one second before checking the position [cookbooks] - 
10https://gerrit.wikimedia.org/r/1004663 [12:57:56] (03PS3) 10Samtar: [BETA CLUSTER] enable $wgCodeMirrorV6 on simplewiki, hewiki and en-rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004204 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [12:58:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004204 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [12:58:58] (03Merged) 10jenkins-bot: [BETA CLUSTER] enable $wgCodeMirrorV6 on simplewiki, hewiki and en-rtl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004204 (https://phabricator.wikimedia.org/T357795) (owner: 10MusikAnimal) [12:59:12] (03CR) 10Ladsgroup: [C: 03+1] "It's green in icinga" [puppet] - 10https://gerrit.wikimedia.org/r/1003864 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:59:35] (03CR) 10Arnaudb: [C: 03+2] mariadb: toggle notifications on db2169 [puppet] - 10https://gerrit.wikimedia.org/r/1003864 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:59:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P57045 and previous config saved to /var/cache/conftool/dbconfig/20240219-125945-arnaudb.json [13:01:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 10%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57046 and previous config saved to /var/cache/conftool/dbconfig/20240219-130116-arnaudb.json [13:08:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-jobrunner: begin to scale down replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004662 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [13:09:47] (03CR) 10Marostegui: [C: 03+1] mysql: Sleep one second before checking the position [cookbooks] - 10https://gerrit.wikimedia.org/r/1004663 (owner: 10Ladsgroup) 
[13:10:42] (03CR) 10Ladsgroup: [C: 03+2] mysql: Sleep one second before checking the position [cookbooks] - 10https://gerrit.wikimedia.org/r/1004663 (owner: 10Ladsgroup) [13:12:36] (03PS1) 10Marostegui: mariadb: Move db2170 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/1004665 (https://phabricator.wikimedia.org/T354826) [13:12:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2170 T354826', diff saved to https://phabricator.wikimedia.org/P57047 and previous config saved to /var/cache/conftool/dbconfig/20240219-131245-root.json [13:12:51] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [13:14:28] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db2170 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/1004665 (https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [13:14:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P57048 and previous config saved to /var/cache/conftool/dbconfig/20240219-131452-arnaudb.json [13:14:57] (03Merged) 10jenkins-bot: mysql: Sleep one second before checking the position [cookbooks] - 10https://gerrit.wikimedia.org/r/1004663 (owner: 10Ladsgroup) [13:16:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db21170 multi-instance', diff saved to https://phabricator.wikimedia.org/P57049 and previous config saved to /var/cache/conftool/dbconfig/20240219-131609-marostegui.json [13:16:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 20%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57050 and previous config saved to /var/cache/conftool/dbconfig/20240219-131620-arnaudb.json [13:16:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on dbproxy1022.eqiad.wmnet with reason: Silence for reboot T356240 [13:17:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
dbproxy1022.eqiad.wmnet with reason: Silence for reboot T356240 [13:17:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2170 depooled', diff saved to https://phabricator.wikimedia.org/P57051 and previous config saved to /var/cache/conftool/dbconfig/20240219-131729-marostegui.json [13:21:17] (03PS1) 10Marostegui: db-production.php: Disable writes on es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004668 (https://phabricator.wikimedia.org/T356372) [13:22:27] (03CR) 10Marostegui: [C: 04-2] "Tuesday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004668 (https://phabricator.wikimedia.org/T356372) (owner: 10Marostegui) [13:22:29] (03PS1) 10Marostegui: mariadb: Promote es2020 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/1004669 (https://phabricator.wikimedia.org/T356372) [13:23:38] (03PS1) 10Marostegui: wmnet: Promote es2020 to es4 master [dns] - 10https://gerrit.wikimedia.org/r/1004670 (https://phabricator.wikimedia.org/T356372) [13:23:59] (03CR) 10Marostegui: [C: 04-2] "Tuesday" [puppet] - 10https://gerrit.wikimedia.org/r/1004669 (https://phabricator.wikimedia.org/T356372) (owner: 10Marostegui) [13:24:04] (03CR) 10Marostegui: [C: 04-2] "Tuesday" [dns] - 10https://gerrit.wikimedia.org/r/1004670 (https://phabricator.wikimedia.org/T356372) (owner: 10Marostegui) [13:26:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on dbproxy1025.eqiad.wmnet with reason: Silence for reboot T356240 [13:26:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on dbproxy1025.eqiad.wmnet with reason: Silence for reboot T356240 [13:28:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57052 and previous config saved to /var/cache/conftool/dbconfig/20240219-132858-root.json [13:29:03] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 
[13:29:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T357189)', diff saved to https://phabricator.wikimedia.org/P57053 and previous config saved to /var/cache/conftool/dbconfig/20240219-132958-arnaudb.json [13:30:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:30:08] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:30:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:30:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T357189)', diff saved to https://phabricator.wikimedia.org/P57054 and previous config saved to /var/cache/conftool/dbconfig/20240219-133019-arnaudb.json [13:30:58] !log installing runc security updates on buster [13:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 30%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57055 and previous config saved to /var/cache/conftool/dbconfig/20240219-133125-arnaudb.json [13:32:22] (03PS1) 10Marostegui: es1021: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1004671 [13:32:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1021', diff saved to https://phabricator.wikimedia.org/P57056 and previous config saved to /var/cache/conftool/dbconfig/20240219-133245-root.json [13:33:19] (03PS12) 10Ayounsi: Cookbook to renumber a host while changing its vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [13:33:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T357189)', diff saved to https://phabricator.wikimedia.org/P57057 and previous config saved to 
/var/cache/conftool/dbconfig/20240219-133339-arnaudb.json [13:33:48] (03CR) 10Marostegui: [C: 03+2] es1021: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1004671 (owner: 10Marostegui) [13:34:40] PROBLEM - ensure kvm processes are running on cloudvirt1032 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:35:38] PROBLEM - Host dbproxy1020 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:47] it's me, sorry, forgot to downtime it [13:35:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on dbproxy1020.eqiad.wmnet with reason: Silence for reboot T356240 [13:36:08] RECOVERY - Host dbproxy1020 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [13:36:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on dbproxy1020.eqiad.wmnet with reason: Silence for reboot T356240 [13:37:16] (03PS1) 10Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/1004672 [13:37:47] (03PS1) 10Marostegui: es4: Switchover eqiad master [puppet] - 10https://gerrit.wikimedia.org/r/1004673 (https://phabricator.wikimedia.org/T357904) [13:37:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on dbproxy1021.eqiad.wmnet with reason: Silence for reboot T356240 [13:38:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on dbproxy1021.eqiad.wmnet with reason: Silence for reboot T356240 [13:40:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: es4 switchover T357904 [13:40:34] T357904: Migrate es4 eqiad to MariaDB 10.6 - https://phabricator.wikimedia.org/T357904 [13:40:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: es4 switchover T357904 [13:41:14] (03CR) 10Marostegui: [C: 03+2] es4: Switchover eqiad master [puppet] - 10https://gerrit.wikimedia.org/r/1004673 (https://phabricator.wikimedia.org/T357904) (owner: 10Marostegui) [13:42:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Change weight of es1021', diff saved to https://phabricator.wikimedia.org/P57058 and previous config saved to /var/cache/conftool/dbconfig/20240219-134205-root.json [13:42:27] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote es2020 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/1004669 (https://phabricator.wikimedia.org/T356372) (owner: 10Marostegui) [13:42:42] (03CR) 10Ladsgroup: [C: 03+1] db-production.php: Disable writes on es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004668 (https://phabricator.wikimedia.org/T356372) (owner: 10Marostegui) [13:43:12] (03CR) 10Ayounsi: "Thanks !" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:43:24] !log Starting es4 eqiad failover from es1020 to es1021 - T357904 [13:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57059 and previous config saved to /var/cache/conftool/dbconfig/20240219-134402-root.json [13:44:08] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [13:45:14] (03PS1) 10Arnaudb: dbproxy: switch main servers on eqiad [dns] - 10https://gerrit.wikimedia.org/r/1003865 (https://phabricator.wikimedia.org/T356240) [13:45:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1021 to es4 primary ', diff saved to https://phabricator.wikimedia.org/P57060 and previous config saved to /var/cache/conftool/dbconfig/20240219-134551-root.json [13:46:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 40%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57061 and previous config saved to /var/cache/conftool/dbconfig/20240219-134630-arnaudb.json [13:46:32] (03PS2) 10Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/1004672 [13:47:50] (03PS1) 10Btullis: airflow: change max_active_runs_per_dag back to 3 [puppet] - 10https://gerrit.wikimedia.org/r/1004184 [13:48:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1020', diff saved to https://phabricator.wikimedia.org/P57062 and previous config saved to /var/cache/conftool/dbconfig/20240219-134804-root.json [13:48:20] (03CR) 10Arnaudb: [C: 03+1] wmnet: Promote es2020 to es4 master [dns] - 10https://gerrit.wikimedia.org/r/1004670 (https://phabricator.wikimedia.org/T356372) (owner: 10Marostegui) [13:48:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P57063 and previous config saved to /var/cache/conftool/dbconfig/20240219-134845-arnaudb.json [13:49:00] (03PS1) 10Marostegui: mariadb: Migrate es1020 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1004675 (https://phabricator.wikimedia.org/T357904) [13:50:10] (03CR) 10Marostegui: [C: 03+1] dbproxy: switch main servers on eqiad [dns] - 10https://gerrit.wikimedia.org/r/1003865 (https://phabricator.wikimedia.org/T356240) (owner: 10Arnaudb) [13:50:15] (03CR) 10Marostegui: [C: 03+2] mariadb: Migrate es1020 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1004675 (https://phabricator.wikimedia.org/T357904) (owner: 10Marostegui) [13:50:47] (03CR) 10Arnaudb: [C: 03+2] dbproxy: switch main servers on eqiad [dns] - 10https://gerrit.wikimedia.org/r/1003865 (https://phabricator.wikimedia.org/T356240) (owner: 10Arnaudb) [13:53:01] 10Puppet, 10Infrastructure-Foundations: os-reports: KeyError: 'apt2001.wikimedia.org' - https://phabricator.wikimedia.org/T357884#9555018 (10MoritzMuehlenhoff) Thanks! There is a pre-existing task,I'll merge that in. 
[13:53:23] 10Puppet, 10Infrastructure-Foundations: os-reports: KeyError: 'apt2001.wikimedia.org' - https://phabricator.wikimedia.org/T357884#9555020 (10MoritzMuehlenhoff) [13:54:27] (03PS3) 10Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter. [puppet] - 10https://gerrit.wikimedia.org/r/1004672 [13:55:36] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1391/co" [puppet] - 10https://gerrit.wikimedia.org/r/1004672 (owner: 10Slyngshede) [13:57:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57064 and previous config saved to /var/cache/conftool/dbconfig/20240219-135722-root.json [13:57:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on dbproxy1023.eqiad.wmnet with reason: Silence for reboot T356240 [13:57:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on dbproxy1023.eqiad.wmnet with reason: Silence for reboot T356240 [13:58:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on dbproxy1024.eqiad.wmnet with reason: Silence for reboot T356240 [13:58:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on dbproxy1024.eqiad.wmnet with reason: Silence for reboot T356240 [13:58:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on dbproxy1026.eqiad.wmnet with reason: Silence for reboot T356240 [13:58:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on dbproxy1026.eqiad.wmnet with reason: Silence for reboot T356240 [13:58:36] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:58:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on dbproxy1027.eqiad.wmnet with reason: Silence for reboot T356240 [13:58:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on dbproxy1027.eqiad.wmnet with reason: Silence for reboot T356240 [13:59:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57065 and previous config saved to /var/cache/conftool/dbconfig/20240219-135907-root.json [13:59:22] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T1400). [14:00:05] Gerges: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:01:21] (unable to deploy at the moment, sorry!( [14:01:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 50%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57066 and previous config saved to /var/cache/conftool/dbconfig/20240219-140135-arnaudb.json [14:03:07] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:03:36] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:03:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P57067 and previous config saved to /var/cache/conftool/dbconfig/20240219-140351-arnaudb.json [14:03:55] also unable to deploy rn [14:03:58] might be around later in the window [14:05:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T352010)', diff saved to https://phabricator.wikimedia.org/P57068 and previous config saved to /var/cache/conftool/dbconfig/20240219-140538-ladsgroup.json [14:05:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:08:47] * Lucas_WMDE around now [14:09:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] function-evaluator: Bump mesh dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003604 (owner: 10Alexandros Kosiaris) [14:10:19] (03Merged) 10jenkins-bot: function-evaluator: Bump mesh dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003604 (owner: 10Alexandros Kosiaris) [14:10:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "+1 to run CI" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1000292 (owner: 10GergesShamon) [14:12:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57069 and previous config saved to /var/cache/conftool/dbconfig/20240219-141227-root.json [14:12:39] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Figure out next steps for cergen in Puppet setup - https://phabricator.wikimedia.org/T357750#9555097 (10MoritzMuehlenhoff) [14:14:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57070 and previous config saved to /var/cache/conftool/dbconfig/20240219-141412-root.json [14:14:22] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [14:16:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 75%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57071 and previous config saved to /var/cache/conftool/dbconfig/20240219-141640-arnaudb.json [14:18:12] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:18:52] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:18:53] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:18:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T357189)', diff saved to https://phabricator.wikimedia.org/P57072 and previous config saved to /var/cache/conftool/dbconfig/20240219-141858-arnaudb.json [14:19:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [14:19:04] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:19:13] !log 
arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [14:19:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T357189)', diff saved to https://phabricator.wikimedia.org/P57073 and previous config saved to /var/cache/conftool/dbconfig/20240219-141919-arnaudb.json [14:19:52] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:19:55] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:20:14] (03PS1) 10Kamila Součková: mw-page-content-change-enrich: switch API endpoint to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004679 (https://phabricator.wikimedia.org/T357907) [14:20:25] (03CR) 10Cathal Mooney: [C: 03+2] Modify K8s BGP groups to only enable multihop on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1003619 (https://phabricator.wikimedia.org/T357619) (owner: 10Cathal Mooney) [14:20:28] 10SRE, 10Data-Engineering, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9555132 (10Clement_Goubert) [14:20:41] 10SRE, 10Data-Engineering, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9555135 (10Clement_Goubert) 05duplicate→03In progress [14:20:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P57074 and previous config saved to /var/cache/conftool/dbconfig/20240219-142044-ladsgroup.json [14:20:50] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9555136 (10Clement_Goubert) [14:20:50] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:21:05] (03Merged) 10jenkins-bot: Modify 
K8s BGP groups to only enable multihop on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1003619 (https://phabricator.wikimedia.org/T357619) (owner: 10Cathal Mooney) [14:21:23] (ftr, Gerges is online but having trouble joining this channel, see #wikimedia-releng) [14:21:31] (we’ll see if the deployment happens or not) [14:22:29] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations: CAS-based services (?) lose the session after an hour - https://phabricator.wikimedia.org/T268233#9555141 (10fgiunchedi) [14:22:33] (03CR) 10Brouberol: [C: 03+1] "I followed the conversation on IRC, :+1:" [puppet] - 10https://gerrit.wikimedia.org/r/1004184 (owner: 10Btullis) [14:22:41] (03PS2) 10Alexandros Kosiaris: deploy: Add mw-parsoid namespace stanzas [puppet] - 10https://gerrit.wikimedia.org/r/1004149 (https://phabricator.wikimedia.org/T357392) [14:22:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T357189)', diff saved to https://phabricator.wikimedia.org/P57075 and previous config saved to /var/cache/conftool/dbconfig/20240219-142238-arnaudb.json [14:22:41] (03PS2) 10Alexandros Kosiaris: mw-parsoid: Have deployments happening [puppet] - 10https://gerrit.wikimedia.org/r/1004150 (https://phabricator.wikimedia.org/T357392) [14:22:42] (03PS2) 10Alexandros Kosiaris: conftool: Add mw-parsoid stanzas [puppet] - 10https://gerrit.wikimedia.org/r/1004151 (https://phabricator.wikimedia.org/T357392) [14:22:45] (03PS3) 10Alexandros Kosiaris: service::catalog: Add mw-parsoid service [puppet] - 10https://gerrit.wikimedia.org/r/1004152 (https://phabricator.wikimedia.org/T357392) [14:22:46] (03PS3) 10Alexandros Kosiaris: mw-parsoid: Add LVS backends on wikikube servers [puppet] - 10https://gerrit.wikimedia.org/r/1004153 (https://phabricator.wikimedia.org/T357392) [14:22:48] (03PS3) 10Alexandros Kosiaris: mw-parsoid: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1004154 (https://phabricator.wikimedia.org/T357392) [14:22:50] (03PS3) 
10Alexandros Kosiaris: mw-parsoid: Switch to production and have it page [puppet] - 10https://gerrit.wikimedia.org/r/1004155 (https://phabricator.wikimedia.org/T357392) [14:23:09] (03PS2) 10Alexandros Kosiaris: mw-parsoid: Introduce it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004157 (https://phabricator.wikimedia.org/T357392) [14:23:57] (03PS1) 10Filippo Giunchedi: grafana: provision thanos-downsample datasource [puppet] - 10https://gerrit.wikimedia.org/r/1004680 [14:24:42] jouncebot: nowandnext [14:24:43] For the next 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T1400) [14:24:43] In 2 hour(s) and 5 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T1630) [14:25:10] (03PS3) 10Reedy: Fix casing of Mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993738 [14:25:12] (03Abandoned) 10Kamila Součková: mw-page-content-change-enrich: switch API endpoint to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004679 (https://phabricator.wikimedia.org/T357907) (owner: 10Kamila Součková) [14:25:14] (03CR) 10Reedy: [C: 03+2] Fix casing of Mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993738 (owner: 10Reedy) [14:26:26] (03Merged) 10jenkins-bot: Fix casing of Mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993738 (owner: 10Reedy) [14:27:01] 10SRE, 10Data-Engineering, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9555154 (10Clement_Goubert) We're all set for this, according to [[ https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment#Upgr... 
[14:27:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57076 and previous config saved to /var/cache/conftool/dbconfig/20240219-142732-root.json [14:28:33] !log reedy@deploy2002 Started scap: Fix casing of MediaWiki [14:29:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57077 and previous config saved to /var/cache/conftool/dbconfig/20240219-142917-root.json [14:29:23] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [14:31:37] (03PS3) 10Slyngshede: Puppetmaster: Alert when unmerged changes exists in Puppet repo. [alerts] - 10https://gerrit.wikimedia.org/r/1003761 (https://phabricator.wikimedia.org/T350694) [14:31:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 100%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57078 and previous config saved to /var/cache/conftool/dbconfig/20240219-143145-arnaudb.json [14:31:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 10%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57079 and previous config saved to /var/cache/conftool/dbconfig/20240219-143150-arnaudb.json [14:32:42] (03CR) 10Slyngshede: Puppetmaster: Alert when unmerged changes exists in Puppet repo. 
(032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1003761 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:34:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:35:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P57080 and previous config saved to /var/cache/conftool/dbconfig/20240219-143550-ladsgroup.json [14:36:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:37:45] !log reedy@deploy2002 Finished scap: Fix casing of MediaWiki (duration: 09m 11s) [14:37:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P57081 and previous config saved to /var/cache/conftool/dbconfig/20240219-143744-arnaudb.json [14:37:48] <_Gerges> Hi [14:38:36] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:12] (03CR) 10Btullis: [C: 03+2] airflow: change max_active_runs_per_dag back to 3 [puppet] - 10https://gerrit.wikimedia.org/r/1004184 (owner: 10Btullis) [14:40:30] <_Gerges> Lucas_WMDE: I joined [14:40:51] hi! [14:41:03] Reedy: are you done deploying? [14:41:12] Yeah :) [14:41:21] <_Gerges> Yes [14:41:43] ok, then I’ll do the backport window now [14:41:58] _Gerges: is this your first time in a deployment window? 
[14:42:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57082 and previous config saved to /var/cache/conftool/dbconfig/20240219-144237-root.json [14:42:38] <_Gerges> Yes [14:42:41] (03CR) 10Slyngshede: P:debmonitor::server Add CDN endpoint check. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003409 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:42:51] ok, I’ll try to explain the process as we go along :) [14:44:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57083 and previous config saved to /var/cache/conftool/dbconfig/20240219-144422-root.json [14:44:28] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [14:44:38] so, I’ve reviewed the change at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1000292, and it looks good to me [14:44:50] (especially the diffConfig output from the main test build: https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/6114/console) [14:44:56] (03PS3) 10Lucas Werkmeister (WMDE): Increase move rate limit for extendedmovers in arwiki to 16/60 T357229 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000292 (owner: 10GergesShamon) [14:45:15] oh, hang on, one more thing [14:45:32] the bug should be linked properly in the commit message, as a separate line “Bug: T357229” [14:45:33] T357229: Increase move rate limit for extendedmovers in arwiki - https://phabricator.wikimedia.org/T357229 [14:45:34] right above the Change-Id [14:45:46] _Gerges: can you update the commit message? 
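(Editor's sketch of the commit-message convention described just above: the task is linked with a separate "Bug: T…" line placed directly above the Change-Id, in the last paragraph of the message. This is a minimal illustration only, not a tool used in the channel. The helper names and the Change-Id value are hypothetical; the subject lines are taken from patch sets 3 and 4 of change 1000292 as they appear in this log.)

```python
import re

# Footer line shapes: "Bug: T<number>" and a Gerrit Change-Id ("I" + 40 hex digits).
BUG_LINE = re.compile(r"^Bug: (T\d+)$")
CHANGE_ID_LINE = re.compile(r"^Change-Id: I[0-9a-f]{40}$")

# Hypothetical Change-Id, used only so the examples have a well-formed footer.
HYPOTHETICAL_CHANGE_ID = "I" + "0123456789abcdef0123456789abcdef01234567"


def footer_lines(message):
    """Return the last paragraph of a commit message (its footer block)."""
    footer = []
    for line in reversed(message.strip().split("\n")):
        if not line.strip():
            break  # a blank line ends the footer block
        footer.insert(0, line)
    return footer


def linked_bugs(message):
    """Return task IDs from 'Bug:' footer lines that sit above the Change-Id."""
    footer = footer_lines(message)
    change_id_at = next(
        (i for i, line in enumerate(footer) if CHANGE_ID_LINE.match(line)), None
    )
    if change_id_at is None:
        return []
    bugs = []
    for line in footer[:change_id_at]:
        match = BUG_LINE.match(line)
        if match:
            bugs.append(match.group(1))
    return bugs


# Task ID only mentioned in the subject line: no "Bug:" footer is found.
before = (
    "Increase move rate limit for extendedmovers in arwiki to 16/60 T357229\n"
    "\n"
    f"Change-Id: {HYPOTHETICAL_CHANGE_ID}\n"
)

# Task ID moved to its own "Bug:" line right above the Change-Id.
after = (
    "Increase move rate limit for extendedmovers in arwiki to 16/60\n"
    "\n"
    "Bug: T357229\n"
    f"Change-Id: {HYPOTHETICAL_CHANGE_ID}\n"
)

print(linked_bugs(before))
print(linked_bugs(after))
```

(The check is deliberately strict about the footer living in the final paragraph, since that is where the log says the line belongs; a real hook or bot may accept more variations.)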
[14:45:57] (it should be possible to do that directly in Gerrit) [14:46:05] (or you can amend the commit locally and push it) [14:46:44] <_Gerges> I don't speak English well, so I'll wait to translate your message :) [14:46:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 20%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57084 and previous config saved to /var/cache/conftool/dbconfig/20240219-144655-arnaudb.json [14:47:16] ok :) [14:48:03] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [14:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:49:31] _Gerges: here is an example of how the “Bug:” line should look: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1003760 [14:49:43] (the rest of your commit message is okay) [14:49:55] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 01m 51s) [14:50:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T352010)', diff saved to https://phabricator.wikimedia.org/P57085 and previous config saved to /var/cache/conftool/dbconfig/20240219-145057-ladsgroup.json [14:51:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [14:51:02] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:51:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [14:51:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T352010)', diff saved to https://phabricator.wikimedia.org/P57086 and 
previous config saved to /var/cache/conftool/dbconfig/20240219-145119-ladsgroup.json [14:51:41] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [14:51:59] <_Gerges> Lucas_WMDE: What do I do [14:52:37] I can also edit the commit message, if that’s okay for you [14:52:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P57087 and previous config saved to /var/cache/conftool/dbconfig/20240219-145251-arnaudb.json [14:53:06] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 01m 24s) [14:53:31] <_Gerges> Ok [14:53:56] (03PS4) 10Lucas Werkmeister (WMDE): Increase move rate limit for extendedmovers in arwiki to 16/60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000292 (https://phabricator.wikimedia.org/T357229) (owner: 10GergesShamon) [14:54:01] (03CR) 10MVernon: "Sorry, probably a stupid question, but: where do the numbers in this change come from?" [alerts] - 10https://gerrit.wikimedia.org/r/1004619 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:54:03] (03CR) 10Gergő Tisza: "I scheduled it for the Tue 17:00 deploy window: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1700" [puppet] - 10https://gerrit.wikimedia.org/r/1003890 (owner: 10Gergő Tisza) [14:54:08] ok, done [14:54:14] and now you can see that it also gets linked on Phabricator: https://phabricator.wikimedia.org/T357229#9555211 [14:56:59] !log lucaswerkmeister-wmde@deploy2002 Backport cancelled. 
[14:57:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000292 (https://phabricator.wikimedia.org/T357229) (owner: 10GergesShamon) [14:57:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57088 and previous config saved to /var/cache/conftool/dbconfig/20240219-145742-root.json [14:57:59] (03Merged) 10jenkins-bot: Increase move rate limit for extendedmovers in arwiki to 16/60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000292 (https://phabricator.wikimedia.org/T357229) (owner: 10GergesShamon) [14:58:13] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1000292|Increase move rate limit for extendedmovers in arwiki to 16/60 (T357229)]] [14:58:23] _Gerges: the change was just merged [14:58:35] now it will be deployed to a set of test servers [14:58:36] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:04] there is a browser extension you can use so that your request reach these test servers instead of the normal ones https://wikitech.wikimedia.org/wiki/WikimediaDebug [14:59:13] so normally I would soon ask you to test your changes on the test servers, using this extension [14:59:36] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and gergesshamon: Backport for [[gerrit:1000292|Increase move rate limit for extendedmovers in arwiki to 16/60 (T357229)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:59:38] but for a rate limit change, that probably doesn’t make much sense – it would be difficult to test that change [14:59:49] ^ there’s the notification 
that the change is ready for testing now [15:00:03] (03CR) 10Btullis: [C: 03+1] "Looks great!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003747 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [15:01:10] (03PS2) 10Esanders: Launch the Visual Editor edit check a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004351 (https://phabricator.wikimedia.org/T342930) (owner: 10DLynch) [15:01:15] (03CR) 10Slyngshede: "Seems like a fairly reasonable question. The percentages are just lifted from the old Icinga alert." [alerts] - 10https://gerrit.wikimedia.org/r/1004619 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:01:25] welp, they’re gone… [15:02:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 30%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57089 and previous config saved to /var/cache/conftool/dbconfig/20240219-150200-arnaudb.json [15:03:13] Gerges: do you think you can test the change before I deploy it everywhere? [15:03:25] otherwise I’ll just deploy it [15:04:21] When I reinstall WikimediaDebug on chrome, it doesn't install [15:04:34] (03CR) 10Brouberol: [C: 03+2] superset: enable OIDC login [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003747 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [15:04:40] (03PS9) 10Brouberol: superset: enable OIDC login [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003747 (https://phabricator.wikimedia.org/T353794) [15:04:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1213', diff saved to https://phabricator.wikimedia.org/P57090 and previous config saved to /var/cache/conftool/dbconfig/20240219-150451-root.json [15:04:52] hm, strange [15:04:57] is there any error message? 
[15:06:02] (03PS1) 10Marostegui: db1213: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004730 [15:06:18] No error appears, wait I will upload as a zip file [15:07:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1213.eqiad.wmnet with OS bookworm [15:07:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T357189)', diff saved to https://phabricator.wikimedia.org/P57091 and previous config saved to /var/cache/conftool/dbconfig/20240219-150757-arnaudb.json [15:07:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [15:08:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [15:08:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T357189)', diff saved to https://phabricator.wikimedia.org/P57092 and previous config saved to /var/cache/conftool/dbconfig/20240219-150819-arnaudb.json [15:08:36] (03CR) 10Brouberol: [V: 03+2 C: 03+2] superset: enable OIDC login [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003747 (https://phabricator.wikimedia.org/T353794) (owner: 10Brouberol) [15:09:34] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [15:09:36] Lucas_WMDE: You installed [15:09:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [15:09:48] what do you mean? [15:10:06] I installed WikimediaDebug [15:10:31] okay, great! [15:10:44] normally, you could now enable it and test your change that way [15:10:52] but I don’t think this actually works for a rate limit change [15:10:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [15:11:27] is there anything you want to test? 
[15:11:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T357189)', diff saved to https://phabricator.wikimedia.org/P57093 and previous config saved to /var/cache/conftool/dbconfig/20240219-151127-arnaudb.json [15:11:30] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 01m 55s) [15:11:32] otherwise I will just deploy the change, I think [15:11:41] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:11:58] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [15:12:18] The change is correct, you confirmed that the limit was increased via the API [15:12:27] https://ar.wikipedia.org/w/api.php?action=query&meta=userinfo&uiprop=ratelimits [15:12:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57094 and previous config saved to /var/cache/conftool/dbconfig/20240219-151246-root.json [15:13:06] I see [15:13:14] hang on, I’m not logged in on ar.wikipedia.org so I don’t see anything there [15:13:26] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 01m 28s) [15:13:41] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [15:14:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [15:14:12] You check the changes, deploy the changes [15:14:30] * I Check [15:14:35] I see move ratelimits for “user” and “newbie” in https://ar.wikipedia.org/w/api.php?action=query&meta=userinfo&uiprop=ratelimits [15:14:41] probably because I’m not an extendedmover [15:14:48] do you also see an extendedmover section there? 
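(Editor's sketch of the check discussed above: querying meta=userinfo with uiprop=ratelimits and looking for an "extendedmover" entry under the "move" action. It makes no network request. The response layout and the "user" and "newbie" numbers in SAMPLE_RESPONSE are assumptions for illustration; only the 16/60 figure for extendedmovers comes from the change title in this log. The function names are hypothetical, and the live API reports limits for whichever account is making the request.)

```python
from urllib.parse import urlencode


def ratelimits_url(host):
    """Build the userinfo rate-limit query URL quoted in the log, asking for JSON."""
    params = {
        "action": "query",
        "meta": "userinfo",
        "uiprop": "ratelimits",
        "format": "json",
    }
    return f"https://{host}/w/api.php?" + urlencode(params)


def move_limits(response):
    """Return the per-group limits for the 'move' action, or {} if absent."""
    return (
        response.get("query", {})
        .get("userinfo", {})
        .get("ratelimits", {})
        .get("move", {})
    )


def describe(limit):
    """Format one limit as 'hits/seconds', the notation used in the change title."""
    return f"{limit['hits']}/{limit['seconds']}"


# Assumed response shape with illustrative values; not copied from the live wiki.
SAMPLE_RESPONSE = {
    "query": {
        "userinfo": {
            "ratelimits": {
                "move": {
                    "newbie": {"hits": 2, "seconds": 120},
                    "user": {"hits": 8, "seconds": 60},
                    "extendedmover": {"hits": 16, "seconds": 60},
                }
            }
        }
    }
}

limits = move_limits(SAMPLE_RESPONSE)
print(ratelimits_url("ar.wikipedia.org"))
print("extendedmover" in limits, describe(limits["extendedmover"]))
```

(To run the same check against mwdebug before the full sync, the request would additionally need to be routed through WikimediaDebug, which this sketch does not attempt.)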
[15:14:49] Yes [15:14:53] ok, that’s good enough for me :) [15:14:55] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and gergesshamon: Continuing with sync [15:15:07] ^ so now the change will get deployed in the next few minutes [15:15:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [15:15:36] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 01m 55s) [15:17:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 40%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57095 and previous config saved to /var/cache/conftool/dbconfig/20240219-151706-arnaudb.json [15:17:13] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [15:18:36] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Figure out next steps for cergen in Puppet setup - https://phabricator.wikimedia.org/T357750#9555284 (10MoritzMuehlenhoff) p:05Triage→03High [15:19:07] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 01m 54s) [15:19:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1213.eqiad.wmnet with reason: host reimage [15:20:35] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [15:22:05] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 01m 30s) [15:22:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1213.eqiad.wmnet with reason: host reimage [15:22:48] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1000292|Increase move rate limit for extendedmovers in arwiki to 16/60 (T357229)]] (duration: 24m 34s) [15:22:53] T357229: Increase move rate limit for 
extendedmovers in arwiki - https://phabricator.wikimedia.org/T357229 [15:22:53] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase2024 [15:23:02] Gerges: the change should now be deployed everywhere [15:23:13] you can disable WikimediaDebug and test it again if you want [15:23:17] otherwise we should be done here :) [15:23:46] (03PS3) 10Raymond Ndibe: [domainproxy]: increase client_max_body_size [puppet] - 10https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) [15:23:59] !log UTC afternoon backport+config window done [15:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:17] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase2024 (duration: 01m 24s) [15:24:28] (03CR) 10David Caro: [C: 03+2] [domainproxy]: increase client_max_body_size [puppet] - 10https://gerrit.wikimedia.org/r/998659 (https://phabricator.wikimedia.org/T351178) (owner: 10Raymond Ndibe) [15:25:07] I disabled WikimediaDebug, and all changes deployed [15:26:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P57096 and previous config saved to /var/cache/conftool/dbconfig/20240219-152634-arnaudb.json [15:28:35] (03CR) 10Marostegui: [C: 03+2] db1213: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004730 (owner: 10Marostegui) [15:28:52] Lucas_WMDE: Is there anything else, I will close my chat [15:29:03] Gerges: no, nothing else :) [15:29:42] Lucas_WMDE: Thanks [15:31:34] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase[1023:1025] [15:32:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 50%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57097 and previous config saved to 
/var/cache/conftool/dbconfig/20240219-153211-arnaudb.json [15:32:47] (03PS1) 10Marostegui: Revert "db1213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004707 [15:33:31] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase[1023:1025] (duration: 01m 57s) [15:33:50] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase1026 [15:35:46] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase1026 (duration: 01m 55s) [15:36:25] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase[2025:2028] [15:37:40] (03PS1) 10Arnaudb: mariadb: fix db2194 instances.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1004690 (https://phabricator.wikimedia.org/T343674) [15:37:53] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase[2025:2028] (duration: 01m 28s) [15:38:26] (03CR) 10Marostegui: [C: 03+1] mariadb: fix db2194 instances.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1004690 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [15:38:54] (03CR) 10Arnaudb: [C: 03+2] mariadb: fix db2194 instances.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1004690 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [15:39:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57098 and previous config saved to /var/cache/conftool/dbconfig/20240219-153938-root.json [15:39:46] (03CR) 10Marostegui: [C: 03+2] Revert "db1213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004707 (owner: 10Marostegui) [15:41:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1213.eqiad.wmnet with OS bookworm [15:41:48] (03PS1) 10Clément Goubert: 
api-gateway: Finish migration to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004735 (https://phabricator.wikimedia.org/T357907) [15:41:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T343674 - db2194 missing config', diff saved to https://phabricator.wikimedia.org/P57099 and previous config saved to /var/cache/conftool/dbconfig/20240219-154148-arnaudb.json [15:41:53] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9555409 (10Jelto) A test instance running etherpad-lite 1.9.7 is available on https://etherpad.wmcloud.org/ now. Creating a new pad works and the w... [15:41:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P57100 and previous config saved to /var/cache/conftool/dbconfig/20240219-154154-arnaudb.json [15:42:00] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 [15:42:57] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9555418 (10Jelto) [15:43:51] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#8190736 (10Jelto) [15:45:11] (03PS1) 10Arnaudb: mariadb: toggle notifications for db2194 [puppet] - 10https://gerrit.wikimedia.org/r/1004691 (https://phabricator.wikimedia.org/T343674) [15:47:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 75%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57101 and previous config saved to /var/cache/conftool/dbconfig/20240219-154716-arnaudb.json [15:47:27] (03CR) 10Marostegui: [C: 03+1] mariadb: toggle notifications for db2194 [puppet] - 10https://gerrit.wikimedia.org/r/1004691 
(https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [15:47:41] (03CR) 10Arnaudb: [C: 03+2] mariadb: toggle notifications for db2194 [puppet] - 10https://gerrit.wikimedia.org/r/1004691 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [15:48:09] (03PS2) 10Clément Goubert: api-gateway: Finish migration to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004735 (https://phabricator.wikimedia.org/T357907) [15:49:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 1%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57102 and previous config saved to /var/cache/conftool/dbconfig/20240219-154904-arnaudb.json [15:51:29] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase[1027:1030] [15:53:41] 10SRE, 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T357445#9555474 (10phaultfinder) [15:53:58] (03PS3) 10Samtar: IS: Enable edit recovery on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992203 (https://phabricator.wikimedia.org/T355548) [15:54:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57103 and previous config saved to /var/cache/conftool/dbconfig/20240219-155443-root.json [15:55:34] (03Abandoned) 10Samtar: IS: Enable edit recovery on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992203 (https://phabricator.wikimedia.org/T355548) (owner: 10Samtar) [15:55:40] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase[1027:1030] (duration: 04m 11s) [15:56:11] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase[2029:2032] [15:57:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T357189)', diff saved to 
https://phabricator.wikimedia.org/P57104 and previous config saved to /var/cache/conftool/dbconfig/20240219-155702-arnaudb.json [15:57:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:57:08] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:57:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:59:07] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: Disable parsoid storage on restbase[2029:2032] (duration: 02m 56s) [15:59:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [15:59:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [15:59:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T357189)', diff saved to https://phabricator.wikimedia.org/P57105 and previous config saved to /var/cache/conftool/dbconfig/20240219-155936-arnaudb.json [16:00:37] (03PS1) 10Samtar: InitialiseSettings: Enable Edit Recovery on mw and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548) [16:02:02] (03PS1) 10Marostegui: db1213: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1004738 [16:02:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 100%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57106 and previous config saved to /var/cache/conftool/dbconfig/20240219-160221-arnaudb.json [16:02:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T357189)', diff saved to https://phabricator.wikimedia.org/P57107 and previous config saved to 
/var/cache/conftool/dbconfig/20240219-160249-arnaudb.json [16:03:05] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:04:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 2%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57108 and previous config saved to /var/cache/conftool/dbconfig/20240219-160409-arnaudb.json [16:04:11] (03CR) 10Marostegui: [C: 03+2] db1213: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1004738 (owner: 10Marostegui) [16:04:26] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [16:04:46] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 00m 23s) [16:07:27] (03PS2) 10Samtar: InitialiseSettings: Enable Edit Recovery on 4 projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548) [16:07:31] (03PS3) 10Alexandros Kosiaris: mw-parsoid: Introduce it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004157 (https://phabricator.wikimedia.org/T357392) [16:07:32] (03PS1) 10Alexandros Kosiaris: admin_ng: mw-parsoid stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004739 (https://phabricator.wikimedia.org/T357392) [16:08:32] 10SRE, 10Bitu, 10Infrastructure-Foundations: Create an IDM for Wikimedia developer accounts - https://phabricator.wikimedia.org/T319405#9555522 (10SLyngshede-WMF) p:05Triage→03Low [16:08:53] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9555525 (10cmooney) p:05Triage→03Low [16:09:08] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9555526 
(10cmooney) p:05Triage→03Low [16:09:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57109 and previous config saved to /var/cache/conftool/dbconfig/20240219-160948-root.json [16:10:04] (03CR) 10MVernon: [C: 04-1] "Hi," [alerts] - 10https://gerrit.wikimedia.org/r/1004619 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [16:12:42] 10sre-alert-triage, 10Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389#9555538 (10ayounsi) p:05Triage→03Low a:03ayounsi [16:14:31] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [16:14:43] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 00m 08s) [16:16:59] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: Deploy latest restbase config in all nodes [16:17:03] !log jgiannelos@deploy2002 deploy aborted: Deploy latest restbase config in all nodes (duration: 00m 04s) [16:17:51] (03PS1) 10Hnowlan: kubernetes: migrate 5 appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1004740 (https://phabricator.wikimedia.org/T351074) [16:17:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P57110 and previous config saved to /var/cache/conftool/dbconfig/20240219-161756-arnaudb.json [16:18:54] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [16:19:01] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 00m 07s) [16:19:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 4%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57111 and previous config saved to 
/var/cache/conftool/dbconfig/20240219-161914-arnaudb.json [16:20:26] (03CR) 10Hnowlan: [C: 03+1] api-gateway: Finish migration to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004735 (https://phabricator.wikimedia.org/T357907) (owner: 10Clément Goubert) [16:21:26] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@e5ed8d0]: (no justification provided) [16:21:31] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@e5ed8d0]: (no justification provided) (duration: 00m 04s) [16:24:26] 10SRE-tools, 10Infrastructure-Foundations: Decommission cookbook: lock per switch - https://phabricator.wikimedia.org/T353513#9555609 (10ayounsi) p:05Triage→03Medium [16:24:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57112 and previous config saved to /var/cache/conftool/dbconfig/20240219-162453-root.json [16:26:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Thank you" [alerts] - 10https://gerrit.wikimedia.org/r/1003761 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [16:28:01] (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: begin to scale down replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004662 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:28:15] (03PS3) 10DLynch: Launch the Visual Editor edit check a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004351 (https://phabricator.wikimedia.org/T342930) [16:28:54] (03Merged) 10jenkins-bot: mw-jobrunner: begin to scale down replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004662 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:29:11] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1004680 (owner: 10Filippo Giunchedi) [16:29:41] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [16:29:47] !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [16:29:50] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@7e5e720]: Disable parsoid storage on all nodes [16:29:58] !log jgiannelos@deploy2002 deploy aborted: Disable parsoid storage on all nodes (duration: 00m 08s) [16:30:06] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@7e5e720]: Disable parsoid storage on all nodes [16:30:06] !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [16:30:06] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T1630). [16:30:11] !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [16:30:11] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@7e5e720]: Disable parsoid storage on all nodes (duration: 00m 07s) [16:30:38] (03PS1) 10DLynch: EditAttemptStep: log buckets for the edit check test [extensions/WikimediaEvents] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1004708 (https://phabricator.wikimedia.org/T342930) [16:31:11] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@7e5e720]: Disable parsoid storage on all nodes [16:31:12] (03PS1) 10DLynch: Enrollment for the edit check a/b test [extensions/VisualEditor] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1004709 (https://phabricator.wikimedia.org/T342930) [16:33:48] !log jgiannelos@deploy2002 deploy aborted: Disable parsoid storage on all nodes (duration: 01m 57s) [16:33:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P57113 and previous config saved to 
/var/cache/conftool/dbconfig/20240219-163303-arnaudb.json [16:33:54] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@7e5e720]: Disable parsoid storage on restbase[2033:2035] [16:34:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 8%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57114 and previous config saved to /var/cache/conftool/dbconfig/20240219-163419-arnaudb.json [16:35:06] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@7e5e720]: Disable parsoid storage on restbase[2033:2035] (duration: 01m 19s) [16:36:23] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@7e5e720]: Disable parsoid storage on restbase[1031:1033] [16:44:16] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@7e5e720]: Disable parsoid storage on restbase[1031:1033] (duration: 01m 55s) [16:44:19] (03CR) 10Volans: "the idea LGTM, possible improvement inline. See PCC:" [puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [16:44:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57115 and previous config saved to /var/cache/conftool/dbconfig/20240219-163958-root.json [16:48:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T357189)', diff saved to https://phabricator.wikimedia.org/P57116 and previous config saved to /var/cache/conftool/dbconfig/20240219-164809-arnaudb.json [16:48:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [16:48:14] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:48:16] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-api-int (k8s) 1.118s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:48:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [16:49:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 10%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57117 and previous config saved to /var/cache/conftool/dbconfig/20240219-164924-arnaudb.json [16:50:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [16:50:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [16:50:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2114 (T357189)', diff saved to https://phabricator.wikimedia.org/P57118 and previous config saved to /var/cache/conftool/dbconfig/20240219-165032-arnaudb.json [16:53:16] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-api-int (k8s) 1.175s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:54:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T357189)', diff saved to https://phabricator.wikimedia.org/P57119 and previous config saved to /var/cache/conftool/dbconfig/20240219-165400-arnaudb.json [16:54:05] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:55:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 
100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57120 and previous config saved to /var/cache/conftool/dbconfig/20240219-165503-root.json [17:01:54] (03PS1) 10Cathal Mooney: Enable BGP session status change logs on l3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/1004747 [17:04:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 20%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57121 and previous config saved to /var/cache/conftool/dbconfig/20240219-170428-arnaudb.json [17:09:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P57122 and previous config saved to /var/cache/conftool/dbconfig/20240219-170906-arnaudb.json [17:09:43] (03CR) 10Volans: "Approach LGTM, few minor nits inline, it's time for the tests :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [17:16:26] 10SRE, 10Infrastructure-Foundations, 10netops: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9555767 (10cmooney) p:05Triage→03Medium [17:16:32] 10SRE, 10Infrastructure-Foundations, 10netops: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9555777 (10cmooney) [17:16:40] 10SRE, 10Infrastructure-Foundations, 10netops: BGP peering from LSW to K8s hosts using loopback IP not IRB - https://phabricator.wikimedia.org/T357619#9555778 (10cmooney) [17:16:48] 10SRE, 10Infrastructure-Foundations, 10netops: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9555767 (10cmooney) [17:18:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-api-int (k8s) 1.157s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:19:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 30%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57123 and previous config saved to /var/cache/conftool/dbconfig/20240219-171933-arnaudb.json [17:21:19] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796#9555793 (10hnowlan) [17:21:27] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796#9352065 (10hnowlan) [17:23:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-api-int (k8s) 1.093s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:23:59] 10SRE, 10Infrastructure-Foundations, 10netops: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9555808 (10cmooney) [17:24:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P57124 and previous config saved to /var/cache/conftool/dbconfig/20240219-172412-arnaudb.json [17:25:11] (03CR) 10Volans: "The approach LGTM, few minor details inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:30:12] (03PS4) 10DLynch: Launch the Visual Editor edit check a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004351 (https://phabricator.wikimedia.org/T342930) [17:30:14] (03PS1) 10DLynch: Default VE on mobile for other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004748 (https://phabricator.wikimedia.org/T352127) [17:33:22] (03PS1) 10Esanders: DiscussionTools: Remove no-op config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004749 [17:34:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2149 for maint', diff saved to https://phabricator.wikimedia.org/P57125 and previous config saved to /var/cache/conftool/dbconfig/20240219-173411-ladsgroup.json [17:34:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 40%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57126 and previous config saved to /var/cache/conftool/dbconfig/20240219-173438-arnaudb.json [17:38:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: recloning db2156 (T352010) [17:38:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: recloning db2156 (T352010) [17:38:45] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:39:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T357189)', diff saved to https://phabricator.wikimedia.org/P57127 and previous config saved to /var/cache/conftool/dbconfig/20240219-173919-arnaudb.json [17:39:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [17:39:24] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 
[17:39:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [17:39:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2117 (T357189)', diff saved to https://phabricator.wikimedia.org/P57128 and previous config saved to /var/cache/conftool/dbconfig/20240219-173941-arnaudb.json [17:41:54] (03CR) 10Clément Goubert: [C: 03+1] kubernetes: migrate 5 appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1004740 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [17:43:10] !log running `decommission` for mw2312.codfw.wmnet,mw2313.codfw.wmnet,mw2367.codfw.wmnet,mw2369.codfw.wmnet,mw2384.codfw.wmnet,mw2385.codfw.wmnet before reimaging to k8s workers [17:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T357189)', diff saved to https://phabricator.wikimedia.org/P57129 and previous config saved to /var/cache/conftool/dbconfig/20240219-174347-arnaudb.json [17:47:44] (03CR) 10Alexandros Kosiaris: mw-parsoid: Introduce it (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004157 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [17:49:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 50%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57130 and previous config saved to /var/cache/conftool/dbconfig/20240219-174943-arnaudb.json [17:56:04] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db2149.codfw.wmnet onto db2156.codfw.wmnet [17:58:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P57131 and previous config saved to /var/cache/conftool/dbconfig/20240219-175853-arnaudb.json [18:00:04] Deploy window MediaWiki infrastructure (UTC
late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T1800) [18:00:04] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T1800). [18:00:17] (03CR) 10Hnowlan: [C: 03+2] kubernetes: migrate 5 appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1004740 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [18:00:36] (03CR) 10Volans: "I did a first pass on the python file only" [puppet] - 10https://gerrit.wikimedia.org/r/1004672 (owner: 10Slyngshede) [18:00:45] (03CR) 10Hnowlan: "On second thought, I'll merge this tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/1004740 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [18:04:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 75%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57132 and previous config saved to /var/cache/conftool/dbconfig/20240219-180448-arnaudb.json [18:14:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P57133 and previous config saved to /var/cache/conftool/dbconfig/20240219-181359-arnaudb.json [18:19:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3317 (re)pooling @ 100%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57134 and previous config saved to /var/cache/conftool/dbconfig/20240219-181953-arnaudb.json [18:19:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 (re)pooling @ 1%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57135 and previous config saved to /var/cache/conftool/dbconfig/20240219-181958-arnaudb.json [18:21:07] (03PS1) 10Jaime Nuche: match `pip2` path used by common `run.sh` [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1004754 
(https://phabricator.wikimedia.org/T342346) [18:29:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T357189)', diff saved to https://phabricator.wikimedia.org/P57136 and previous config saved to /var/cache/conftool/dbconfig/20240219-182905-arnaudb.json [18:29:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [18:29:13] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:29:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [18:29:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T357189)', diff saved to https://phabricator.wikimedia.org/P57137 and previous config saved to /var/cache/conftool/dbconfig/20240219-182929-arnaudb.json [18:33:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T357189)', diff saved to https://phabricator.wikimedia.org/P57138 and previous config saved to /var/cache/conftool/dbconfig/20240219-183341-arnaudb.json [18:35:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 (re)pooling @ 2%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57139 and previous config saved to /var/cache/conftool/dbconfig/20240219-183503-arnaudb.json [18:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:48:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P57140 and previous config saved to /var/cache/conftool/dbconfig/20240219-184848-arnaudb.json [18:50:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 
(re)pooling @ 4%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57141 and previous config saved to /var/cache/conftool/dbconfig/20240219-185008-arnaudb.json [19:03:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P57142 and previous config saved to /var/cache/conftool/dbconfig/20240219-190354-arnaudb.json [19:05:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 (re)pooling @ 8%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57143 and previous config saved to /var/cache/conftool/dbconfig/20240219-190513-arnaudb.json [19:14:35] !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user="Yann" . # T357297 [19:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:50] T357297: Server-side upload request for Yann - https://phabricator.wikimedia.org/T357297 [19:19:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T357189)', diff saved to https://phabricator.wikimedia.org/P57144 and previous config saved to /var/cache/conftool/dbconfig/20240219-191901-arnaudb.json [19:19:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [19:19:07] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:19:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [19:19:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T357189)', diff saved to https://phabricator.wikimedia.org/P57145 and previous config saved to /var/cache/conftool/dbconfig/20240219-191923-arnaudb.json [19:20:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 (re)pooling @ 10%: Cloning to db2194 done', diff saved to 
https://phabricator.wikimedia.org/P57146 and previous config saved to /var/cache/conftool/dbconfig/20240219-192018-arnaudb.json [19:23:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T357189)', diff saved to https://phabricator.wikimedia.org/P57147 and previous config saved to /var/cache/conftool/dbconfig/20240219-192327-arnaudb.json [19:23:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2149.codfw.wmnet onto db2156.codfw.wmnet [19:25:16] (03CR) 10Kamila Součková: [C: 03+1] api-gateway: Finish migration to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004735 (https://phabricator.wikimedia.org/T357907) (owner: 10Clément Goubert) [19:28:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P57148 and previous config saved to /var/cache/conftool/dbconfig/20240219-192838-root.json [19:35:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 (re)pooling @ 20%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57149 and previous config saved to /var/cache/conftool/dbconfig/20240219-193522-arnaudb.json [19:38:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P57150 and previous config saved to /var/cache/conftool/dbconfig/20240219-193834-arnaudb.json [19:40:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T352010)', diff saved to https://phabricator.wikimedia.org/P57151 and previous config saved to /var/cache/conftool/dbconfig/20240219-194056-ladsgroup.json [19:41:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:41:41] !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --user="Yann" --overwrite . 
# T357218 [19:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:46] T357218: Server-side upload request for Yann - https://phabricator.wikimedia.org/T357218 [19:42:53] !log zabe@mwmaint2002:/tmp/uploads$ mwscript emptyUserGroup.php --wiki=testwiki reviewer # T356012 [19:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:58] T356012: Remove reviewer user group from testwiki - https://phabricator.wikimedia.org/T356012 [19:43:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P57152 and previous config saved to /var/cache/conftool/dbconfig/20240219-194343-root.json [19:50:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 (re)pooling @ 30%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57153 and previous config saved to /var/cache/conftool/dbconfig/20240219-195028-arnaudb.json [19:53:16] (03PS1) 10Zabe: Remove reviewer group from testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004768 (https://phabricator.wikimedia.org/T356012) [19:53:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P57154 and previous config saved to /var/cache/conftool/dbconfig/20240219-195341-arnaudb.json [19:53:54] (03PS2) 10Zabe: a-z ascending [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000294 (owner: 10GergesShamon) [19:54:18] (03Abandoned) 10Zabe: a-z ascending [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1000294 (owner: 10GergesShamon) [19:55:06] jouncebot: nowandnext [19:55:06] No deployments scheduled for the next 1 hour(s) and 4 minute(s) [19:55:06] In 1 hour(s) and 4 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T2100) [19:55:10] (03CR) 10Zabe: [C: 03+2] Remove reviewer group from testwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004768 (https://phabricator.wikimedia.org/T356012) (owner: 10Zabe) [19:55:55] (03Merged) 10jenkins-bot: Remove reviewer group from testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004768 (https://phabricator.wikimedia.org/T356012) (owner: 10Zabe) [19:56:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P57155 and previous config saved to /var/cache/conftool/dbconfig/20240219-195603-ladsgroup.json [19:56:16] !log zabe@deploy2002 Started scap: Backport for [[gerrit:1004768|Remove reviewer group from testwiki (T356012)]] [19:56:21] T356012: Remove reviewer user group from testwiki - https://phabricator.wikimedia.org/T356012 [19:57:40] !log zabe@deploy2002 zabe: Backport for [[gerrit:1004768|Remove reviewer group from testwiki (T356012)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:57:58] !log zabe@deploy2002 zabe: Continuing with sync [19:58:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P57156 and previous config saved to /var/cache/conftool/dbconfig/20240219-195848-root.json [20:05:32] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:1004768|Remove reviewer group from testwiki (T356012)]] (duration: 09m 16s) [20:05:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 (re)pooling @ 40%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57157 and previous config saved to /var/cache/conftool/dbconfig/20240219-200533-arnaudb.json [20:05:47] T356012: Remove reviewer user group from testwiki - https://phabricator.wikimedia.org/T356012 [20:08:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T357189)', diff saved to https://phabricator.wikimedia.org/P57158 and previous config saved to 
/var/cache/conftool/dbconfig/20240219-200847-arnaudb.json [20:08:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [20:09:03] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:09:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [20:09:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:09:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:09:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T357189)', diff saved to https://phabricator.wikimedia.org/P57159 and previous config saved to /var/cache/conftool/dbconfig/20240219-200914-arnaudb.json [20:11:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P57160 and previous config saved to /var/cache/conftool/dbconfig/20240219-201109-ladsgroup.json [20:13:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P57161 and previous config saved to /var/cache/conftool/dbconfig/20240219-201353-root.json [20:14:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T357189)', diff saved to https://phabricator.wikimedia.org/P57162 and previous config saved to /var/cache/conftool/dbconfig/20240219-201416-arnaudb.json [20:14:24] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:20:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 (re)pooling @ 50%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57163 and previous config 
saved to /var/cache/conftool/dbconfig/20240219-202037-arnaudb.json [20:26:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T352010)', diff saved to https://phabricator.wikimedia.org/P57164 and previous config saved to /var/cache/conftool/dbconfig/20240219-202615-ladsgroup.json [20:26:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [20:26:21] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:26:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [20:26:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T352010)', diff saved to https://phabricator.wikimedia.org/P57165 and previous config saved to /var/cache/conftool/dbconfig/20240219-202648-ladsgroup.json [20:29:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P57166 and previous config saved to /var/cache/conftool/dbconfig/20240219-202923-arnaudb.json [20:31:49] 10SRE, 10Infrastructure-Foundations, 10netops: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9556205 (10cmooney) > We will need to arrange a window to push this out to the devices with service ops. I will discuss with them, but I think the easiest way forward may be to make... 
[20:35:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 (re)pooling @ 75%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57167 and previous config saved to /var/cache/conftool/dbconfig/20240219-203542-arnaudb.json [20:44:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P57168 and previous config saved to /var/cache/conftool/dbconfig/20240219-204429-arnaudb.json [20:50:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2194:3316 (re)pooling @ 100%: Cloning to db2194 done', diff saved to https://phabricator.wikimedia.org/P57169 and previous config saved to /var/cache/conftool/dbconfig/20240219-205047-arnaudb.json [20:59:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T357189)', diff saved to https://phabricator.wikimedia.org/P57171 and previous config saved to /var/cache/conftool/dbconfig/20240219-205935-arnaudb.json [20:59:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [20:59:41] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:59:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T2100). [21:00:05] kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:52] o/ [21:01:50] I have two backports and two config patches. 
The backports just set things up for the config changes and won't do anything testable until those are also deployed. If you want to just merge the entire set and put them on a sync server, that'd probably be the easiest way for me to test it. [21:01:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [21:02:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [21:02:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2171:3316 (T357189)', diff saved to https://phabricator.wikimedia.org/P57172 and previous config saved to /var/cache/conftool/dbconfig/20240219-210228-arnaudb.json [21:06:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T357189)', diff saved to https://phabricator.wikimedia.org/P57173 and previous config saved to /var/cache/conftool/dbconfig/20240219-210635-arnaudb.json [21:06:48] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:07:11] I can deploy [21:07:26] 😍 [21:07:50] (03CR) 10Zabe: [C: 03+2] EditAttemptStep: log buckets for the edit check test [extensions/WikimediaEvents] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1004708 (https://phabricator.wikimedia.org/T342930) (owner: 10DLynch) [21:07:56] (03CR) 10Zabe: [C: 03+2] Enrollment for the edit check a/b test [extensions/VisualEditor] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1004709 (https://phabricator.wikimedia.org/T342930) (owner: 10DLynch) [21:08:50] (03CR) 10Zabe: [C: 03+2] Launch the Visual Editor edit check a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004351 (https://phabricator.wikimedia.org/T342930) (owner: 10DLynch) [21:08:52] (03CR) 10Zabe: [C: 03+2] Default VE on mobile for other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004748 
(https://phabricator.wikimedia.org/T352127) (owner: 10DLynch) [21:09:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1004708 (https://phabricator.wikimedia.org/T342930) (owner: 10DLynch) [21:09:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1004709 (https://phabricator.wikimedia.org/T342930) (owner: 10DLynch) [21:09:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004351 (https://phabricator.wikimedia.org/T342930) (owner: 10DLynch) [21:09:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004748 (https://phabricator.wikimedia.org/T352127) (owner: 10DLynch) [21:09:38] (03Merged) 10jenkins-bot: Launch the Visual Editor edit check a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004351 (https://phabricator.wikimedia.org/T342930) (owner: 10DLynch) [21:09:41] (03Merged) 10jenkins-bot: Default VE on mobile for other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004748 (https://phabricator.wikimedia.org/T352127) (owner: 10DLynch) [21:10:07] (03Merged) 10jenkins-bot: EditAttemptStep: log buckets for the edit check test [extensions/WikimediaEvents] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1004708 (https://phabricator.wikimedia.org/T342930) (owner: 10DLynch) [21:21:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P57174 and previous config saved to /var/cache/conftool/dbconfig/20240219-212141-arnaudb.json [21:24:47] (03Merged) 10jenkins-bot: Enrollment for the edit check a/b test [extensions/VisualEditor] 
(wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1004709 (https://phabricator.wikimedia.org/T342930) (owner: 10DLynch) [21:25:05] !log zabe@deploy2002 Started scap: Backport for [[gerrit:1004708|EditAttemptStep: log buckets for the edit check test (T342930)]], [[gerrit:1004709|Enrollment for the edit check a/b test (T342930)]], [[gerrit:1004351|Launch the Visual Editor edit check a/b test (T342930 T352127)]], [[gerrit:1004748|Default VE on mobile for other wikis (T352127)]] [21:25:11] T342930: [MILESTONE] Run an A/B test to evaluate Edit Check (references) impact - https://phabricator.wikimedia.org/T342930 [21:25:12] T352127: [Config] Enable the mobile visual editor by default for the initial set of wikis - https://phabricator.wikimedia.org/T352127 [21:26:26] !log zabe@deploy2002 kemayo and zabe: Backport for [[gerrit:1004708|EditAttemptStep: log buckets for the edit check test (T342930)]], [[gerrit:1004709|Enrollment for the edit check a/b test (T342930)]], [[gerrit:1004351|Launch the Visual Editor edit check a/b test (T342930 T352127)]], [[gerrit:1004748|Default VE on mobile for other wikis (T352127)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:27:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T352010)', diff saved to https://phabricator.wikimedia.org/P57175 and previous config saved to /var/cache/conftool/dbconfig/20240219-212720-ladsgroup.json [21:27:25] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:29:50] Kemayo: can you test? [21:30:23] zabe: Sure thing, it'll just take me a minute. [21:30:34] sure:) [21:34:47] zabe: Okay, it seems to be working! 
[21:35:03] cool, syncing [21:35:06] !log zabe@deploy2002 kemayo and zabe: Continuing with sync [21:36:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P57176 and previous config saved to /var/cache/conftool/dbconfig/20240219-213648-arnaudb.json [21:42:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P57177 and previous config saved to /var/cache/conftool/dbconfig/20240219-214227-ladsgroup.json [21:42:31] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:1004708|EditAttemptStep: log buckets for the edit check test (T342930)]], [[gerrit:1004709|Enrollment for the edit check a/b test (T342930)]], [[gerrit:1004351|Launch the Visual Editor edit check a/b test (T342930 T352127)]], [[gerrit:1004748|Default VE on mobile for other wikis (T352127)]] (duration: 17m 25s) [21:42:37] T342930: [MILESTONE] Run an A/B test to evaluate Edit Check (references) impact - https://phabricator.wikimedia.org/T342930 [21:42:37] T352127: [Config] Enable the mobile visual editor by default for the initial set of wikis - https://phabricator.wikimedia.org/T352127 [21:43:47] Kemayo: should be live [21:44:20] zabe: It does indeed seem to be. 
[21:51:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T357189)', diff saved to https://phabricator.wikimedia.org/P57178 and previous config saved to /var/cache/conftool/dbconfig/20240219-215155-arnaudb.json [21:51:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [21:52:01] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:52:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [21:52:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T357189)', diff saved to https://phabricator.wikimedia.org/P57179 and previous config saved to /var/cache/conftool/dbconfig/20240219-215217-arnaudb.json [21:55:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T357189)', diff saved to https://phabricator.wikimedia.org/P57180 and previous config saved to /var/cache/conftool/dbconfig/20240219-215534-arnaudb.json [21:57:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P57181 and previous config saved to /var/cache/conftool/dbconfig/20240219-215733-ladsgroup.json [22:00:04] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240219T2200). 
[22:07:19] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:07:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:10:07] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:10:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P57182 and previous config saved to /var/cache/conftool/dbconfig/20240219-221041-arnaudb.json [22:12:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T352010)', diff saved to https://phabricator.wikimedia.org/P57183 and previous config saved to /var/cache/conftool/dbconfig/20240219-221239-ladsgroup.json [22:12:45] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:13:01] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:13:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:13:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51451 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:24:08] (03PS1) 10Tim Starling: Set $wgLoginNotifyUseCheckUser = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004788 (https://phabricator.wikimedia.org/T346989) [22:25:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P57184 and previous config saved to /var/cache/conftool/dbconfig/20240219-222547-arnaudb.json [22:35:22] (03CR) 10Dreamy Jazz: [C: 03+1] "LGTM 90 days from 22 November 2023 (the date of merge) is tomorrow, so should be fine to merge at any point tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004788 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [22:40:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T357189)', diff saved to https://phabricator.wikimedia.org/P57185 and previous config saved to /var/cache/conftool/dbconfig/20240219-224054-arnaudb.json [22:40:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [22:40:59] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [22:41:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [22:41:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T357189)', diff saved to https://phabricator.wikimedia.org/P57186 and previous config saved to 
/var/cache/conftool/dbconfig/20240219-224117-arnaudb.json [22:46:20] (03CR) 10Ladsgroup: [C: 03+1] Set $wgLoginNotifyUseCheckUser = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004788 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [22:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:57:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T357189)', diff saved to https://phabricator.wikimedia.org/P57187 and previous config saved to /var/cache/conftool/dbconfig/20240219-225732-arnaudb.json [22:57:37] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [23:12:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P57188 and previous config saved to /var/cache/conftool/dbconfig/20240219-231238-arnaudb.json [23:17:14] (03CR) 10Tim Starling: "Today in my timezone. Deployment of the previous stage was at 2023-11-22 00:55 UTC. 
Add 90 days and you get 2024-02-20 00:55 UTC which is " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004788 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [23:27:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P57189 and previous config saved to /var/cache/conftool/dbconfig/20240219-232745-arnaudb.json [23:42:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T357189)', diff saved to https://phabricator.wikimedia.org/P57190 and previous config saved to /var/cache/conftool/dbconfig/20240219-234251-arnaudb.json [23:42:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance [23:42:57] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [23:43:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance [23:51:57] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:52:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:54:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.351 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:55:13] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51452 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring