[00:00:05] RoanKattouw and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220111T0000).
[00:00:05] nray: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:24] Hello o/
[00:00:35] Hey nray
[00:00:39] I can deploy today
[00:00:51] hey urbanecm . Thank you!
[00:01:10] (CR) Urbanecm: [C: +2] Fix TypeError: document.querySelectorAll(...).forEach is not a function [skins/Vector] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/752766 (https://phabricator.wikimedia.org/T298910) (owner: Nray)
[00:17:03] (CR) Cwhite: [C: +1] "LGTM thanks!" [puppet] - https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) (owner: Herron)
[00:17:36] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/752631 (owner: Muehlenhoff)
[00:18:42] (Merged) jenkins-bot: Fix TypeError: document.querySelectorAll(...).forEach is not a function [skins/Vector] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/752766 (https://phabricator.wikimedia.org/T298910) (owner: Nray)
[00:18:48] (CR) Cwhite: [C: +1] kafka-logging: move to fixed UID/GID for kafka user [puppet] - https://gerrit.wikimedia.org/r/752677 (https://phabricator.wikimedia.org/T298883) (owner: Herron)
[00:20:04] nray: can you test at mwdebug1001 please?
[00:20:14] yes testing now, thank you
[00:20:24] thanks
[00:21:44] (CR) Cwhite: [C: +1] "My promtool executable is also located elsewhere in PATH. I tested this locally and it worked great. Thanks!"
[alerts] - https://gerrit.wikimedia.org/r/752651 (owner: JMeybohm)
[00:22:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:37] @urbanecm things look good. You can proceed
[00:22:42] syncing
[00:23:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:29] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.16/skins/Vector/resources/skins.vector.js/dropdownMenus.js: 79b33f2: Fix TypeError: document.querySelectorAll(...).forEach is not a function (T298910) (duration: 00m 59s)
[00:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:31] T298910: TypeError: document.querySelectorAll(...).forEach is not a function - https://phabricator.wikimedia.org/T298910
[00:24:33] nray: and live
[00:24:35] anything else?
[00:24:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:43] that's it. thanks so much for your help!
[00:24:54] any time
[00:25:39] (CR) Cwhite: "This change also includes a role reassignment from kibana7_ecs to logging::opensearch::collector.
I propose we recreate role::kibana7_ecs" [puppet] - https://gerrit.wikimedia.org/r/752756 (https://phabricator.wikimedia.org/T288621) (owner: Cwhite)
[00:28:20] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 2 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (aaron) I like "mainstash". If there is ever vertical sharding by extension, then "stash" could be used as...
[00:50:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:16:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:41] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:45:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[01:47:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220111T0200)
[02:00:25] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:53] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:57] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.17 [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/752802
[02:06:59] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.17 [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/752802 (owner: TrainBranchBot)
[02:07:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:13] (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.17 [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/752802 (owner: TrainBranchBot)
[02:32:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:57] !log
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:33:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:58] (PS1) Andrew Bogott: All nfs-exportd to make public mounts actually public [puppet] - https://gerrit.wikimedia.org/r/752805 (https://phabricator.wikimedia.org/T293800)
[02:42:35] (PS1) Aaron Schulz: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129)
[02:45:53] (PS2) Andrew Bogott: cloudnfs: allow nfs-exportd to make public mounts actually public [puppet] - https://gerrit.wikimedia.org/r/752805 (https://phabricator.wikimedia.org/T293800)
[02:46:28] (CR) jerkins-bot: [V: -1] cloudnfs: allow nfs-exportd to make public mounts actually public [puppet] - https://gerrit.wikimedia.org/r/752805 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[02:58:58] (PS3) Andrew Bogott: cloudnfs: allow nfs-exportd to make public mounts actually public [puppet] - https://gerrit.wikimedia.org/r/752805 (https://phabricator.wikimedia.org/T293800)
[03:10:22] (PS4) Andrew Bogott: cloudnfs: allow nfs-exportd to make public mounts actually public [puppet] - https://gerrit.wikimedia.org/r/752805 (https://phabricator.wikimedia.org/T293800)
[03:13:17] (PS5) Andrew Bogott: cloudnfs: allow nfs-exportd to make public mounts actually public [puppet] - https://gerrit.wikimedia.org/r/752805 (https://phabricator.wikimedia.org/T293800)
[03:14:18] (CR) Andrew
Bogott: [C: +2] cloudnfs: allow nfs-exportd to make public mounts actually public [puppet] - https://gerrit.wikimedia.org/r/752805 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[03:15:31] (PS1) Andrew Bogott: nfs/add_server.py: one last puppet run after everthing is configured [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/752810 (https://phabricator.wikimedia.org/T293800)
[03:39:38] (CR) Andrew Bogott: [C: +2] nfs/add_server.py: one last puppet run after everthing is configured [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/752810 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[03:42:35] (Merged) jenkins-bot: nfs/add_server.py: one last puppet run after everthing is configured [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/752810 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[04:44:53] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:44:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:44:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[05:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[05:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:11]
!log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[05:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[05:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T297191)', diff saved to https://phabricator.wikimedia.org/P18503 and previous config saved to /var/cache/conftool/dbconfig/20220111-054417-marostegui.json
[05:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:20] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[05:44:54] (PS1) Marostegui: Revert "es2032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/752767
[05:46:01] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:46:12] (CR) Marostegui: [C: +2] Revert "es2032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/752767 (owner: Marostegui)
[05:49:57] (PS1) Marostegui: Revert "dbproxy1013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/752768
[05:51:22] (CR) Marostegui: [C: +2] Revert "dbproxy1013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/752768 (owner: Marostegui)
[05:55:32] (PS1) Marostegui: dbproxy1012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/752934 (https://phabricator.wikimedia.org/T298586)
[05:56:17] (CR) Marostegui: [C: +2] dbproxy1012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/752934 (https://phabricator.wikimedia.org/T298586) (owner: Marostegui)
[06:00:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1012.eqiad.wmnet with OS bullseye
[06:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:16] (PS1) Marostegui: drop_rev_page_id_T285149.py: Schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/752935 (https://phabricator.wikimedia.org/T285149)
[06:15:36] (CR) Ladsgroup: [C: +1] "I reviewed this before." [software/schema-changes] - https://gerrit.wikimedia.org/r/752935 (https://phabricator.wikimedia.org/T285149) (owner: Marostegui)
[06:17:01] (CR) Marostegui: [V: +2 C: +2] drop_rev_page_id_T285149.py: Schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/752935 (https://phabricator.wikimedia.org/T285149) (owner: Marostegui)
[06:18:33] Amir1: btw can I start the centralauth hidden_level migration script?
[06:18:53] taavi: good morning, sure
[06:19:09] just a screen session on mwmaint1002 is fine I guess?
[06:19:20] (PS1) Gergő Tisza: SECURITY: Fix several i18n XSS issues in suggested edits [extensions/GrowthExperiments] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/752769 (https://phabricator.wikimedia.org/T298504)
[06:19:32] depends on how long it would take but screen is better
[06:21:03] I honestly have no clue on how long it will take
[06:21:30] !log starting extensions/CentralAuth/maintenance/migrateHiddenLevel.php on a mwmaint1002 screen session - T289068
[06:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:33] T289068: Normalise centralauth.gu_hidden - https://phabricator.wikimedia.org/T289068
[06:23:42] it's going really fast, but not telling me how many rows it is affecting
[06:24:03] at least I don't see any replag on grafana
[06:26:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool es2032 after Bullseye reimage T295965', diff saved to https://phabricator.wikimedia.org/P18504 and previous config saved to /var/cache/conftool/dbconfig/20220111-062620-marostegui.json
[06:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:24] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965
[06:27:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18505 and previous config saved to /var/cache/conftool/dbconfig/20220111-062743-root.json
[06:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1012.eqiad.wmnet with OS bullseye
[06:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[06:30:21] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[06:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[06:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[06:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[06:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[06:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T297191)', diff saved to https://phabricator.wikimedia.org/P18506 and previous config saved to /var/cache/conftool/dbconfig/20220111-063052-marostegui.json
[06:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:55] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[06:32:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T297191)', diff saved to https://phabricator.wikimedia.org/P18507 and previous config saved to /var/cache/conftool/dbconfig/20220111-063207-marostegui.json
[06:32:10] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:59] (PS4) ArielGlenn: Add siteinfo data in formatversion=2 too [dumps] - https://gerrit.wikimedia.org/r/747987 (owner: Legoktm)
[06:33:45] (PS1) Gergő Tisza: Strip comments from indicators [extensions/PageImages] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/752770 (https://phabricator.wikimedia.org/T298930)
[06:34:23] (CR) Legoktm: [C: +1] "PS4 changes LGTM!" [dumps] - https://gerrit.wikimedia.org/r/747987 (owner: Legoktm)
[06:34:39] (CR) ArielGlenn: "Sorry about that, forgot to actually git add the file with the small changes.Done now." [dumps] - https://gerrit.wikimedia.org/r/747987 (owner: Legoktm)
[06:37:10] will do some backports
[06:41:46] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:42:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18508 and previous config saved to /var/cache/conftool/dbconfig/20220111-064247-root.json
[06:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:00] (CR) jerkins-bot: [V: -1] SECURITY: Fix several i18n XSS issues in suggested edits [extensions/GrowthExperiments] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/752769 (https://phabricator.wikimedia.org/T298504) (owner: Gergő Tisza)
[06:47:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P18509 and previous config saved to /var/cache/conftool/dbconfig/20220111-064712-marostegui.json
[06:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:25] (PS1) Marostegui: Revert "dbproxy1012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/752771
[06:50:25] (CR) Marostegui: [C: +2] Revert "dbproxy1012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/752771 (owner: Marostegui)
[06:50:47] !log upgrading mysql on ['db2114', 'db2117', 'db2124']
[06:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[06:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[06:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T296143)', diff saved to https://phabricator.wikimedia.org/P18510 and previous config saved to /var/cache/conftool/dbconfig/20220111-065118-ladsgroup.json
[06:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:21] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[06:51:22] !log ladsgroup@cumin1001 START - Cookbook sre.mysql.upgrade for db2114.codfw.wmnet
[06:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:07] I put the wrong ticket
[06:55:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2114.codfw.wmnet
[06:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T296143)', diff saved to https://phabricator.wikimedia.org/P18511 and previous config saved to /var/cache/conftool/dbconfig/20220111-065640-ladsgroup.json
[06:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:43] T296143: Optimize commonswiki image table -
https://phabricator.wikimedia.org/T296143
[06:57:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1144:3315 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18512 and previous config saved to /var/cache/conftool/dbconfig/20220111-065750-root.json
[06:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P18513 and previous config saved to /var/cache/conftool/dbconfig/20220111-070216-marostegui.json
[07:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:17] (PS1) Marostegui: wmnet: Failover m2 master to dbproxy1013 [dns] - https://gerrit.wikimedia.org/r/752936 (https://phabricator.wikimedia.org/T298586)
[07:07:37] !log Failover m2 proxy from dbproxy1015 to dbproxy1013 T298586
[07:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:41] T298586: Upgrade all dbproxy hosts to Bullseye - https://phabricator.wikimedia.org/T298586
[07:07:49] (CR) Marostegui: [C: +2] wmnet: Failover m2 master to dbproxy1013 [dns] - https://gerrit.wikimedia.org/r/752936 (https://phabricator.wikimedia.org/T298586) (owner: Marostegui)
[07:11:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P18514 and previous config saved to /var/cache/conftool/dbconfig/20220111-071144-ladsgroup.json
[07:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:33] !log extensions/CentralAuth/maintenance/migrateHiddenLevel.php finished - T289068
[07:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:35] T289068: Normalise centralauth.gu_hidden - https://phabricator.wikimedia.org/T289068
[07:12:54] !log marostegui@cumin1001 dbctl
commit (dc=all): 'db1144:3315 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18515 and previous config saved to /var/cache/conftool/dbconfig/20220111-071254-root.json
[07:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T297191)', diff saved to https://phabricator.wikimedia.org/P18516 and previous config saved to /var/cache/conftool/dbconfig/20220111-071721-marostegui.json
[07:17:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:17:25] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[07:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T297191)', diff saved to https://phabricator.wikimedia.org/P18517 and previous config saved to /var/cache/conftool/dbconfig/20220111-071729-marostegui.json
[07:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P18518 and previous config saved to /var/cache/conftool/dbconfig/20220111-072649-ladsgroup.json
[07:26:50] (CR) Gergő Tisza: [C: +2] Strip comments from indicators [extensions/PageImages] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/752770
(https://phabricator.wikimedia.org/T298930) (owner: Gergő Tisza)
[07:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T297191)', diff saved to https://phabricator.wikimedia.org/P18519 and previous config saved to /var/cache/conftool/dbconfig/20220111-072847-marostegui.json
[07:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:50] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191
[07:41:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T296143)', diff saved to https://phabricator.wikimedia.org/P18520 and previous config saved to /var/cache/conftool/dbconfig/20220111-074154-ladsgroup.json
[07:41:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[07:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[07:41:58] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[07:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T296143)', diff saved to https://phabricator.wikimedia.org/P18521 and previous config saved to /var/cache/conftool/dbconfig/20220111-074202-ladsgroup.json
[07:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:05] !log ladsgroup@cumin1001 START - Cookbook sre.mysql.upgrade for db2117.codfw.wmnet
[07:42:06] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P18522 and previous config saved to /var/cache/conftool/dbconfig/20220111-074351-marostegui.json
[07:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2117.codfw.wmnet
[07:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:33] (Merged) jenkins-bot: Strip comments from indicators [extensions/PageImages] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/752770 (https://phabricator.wikimedia.org/T298930) (owner: Gergő Tisza)
[07:48:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T296143)', diff saved to https://phabricator.wikimedia.org/P18523 and previous config saved to /var/cache/conftool/dbconfig/20220111-074800-ladsgroup.json
[07:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:03] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[07:53:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[07:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:29] (CR) Gergő Tisza: [C: +2] "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/752769 (https://phabricator.wikimedia.org/T298504) (owner: Gergő Tisza)
[07:53:48] (PS1) Marostegui: dbproxy1020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/752989 (https://phabricator.wikimedia.org/T298586)
[07:54:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[07:54:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START
helmfile.d/services/mwdebug: apply on pinkunicorn
[07:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:43] (CR) Marostegui: [C: +2] dbproxy1020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/752989 (https://phabricator.wikimedia.org/T298586) (owner: Marostegui)
[07:55:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[07:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1020.eqiad.wmnet with OS bullseye
[07:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P18524 and previous config saved to /var/cache/conftool/dbconfig/20220111-075856-marostegui.json
[07:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:34] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2022-02-10 08:02:21 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/
[08:03:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P18525 and previous config saved to /var/cache/conftool/dbconfig/20220111-080305-ladsgroup.json
[08:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T297191)', diff saved to https://phabricator.wikimedia.org/P18526 and previous config saved to /var/cache/conftool/dbconfig/20220111-081400-marostegui.json
[08:14:03] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:04] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [08:14:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:14:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [08:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [08:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T297191)', diff saved to https://phabricator.wikimedia.org/P18527 and previous config saved to /var/cache/conftool/dbconfig/20220111-081442-marostegui.json [08:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T297191)', diff saved to 
https://phabricator.wikimedia.org/P18528 and previous config saved to /var/cache/conftool/dbconfig/20220111-081557-marostegui.json [08:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:46] (03CR) 10Ema: [C: 03+1] "Congrats on cluster id 100 \o/" [puppet] - 10https://gerrit.wikimedia.org/r/752146 (owner: 10Ssingh) [08:18:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P18529 and previous config saved to /var/cache/conftool/dbconfig/20220111-081809-ladsgroup.json [08:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:13] (03Merged) 10jenkins-bot: SECURITY: Fix several i18n XSS issues in suggested edits [extensions/GrowthExperiments] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/752769 (https://phabricator.wikimedia.org/T298504) (owner: 10Gergő Tisza) [08:21:24] 10SRE, 10Analytics-Radar, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) [08:22:26] (03PS4) 10Elukey: varnishkafka: use new ca bundle instead of the Puppet one [puppet] - 10https://gerrit.wikimedia.org/r/742747 (https://phabricator.wikimedia.org/T296064) [08:24:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1020.eqiad.wmnet with OS bullseye [08:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:19] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33181/console" [puppet] - 10https://gerrit.wikimedia.org/r/742747 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [08:25:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:25:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P18530 and previous config saved to /var/cache/conftool/dbconfig/20220111-083102-marostegui.json [08:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:14] (03CR) 10Marostegui: [C: 03+1] auto_schema: Force depool in codfw for mysql upgrades [software] - 10https://gerrit.wikimedia.org/r/752700 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [08:33:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T296143)', diff saved to https://phabricator.wikimedia.org/P18531 and previous config saved to /var/cache/conftool/dbconfig/20220111-083314-ladsgroup.json [08:33:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [08:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [08:33:18] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [08:33:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T296143)', diff saved to https://phabricator.wikimedia.org/P18532 and previous config saved to /var/cache/conftool/dbconfig/20220111-083322-ladsgroup.json [08:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:26] !log ladsgroup@cumin1001 START - Cookbook sre.mysql.upgrade for db2124.codfw.wmnet [08:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:15] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Force depool in codfw for mysql upgrades [software] - 10https://gerrit.wikimedia.org/r/752700 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [08:34:47] (03Merged) 10jenkins-bot: auto_schema: Force depool in codfw for mysql upgrades [software] - 10https://gerrit.wikimedia.org/r/752700 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [08:39:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2124.codfw.wmnet [08:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2023.codfw.wmnet with OS buster [08:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T296143)', diff saved to https://phabricator.wikimedia.org/P18533 and previous config saved to /var/cache/conftool/dbconfig/20220111-084151-ladsgroup.json [08:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:54] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [08:42:12] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:44:14] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:45:09] (03PS1) 10Marostegui: db2078: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/752993 (https://phabricator.wikimedia.org/T295965) [08:46:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P18534 and previous config saved to /var/cache/conftool/dbconfig/20220111-084606-marostegui.json [08:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2078.codfw.wmnet with OS bullseye [08:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:33] (03CR) 10Marostegui: [C: 03+2] db2078: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/752993 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [08:51:51] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [08:52:05] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [08:52:43] ^ me [08:53:18] ACKNOWLEDGEMENT - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [08:53:18] ACKNOWLEDGEMENT - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [08:56:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P18535 and previous config saved to 
/var/cache/conftool/dbconfig/20220111-085656-ladsgroup.json [08:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T297191)', diff saved to https://phabricator.wikimedia.org/P18536 and previous config saved to /var/cache/conftool/dbconfig/20220111-090111-marostegui.json [09:01:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [09:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [09:01:15] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [09:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T297191)', diff saved to https://phabricator.wikimedia.org/P18537 and previous config saved to /var/cache/conftool/dbconfig/20220111-090119-marostegui.json [09:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-misc site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:07:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T297191)', diff saved to https://phabricator.wikimedia.org/P18538 and previous config saved to /var/cache/conftool/dbconfig/20220111-090732-marostegui.json [09:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:35] 
T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [09:09:38] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:11:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2023.codfw.wmnet with OS buster [09:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P18539 and previous config saved to /var/cache/conftool/dbconfig/20220111-091201-ladsgroup.json [09:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:10] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:12:22] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:13:10] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:13:43] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubemaster1002.eqiad.wmnet [09:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:15:55] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster1002.eqiad.wmnet [09:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2078.codfw.wmnet with OS bullseye [09:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P18540 and previous config saved to /var/cache/conftool/dbconfig/20220111-092236-marostegui.json [09:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:22] !log Upgrading Jenkins and Apache on releases1002 & release2002 [09:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:50] !log cp4021 (upload), cp4027 (text): upgrade varnish to 6.0.9-1wm1 T298758 [09:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:52] T298758: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 [09:25:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2019.codfw.wmnet with OS buster [09:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T296143)', diff saved to https://phabricator.wikimedia.org/P18541 and previous config saved to /var/cache/conftool/dbconfig/20220111-092706-ladsgroup.json [09:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:09] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [09:29:10] !log jayme@cumin1001 START - 
Cookbook sre.ganeti.reboot-vm for VM kubemaster1001.eqiad.wmnet [09:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:14] !log jayme@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kubestagemaster1001.eqiad.wmnet [09:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:46] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10JMeybohm) [09:33:57] (03PS2) 10Cparle: Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752701 (https://phabricator.wikimedia.org/T297484) [09:35:31] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubemaster1001.eqiad.wmnet [09:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P18542 and previous config saved to /var/cache/conftool/dbconfig/20220111-093741-marostegui.json [09:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:13] !log jayme@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagemaster1001.eqiad.wmnet [09:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:58] (03PS2) 10JMeybohm: Use promtool in PATH rather than /usr/bin/promtool [alerts] - 10https://gerrit.wikimedia.org/r/752651 [09:45:42] (03PS1) 10Jcrespo: mediabackup: Backup testcommonswiki on codfw [puppet] - 10https://gerrit.wikimedia.org/r/752996 (https://phabricator.wikimedia.org/T262668) [09:46:00] (03PS2) 10Arturo Borrero Gonzalez: wmcs: GridConfigurator: run puppet agent in the master node when reconfiguring [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749739 [09:50:24] (03CR) 10JMeybohm: [C: 03+2] Use 
promtool in PATH rather than /usr/bin/promtool [alerts] - 10https://gerrit.wikimedia.org/r/752651 (owner: 10JMeybohm) [09:50:29] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [09:51:13] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad [09:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:26] (03Merged) 10jenkins-bot: Use promtool in PATH rather than /usr/bin/promtool [alerts] - 10https://gerrit.wikimedia.org/r/752651 (owner: 10JMeybohm) [09:52:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T297191)', diff saved to https://phabricator.wikimedia.org/P18543 and previous config saved to /var/cache/conftool/dbconfig/20220111-095246-marostegui.json [09:52:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:52:50] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [09:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T297191)', diff saved to https://phabricator.wikimedia.org/P18544 and previous config saved to /var/cache/conftool/dbconfig/20220111-095254-marostegui.json [09:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:33] (03PS2) 10Cparle: Enable support for references [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/752599 (https://phabricator.wikimedia.org/T230315) (owner: 10Matthias Mullie) [09:54:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: GridConfigurator: run puppet agent in the master node when reconfiguring [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749739 (owner: 10Arturo Borrero Gonzalez) [09:54:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T297191)', diff saved to https://phabricator.wikimedia.org/P18545 and previous config saved to /var/cache/conftool/dbconfig/20220111-095408-marostegui.json [09:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2019.codfw.wmnet with OS buster [09:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:14] (03PS2) 10Ideophagous: arywiki NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) [09:55:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] role::mariadb: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751725 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:56:58] (03Merged) 10jenkins-bot: wmcs: GridConfigurator: run puppet agent in the master node when reconfiguring [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749739 (owner: 10Arturo Borrero Gonzalez) [09:58:33] (03PS1) 10Ayounsi: Add msw2-eqiad to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/753000 [09:58:43] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad [09:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:55] (03PS1) 10Elukey: bigtop: move our internal APT repo config to Buster [puppet] - 10https://gerrit.wikimedia.org/r/753002 [10:02:51] (03CR) 10Muehlenhoff: bigtop: move our internal APT repo config to Buster (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/753002 (owner: 10Elukey) [10:09:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P18546 and previous config saved to /var/cache/conftool/dbconfig/20220111-100917-marostegui.json [10:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:58] (03PS1) 10Muehlenhoff: Update repo config for Bigtop to buster [puppet] - 10https://gerrit.wikimedia.org/r/753004 [10:14:25] (03CR) 10Matthias Mullie: [C: 03+2] Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752701 (https://phabricator.wikimedia.org/T297484) (owner: 10Cparle) [10:15:02] (03CR) 10Elukey: [C: 03+1] Update repo config for Bigtop to buster [puppet] - 10https://gerrit.wikimedia.org/r/753004 (owner: 10Muehlenhoff) [10:15:22] (03Abandoned) 10Elukey: bigtop: move our internal APT repo config to Buster [puppet] - 10https://gerrit.wikimedia.org/r/753002 (owner: 10Elukey) [10:16:38] (03CR) 10Jcrespo: "I heard this class or something similar was used or used to be used on cloud (VPS, not production) instances. 
This doesn't affect me, but " [puppet] - 10https://gerrit.wikimedia.org/r/751725 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:20:22] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to create an exec node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [10:23:42] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Backup testcommonswiki on codfw [puppet] - 10https://gerrit.wikimedia.org/r/752996 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [10:24:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P18547 and previous config saved to /var/cache/conftool/dbconfig/20220111-102421-marostegui.json [10:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:12] (03PS3) 10Btullis: Exclude log4j_extras from the classpath for coordinators [puppet] - 10https://gerrit.wikimedia.org/r/752673 (https://phabricator.wikimedia.org/T297468) [10:26:59] (03CR) 10Kormat: [C: 03+1] role::mariadb::proxy: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751726 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:28:05] (03CR) 10Kormat: [C: 03+1] role::mariadb: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751725 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:28:19] (03PS1) 10Btullis: Mark reedy as kerberos enabled [puppet] - 10https://gerrit.wikimedia.org/r/753007 (https://phabricator.wikimedia.org/T298951) [10:29:14] (03CR) 10David Caro: [C: 03+2] role::mariadb: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751725 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:29:30] (03CR) 10David Caro: [C: 03+2] role::mariadb::proxy: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751726 (https://phabricator.wikimedia.org/T272559) (owner: 10David 
Caro) [10:30:12] (03CR) 10Reedy: [C: 03+1] Mark reedy as kerberos enabled [puppet] - 10https://gerrit.wikimedia.org/r/753007 (https://phabricator.wikimedia.org/T298951) (owner: 10Btullis) [10:30:20] (03CR) 10Muehlenhoff: [C: 03+2] Update repo config for Bigtop to buster [puppet] - 10https://gerrit.wikimedia.org/r/753004 (owner: 10Muehlenhoff) [10:30:44] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [10:30:44] dcaro: merging your patches along, ok? [10:31:06] moritzm: sure [10:31:08] just went to log in [10:31:10] thanks [10:31:19] ack, done [10:32:59] (03CR) 10David Caro: role::mariadb: remove unused role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751725 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:39:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T297191)', diff saved to https://phabricator.wikimedia.org/P18548 and previous config saved to /var/cache/conftool/dbconfig/20220111-103927-marostegui.json [10:39:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:31] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [10:39:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T297191)', diff saved to https://phabricator.wikimedia.org/P18549 and previous config saved to /var/cache/conftool/dbconfig/20220111-103941-marostegui.json [10:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:10] (03PS1) 10Muehlenhoff: Fixup bigtop sync and repository section [puppet] - 10https://gerrit.wikimedia.org/r/753008 [10:40:34] (03CR) 10Btullis: "Looks OK to me." [puppet] - 10https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [10:41:16] (03CR) 10Btullis: [C: 03+2] Mark reedy as kerberos enabled [puppet] - 10https://gerrit.wikimedia.org/r/753007 (https://phabricator.wikimedia.org/T298951) (owner: 10Btullis) [10:42:07] (03CR) 10Muehlenhoff: [C: 03+2] Fixup bigtop sync and repository section [puppet] - 10https://gerrit.wikimedia.org/r/753008 (owner: 10Muehlenhoff) [10:44:02] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33182/console" [puppet] - 10https://gerrit.wikimedia.org/r/752673 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [10:46:31] (03CR) 10Majavah: kerberos: manage users with custom puppet type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751100 (https://phabricator.wikimedia.org/T292389) (owner: 10Majavah) [10:46:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T297191)', diff saved to https://phabricator.wikimedia.org/P18550 and previous config saved to 
/var/cache/conftool/dbconfig/20220111-104654-marostegui.json [10:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:58] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [10:48:19] (03PS1) 10David Caro: wmcs::db: remove used roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/753010 (https://phabricator.wikimedia.org/T272559) [10:51:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) > Only exception is the CR links which will be LC-SC to land on the patch panel. I should clarify that if we pre-cable the patch... [10:53:18] (03CR) 10David Caro: "Somehow I forgot to send this patch before xd" [puppet] - 10https://gerrit.wikimedia.org/r/753010 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:00:28] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10MoritzMuehlenhoff) With the amount of information provided by Dell, we can't reliably tell. PERC controllers are rebranded Broadcom controllers, but there's no statement to which Broadcom controller PERC... [11:01:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P18551 and previous config saved to /var/cache/conftool/dbconfig/20220111-110159-marostegui.json [11:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:28] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Marostegui) Thanks @MoritzMuehlenhoff - if this is only available from Bullseye, I think that's fine from the DB point of view. We are almost finishing our Bullseye testing and I nothing changes dramatic... 
[11:14:27] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:16:00] (03CR) 10Muehlenhoff: [C: 03+2] Extend logstash Cumin alias with new Opensearch roles [puppet] - 10https://gerrit.wikimedia.org/r/752631 (owner: 10Muehlenhoff) [11:16:26] (03PS2) 10Hnowlan: maps: Install s3 client cli/lib [puppet] - 10https://gerrit.wikimedia.org/r/746929 (owner: 10Jgiannelos) [11:17:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P18553 and previous config saved to /var/cache/conftool/dbconfig/20220111-111704-marostegui.json [11:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:35] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:19:39] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:20:15] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:20:17] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:20:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet 
[11:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:37] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:21:37] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:22:41] (03CR) 10Ssingh: [C: 03+2] hieradata: add durum cluster [puppet] - 10https://gerrit.wikimedia.org/r/752146 (owner: 10Ssingh) [11:23:11] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) For the record, I'm taking care of this release, and given I am annoyed at how we manage image versions for shellbox, I'm also slightly modifying the procedure. I'... [11:23:45] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:25:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18554 and previous config saved to /var/cache/conftool/dbconfig/20220111-112514-root.json [11:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet [11:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:13] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:57] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:22] (03PS2) 10Cathal Mooney: admins: add jvargas to ldap_only_admins, added to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/752725 (https://phabricator.wikimedia.org/T298719) (owner: 10Dzahn) [11:28:13] (03CR) 10Cathal Mooney: [C: 03+2] "Thanks Daniel! Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/752725 (https://phabricator.wikimedia.org/T298719) (owner: 10Dzahn) [11:30:02] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [11:32:00] (03CR) 10Hnowlan: [C: 03+2] maps: Install s3 client cli/lib [puppet] - 10https://gerrit.wikimedia.org/r/746929 (owner: 10Jgiannelos) [11:32:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T297191)', diff saved to https://phabricator.wikimedia.org/P18555 and previous config saved to /var/cache/conftool/dbconfig/20220111-113208-marostegui.json [11:32:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:32:12] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [11:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T297191)', diff saved to https://phabricator.wikimedia.org/P18556 and previous config saved to /var/cache/conftool/dbconfig/20220111-113216-marostegui.json [11:32:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:08] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10cmooney) a:03cmooney Thanks Daniel for all the work on this, patch is now merged. @JVargas is all good from your point of view? Otherwise I will proceed to close this... [11:35:13] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:35:35] Hello, I have an issue with Gerrit, again. [11:35:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet [11:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:57] On git fetch for mediawiki/core. [11:35:59] fetch-pack: unexpected disconnect while reading sideband packet [11:35:59] fatal: early EOF [11:35:59] Connection to gerrit.wikimedia.org closed by remote host. [11:35:59] fatal: fetch-pack: invalid index-pack output [11:36:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T297191)', diff saved to https://phabricator.wikimedia.org/P18557 and previous config saved to /var/cache/conftool/dbconfig/20220111-113628-marostegui.json [11:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18558 and previous config saved to /var/cache/conftool/dbconfig/20220111-114018-root.json [11:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet [11:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:03] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:47:30] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [11:51:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P18559 and previous config saved to /var/cache/conftool/dbconfig/20220111-115133-marostegui.json [11:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:26] (03PS1) 10Awight: Allow aliases to be integers in addition to strings [extensions/TemplateData] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/752775 (https://phabricator.wikimedia.org/T298795) [11:55:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18560 and previous config saved to /var/cache/conftool/dbconfig/20220111-115522-root.json [11:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:02] (03PS1) 10Jbond: profile::installserver::proxy: update suiqd template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) [11:56:04] !log rebalance ganeti row A (all nodes reimaged to Buster) [11:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:38] (03CR) 10jerkins-bot: [V: 04-1] profile::installserver::proxy: update suiqd template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [11:58:21] (03PS2) 10Jbond: profile::installserver::proxy: update suiqd template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) [11:59:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33184/console" [puppet] - 
10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [11:59:24] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10Aklapper) IIUC this isn't complete yet per items in https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#LDAP_access [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220111T1200). [12:00:04] cormacparle and matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:12] * cormacparle waves [12:00:12] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbooks to create each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [12:00:14] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) [12:00:16] o/ [12:00:16] (03PS1) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [12:00:20] * urbanecm waves [12:00:24] the first backport has i18n changes, I’m not sure how to deploy that [12:00:30] (03PS3) 10Jbond: profile::installserver::proxy: update suiqd template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) [12:00:32] !log reverting kubetcd2004.codfw.wmnet back to "plain" storage [12:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:46] Lucas_WMDE: you'd need scap sync-world, which 
we normally try to avoid [12:00:55] that’s what I feared [12:01:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33185/console" [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [12:01:09] cormacparle: matthiasmullie: are the i18n backports really necessary? They take significant time to be done, so that's why I'm asking [12:01:18] (we can still done if urgent, but I'd like to know the answer to "why") [12:01:22] *do it if urgent [12:01:50] and to avoid confusion... [12:01:53] I can deploy today [12:02:01] we're migrating a preference, and what used to be a checkbox is now a dropdown [12:02:12] so yeah we kinda do need the i18n change [12:02:27] and it needs to be done in a backport so we can run the maint script to do the migration [12:02:37] sounds like a good enough reason to me [12:02:47] (03CR) 10Urbanecm: [C: 03+2] Update the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751836 (https://phabricator.wikimedia.org/T297484) (owner: 10Cparle) [12:03:00] (03CR) 10Urbanecm: [C: 03+2] Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752701 (https://phabricator.wikimedia.org/T297484) (owner: 10Cparle) [12:03:06] (03CR) 10Awight: [V: 03+1 C: 03+1] "Works locally." [extensions/TemplateData] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/752775 (https://phabricator.wikimedia.org/T298795) (owner: 10Awight) [12:03:12] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:03:20] PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:03:35] cormacparle: does the config depend on the backports in some way? 
[12:03:48] nope, different thing [12:04:29] I mean - the config change is for a completely different issure [12:04:34] issue [12:04:35] cormacparle: i see. I also see you're a deployer -- want to do the config yourself? [12:04:44] sure [12:05:00] go ahead then :) [12:05:04] the 2 backport changes are in a chain btw, so there's only one sync required [12:05:13] cool, doing that now [12:05:41] cormacparle: yeah. Those will need special-treatment due to the i18n changes being done, so we'd need to sync everything via the sync-world command [12:05:56] (03PS1) 10Elukey: aptrepo: change settings for the Bigtop repository [puppet] - 10https://gerrit.wikimedia.org/r/753019 [12:06:06] sorry :( [12:06:30] not your fault :) [12:06:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P18561 and previous config saved to /var/cache/conftool/dbconfig/20220111-120638-marostegui.json [12:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:41] (03CR) 10Cparle: [C: 03+2] Enable support for references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752599 (https://phabricator.wikimedia.org/T230315) (owner: 10Matthias Mullie) [12:07:45] cormacparle: ping me if you need any help with the config patch deployment. [12:08:00] will do thanks urbanecm [12:08:42] (03Merged) 10jenkins-bot: Enable support for references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752599 (https://phabricator.wikimedia.org/T230315) (owner: 10Matthias Mullie) [12:09:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/753019 (owner: 10Elukey) [12:10:00] (03CR) 10Elukey: [C: 03+2] aptrepo: change settings for the Bigtop repository [puppet] - 10https://gerrit.wikimedia.org/r/753019 (owner: 10Elukey) [12:10:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P18562 and previous config saved to /var/cache/conftool/dbconfig/20220111-121025-root.json [12:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:42] (03PS4) 10Jbond: profile::installserver::proxy: update suiqd template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) [12:11:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33186/console" [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [12:12:51] (03CR) 10Jbond: [V: 03+1] profile::installserver::proxy: update suiqd template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [12:13:12] (03PS5) 10Jbond: profile::installserver::proxy: update suiqd template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) [12:14:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubetcd2004.codfw.wmnet with reason: switch to plain disk storage [12:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubetcd2004.codfw.wmnet with reason: switch to 
plain disk storage [12:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:26] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:15:49] config change fine on debug, syncing now [12:15:51] !log cparle@deploy1002 Synchronized wmf-config: Config: [[gerrit:752599|Enable support for references (T230315)]] (duration: 01m 00s) [12:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:15:54] T230315: [XL] Create a way to see and add references to structured data on Commons (MediaInfo) statements - https://phabricator.wikimedia.org/T230315 [12:15:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:12] (03PS1) 10Giuseppe Lavagetto: shellbox: rationalize version handling, promote to 1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/753020 [12:16:14] (03PS1) 10Giuseppe Lavagetto: shellbox-*: promote to new build [deployment-charts] - 10https://gerrit.wikimedia.org/r/753021 (https://phabricator.wikimedia.org/T292322) [12:16:42] cormacparle: i take it that we're waiting at the CI now [12:16:57] will you want to try the backports too (via the sync-world command)? 
[12:17:02] (I'm happy to do it for you, just asking) [12:17:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:33] happy to do it, but never used sync-world before [12:17:45] cormacparle: then i'll guide you :) [12:17:53] the start is very similar to normal deployments [12:18:00] (fetch to deployment, scap pull on a debug server) [12:18:05] 10SRE, 10wikitech.wikimedia.org, 10Sustainability (Incident Followup), 10User-LSobanski: Incident response tools operational readiness review - https://phabricator.wikimedia.org/T290130 (10LSobanski) a:05LSobanski→03None [12:18:22] but, at the debug server, i18n changes probably will not work (you'll either see or the outdated message) [12:18:42] ok cool, will do that for a start anyway (config patch is now synced and seems fine) [12:19:14] once it merges and you tested it, ping me, and I'll share the rest :)) [12:19:35] will do! 
[12:20:54] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Allow aliases to be integers in addition to strings [extensions/TemplateData] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/752775 (https://phabricator.wikimedia.org/T298795) (owner: 10Awight) [12:21:09] (03Merged) 10jenkins-bot: Update the way the search interface is set [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751836 (https://phabricator.wikimedia.org/T297484) (owner: 10Cparle) [12:21:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T297191)', diff saved to https://phabricator.wikimedia.org/P18563 and previous config saved to /var/cache/conftool/dbconfig/20220111-122143-marostegui.json [12:21:45] (03Merged) 10jenkins-bot: Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752701 (https://phabricator.wikimedia.org/T297484) (owner: 10Cparle) [12:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:47] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [12:21:53] urbanecm: cormacparle: hi, can you please ping me when done deploying? I have a few patches of my own [12:22:02] sure [12:22:09] taavi: sure :) [12:22:14] (also, hi) [12:26:41] cormacparle: how are the tests going? 🙂 [12:26:53] cormacparle: since there's a maint script that needs to be run, I guess you may want to do that before sync-world as well? [12:27:19] oh [12:27:21] yes indeed [12:27:52] erm ... [12:27:54] do note that sync-world can take up to 40 minutes to complete (normally it finishes within 20) [12:28:01] https://www.irccloud.com/pastebin/LBdeJADJ/ [12:28:11] (03PS1) 10Jelto: deployment_server,::helm: remove helm2 support [puppet] - 10https://gerrit.wikimedia.org/r/753026 (https://phabricator.wikimedia.org/T251305) [12:28:14] there are new commits in other extensions ... 
[12:28:19] that's not what I expected [12:28:20] cormacparle: that's security patches [12:28:23] ignore it [12:28:26] kk cool [12:28:47] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [12:29:15] the runtime of sync-world means there might be nearly an hour during which the code needs to support both versions (old and new) [12:29:24] is that...fine? cormacparle [12:29:31] yep [12:29:50] okay, good [12:32:29] 10SRE, 10SRE-Access-Requests: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (10cmooney) p:05Triage→03Medium a:03cmooney [12:35:35] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) [12:35:37] (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbooks to create each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [12:35:39] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) [12:36:57] (03CR) 10Ayounsi: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [12:37:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:49] (03PS1) 10Jbond: P:installserver::proxy: Add domain whitelist to proxy [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) [12:38:22] (03PS2) 10Jbond: P:installserver::proxy: Add domain whitelist to proxy [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) [12:38:44] (03PS2) 10Jelto: deployment_server,::helm: remove helm2 support [puppet] - 10https://gerrit.wikimedia.org/r/753026 (https://phabricator.wikimedia.org/T251305) [12:39:25] (03PS3) 10Jbond: P:installserver::proxy: Add domain whitelist to proxy [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) [12:40:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33189/console" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [12:41:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:41:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:15] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:41:32] (03CR) 10Jbond: [C: 04-1] "This is an example change applying a 
whitelist to the proxy, going to -1 this for now just to make sure it dosn't get accidentally merged " [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [12:42:08] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33190/console" [puppet] - 10https://gerrit.wikimedia.org/r/753026 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [12:44:36] (03PS6) 10Jbond: profile::installserver::proxy: update suiqd template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) [12:44:55] (03CR) 10Jbond: profile::installserver::proxy: update suiqd template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [12:44:57] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:45:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:13] (03PS4) 10Jbond: P:installserver::proxy: Add domain whitelist to proxy [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) [12:47:03] (03CR) 10EllenR: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752708 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [12:49:55] cormacparle: how's the testing going? :)) [12:50:23] problem with the maint script :( [12:51:03] cormacparle: which kind of a problem? 
[12:51:25] can't write the data - there are already duplicates in the db we didn't know about [12:51:30] I think we'll have to revert both of those patches, because without the maint script they break the interface [12:51:34] then we need to revert [12:51:46] yeah [12:51:54] (03PS1) 10Urbanecm: Revert "Updated maint script to use fewer queries" [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752776 (https://phabricator.wikimedia.org/T297484) [12:52:01] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "revert" [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752776 (https://phabricator.wikimedia.org/T297484) (owner: 10Urbanecm) [12:52:08] (03PS1) 10Urbanecm: Revert "Update the way the search interface is set" [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752777 (https://phabricator.wikimedia.org/T297484) [12:52:16] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "revert" [extensions/MediaSearch] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/752777 (https://phabricator.wikimedia.org/T297484) (owner: 10Urbanecm) [12:52:29] cormacparle: done [12:52:35] (config left live, as you said it's a diff thing) [12:52:44] yes config is fine [12:52:59] excellent thanks very much! we'll try again tomorrow [12:53:17] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:53:20] good luck in resolving the data inconsistency issue :) [12:53:39] DB migrations are one of the things that no one notices if done correctly and everyone notices if an error happens :/ [12:54:00] haha indeed! 
[12:58:57] 10SRE-OnFire (FY2021/2022-Q3), 10SRE Observability (FY2021/2022-Q3): incidents occurring during Q2 and Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [12:59:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM planet1002.eqiad.wmnet [12:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM planet1002.eqiad.wmnet [13:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:23] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM people1003.eqiad.wmnet [13:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM people1003.eqiad.wmnet [13:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:49] (03CR) 10Hnowlan: [C: 03+2] 
Disable tilerator in all envs maps are deployed [puppet] - 10https://gerrit.wikimedia.org/r/752145 (https://phabricator.wikimedia.org/T298246) (owner: 10Jgiannelos) [13:24:01] (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) [13:24:03] (03PS4) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbooks to create each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [13:24:05] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) [13:24:07] (03PS2) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [13:24:17] (03CR) 10Btullis: [V: 03+1] Exclude log4j_extras from the classpath for coordinators (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/752673 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [13:24:27] ACKNOWLEDGEMENT - Check systemd state on maps1006 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator disabled intentionally https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:27] ACKNOWLEDGEMENT - tilerator on maps1006 is CRITICAL: connect to address 10.64.0.18 and port 6534: Connection refused Hnowlan tilerator disabled intentionally https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:24:27] ACKNOWLEDGEMENT - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator disabled intentionally https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:27] ACKNOWLEDGEMENT - tilerator on 
maps1008 is CRITICAL: connect to address 10.64.16.27 and port 6534: Connection refused Hnowlan tilerator disabled intentionally https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:24:28] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator disabled intentionally https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:28] ACKNOWLEDGEMENT - tilerator on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 6534: Connection refused Hnowlan tilerator disabled intentionally https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:26:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [13:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [13:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:26:20] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [13:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [13:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T297191)', diff saved to https://phabricator.wikimedia.org/P18564 and previous config saved to /var/cache/conftool/dbconfig/20220111-132627-marostegui.json [13:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:30] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [13:27:29] ACKNOWLEDGEMENT - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator disabled intentionally https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:29] ACKNOWLEDGEMENT - tilerator on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 6534: Connection refused Hnowlan tilerator disabled intentionally https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:27:30] ACKNOWLEDGEMENT - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan tilerator disabled intentionally https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:30] ACKNOWLEDGEMENT - tilerator on maps1007 is CRITICAL: connect to address 10.64.16.6 and port 6534: Connection refused Hnowlan tilerator disabled intentionally https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:27:35] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1147 (T297191)', diff saved to https://phabricator.wikimedia.org/P18565 and previous config saved to /var/cache/conftool/dbconfig/20220111-132734-marostegui.json [13:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:51] (03PS4) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) [13:27:53] (03PS5) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbooks to create each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [13:27:55] (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) [13:27:59] (03PS3) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [13:29:19] ACKNOWLEDGEMENT - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Tilerator disabled - T298246 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:19] ACKNOWLEDGEMENT - tilerator on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6534: Connection refused Hnowlan Tilerator disabled - T298246 https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [13:29:56] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. 
[13:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:02] (03CR) 10Kormat: [C: 03+1] wmcs::db: remove used roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/753010 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [13:33:55] !log installing 4.9.290 kernels on stretch systems (no reboots yet) [13:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [13:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:56] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [13:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P18567 and previous config saved to /var/cache/conftool/dbconfig/20220111-134239-marostegui.json [13:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
[13:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:51] (03PS1) 10Marostegui: dbproxy1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753041 (https://phabricator.wikimedia.org/T298586) [13:49:39] (03PS2) 10Muehlenhoff: Make build2001 a build host [puppet] - 10https://gerrit.wikimedia.org/r/751146 [13:50:02] (03CR) 10Marostegui: [C: 03+2] dbproxy1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753041 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [13:50:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1021.eqiad.wmnet with OS bullseye [13:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:21] (03PS2) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/723658 (owner: 10PipelineBot) [13:55:07] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10JVargas) Thank you so much, @Dzahn and @cmooney! Appreciate the quick support for access. [13:57:13] (03PS1) 10Kormat: wmfdb/db: Add module for querying databases. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/753045 (https://phabricator.wikimedia.org/T298236) [13:57:18] (03PS1) 10Jbond: C:mw_rc_irc::ircserver: Refresh ircd services on config changes [puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) [13:57:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P18568 and previous config saved to /var/cache/conftool/dbconfig/20220111-135744-marostegui.json [13:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33191/console" [puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: 10Jbond) [13:59:28] (03PS1) 10Cathal Mooney: Add Sandra Ebele Nwachukwu to production access [puppet] - 10https://gerrit.wikimedia.org/r/753049 (https://phabricator.wikimedia.org/T298786) [14:00:02] (03CR) 10Jbond: [V: 03+1] "ready" [puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: 10Jbond) [14:04:10] (03CR) 10Klausman: [C: 03+1] "Just one nit, other than that, LGTM" [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753045 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [14:10:27] (03PS1) 10Mforns: analytics:refinery:job:data_purge: Add deletion for anomaly detection [puppet] - 10https://gerrit.wikimedia.org/r/753052 (https://phabricator.wikimedia.org/T298972) [14:12:10] (03PS2) 10Mforns: analytics:refinery:job:data_purge: Add deletion for anomaly detection [puppet] - 10https://gerrit.wikimedia.org/r/753052 (https://phabricator.wikimedia.org/T298972) [14:12:12] (03CR) 10Cathal Mooney: [C: 03+2] Add Sandra Ebele Nwachukwu to production access [puppet] - 10https://gerrit.wikimedia.org/r/753049 (https://phabricator.wikimedia.org/T298786) 
(owner: 10Cathal Mooney) [14:12:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T297191)', diff saved to https://phabricator.wikimedia.org/P18569 and previous config saved to /var/cache/conftool/dbconfig/20220111-141249-marostegui.json [14:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:53] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [14:12:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [14:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [14:12:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [14:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [14:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:12] (03PS1) 10Jbond: O:puppetmaster::standalone: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/753053 (https://phabricator.wikimedia.org/T284082) [14:13:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [14:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [14:13:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T297191)', diff saved to https://phabricator.wikimedia.org/P18570 and previous config saved to /var/cache/conftool/dbconfig/20220111-141318-marostegui.json [14:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:08] (03CR) 10Jbond: [C: 03+2] O:puppetmaster::standalone: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/753053 (https://phabricator.wikimedia.org/T284082) (owner: 10Jbond) [14:14:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T297191)', diff saved to https://phabricator.wikimedia.org/P18571 and previous config saved to /var/cache/conftool/dbconfig/20220111-141425-marostegui.json [14:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:54] topranks: have merged your access request change [14:15:15] ok sry got pulled away for a moment. [14:15:20] no probs [14:15:21] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (10jbond) [14:15:46] thanks :) [14:15:51] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Add type validation to puppetmaster::standalone - https://phabricator.wikimedia.org/T284082 (10jbond) 05Open→03Resolved a:03jbond [14:18:19] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10Dzahn) @cmooney Cool, thanks for closing this! What Andre meant here above is that we are also supposed to add users to the Phabricator group called WMF-NDA when we add people into the LDAP g... [14:19:13] (03CR) 10Cathal Mooney: "Great work John. 
Overall I am fully supportive of this change, it adds a very valuable layer of security and I don't expect it will be hu" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [14:19:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1021.eqiad.wmnet with OS bullseye [14:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:06] (03PS2) 10Kormat: wmfdb/db: Add module for querying databases. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753045 (https://phabricator.wikimedia.org/T298236) [14:21:21] (03CR) 10Kormat: wmfdb/db: Add module for querying databases. (031 comment) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753045 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [14:21:24] jouncebot: nowandnext [14:21:24] No deployments scheduled for the next 2 hour(s) and 38 minute(s) [14:21:24] In 2 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220111T1700) [14:22:05] (03CR) 10Majavah: [C: 03+2] reverse-proxy: add drmrs ranges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751952 (https://phabricator.wikimedia.org/T282787) (owner: 10Majavah) [14:22:16] (03CR) 10Klausman: [C: 03+1] wmfdb/db: Add module for querying databases. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753045 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [14:22:49] (03Merged) 10jenkins-bot: reverse-proxy: add drmrs ranges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751952 (https://phabricator.wikimedia.org/T282787) (owner: 10Majavah) [14:23:20] (03CR) 10Kormat: [C: 03+2] wmfdb/db: Add module for querying databases. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753045 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [14:25:00] (03Merged) 10jenkins-bot: wmfdb/db: Add module for querying databases. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/753045 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [14:25:36] !log taavi@deploy1002 Synchronized wmf-config/reverse-proxy.php: Config: [[gerrit:751952|reverse-proxy: add drmrs ranges (T282787)]] (duration: 01m 36s) [14:25:37] (03PS2) 10Majavah: Clean up nova-network remains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751949 [14:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:40] T282787: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 [14:26:17] (03CR) 10Majavah: [C: 03+2] Clean up nova-network remains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751949 (owner: 10Majavah) [14:26:55] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [14:27:01] (03Merged) 10jenkins-bot: Clean up nova-network remains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751949 (owner: 10Majavah) [14:28:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P18572 and previous config saved to /var/cache/conftool/dbconfig/20220111-142930-marostegui.json [14:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:04] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:751949|Clean up nova-network remains]] (1/2) (duration: 02m 49s) [14:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:32:50] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:59] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T298719 (10cmooney) 05In progress→03Resolved @aklapper / @dzahn many thanks for spotting the omission and kindly correcting. Duly noted for future similar requests. [14:33:05] RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms [14:33:55] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:751949|Clean up nova-network remains]] (2/2) (duration: 02m 40s) [14:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:01] * taavi done [14:35:21] !log Upgrade pc1014 mysql [14:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:12] (03CR) 10Ayounsi: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [14:38:23] (03CR) 10Elukey: [C: 03+1] Exclude log4j_extras from the classpath for coordinators [puppet] - 10https://gerrit.wikimedia.org/r/752673 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [14:38:36] !log disable ping-offload in eqiad [14:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P18573 and previous config saved to /var/cache/conftool/dbconfig/20220111-144435-marostegui.json [14:44:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ping1002.eqiad.wmnet [14:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:40] (03PS3) 10Eigyan: wmf-config: Update coverage to 0.5 in gdi-survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752708 (https://phabricator.wikimedia.org/T297623) [14:48:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ping1002.eqiad.wmnet [14:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM zookeeper-test1002.eqiad.wmnet [14:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:33] PROBLEM - PHP7 jobrunner on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:54:25] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:54:39] RECOVERY - PHP7 jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time 
https://wikitech.wikimedia.org/wiki/Jobrunner [14:55:17] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10SRE Observability (FY2021/2022-Q3): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10lmata) [14:55:43] PROBLEM - Disk space on pybal-test2002 is CRITICAL: DISK CRITICAL - free space: / 170 MB (1% inode=83%): /tmp 170 MB (1% inode=83%): /var/tmp 170 MB (1% inode=83%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=pybal-test2002&var-datasource=codfw+prometheus/ops [14:56:25] !log aokoth@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM etherpad1002.eqiad.wmnet [14:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:31] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM etherpad1002.eqiad.wmnet rebooted by aokoth@cumin1001 with reason: Ganeti Migration [14:57:36] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (10cmooney) Hi Sandra, I have now: - Added you to LDAP group 'wmf' - Added you as a member of the 'WMF-NDA' group in Phabricator - Ad... 
[14:58:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM zookeeper-test1002.eqiad.wmnet [14:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:11] PROBLEM - Check systemd state on zookeeper-test1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:43] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /boot 7 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1004&var-datasource=eqiad+prometheus/ops [14:59:07] PROBLEM - PHP7 rendering on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:59:19] PROBLEM - PHP7 jobrunner on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:59:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T297191)', diff saved to https://phabricator.wikimedia.org/P18574 and previous config saved to /var/cache/conftool/dbconfig/20220111-145939-marostegui.json [14:59:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [14:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [14:59:43] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [14:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:47] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T297191)', diff saved to https://phabricator.wikimedia.org/P18575 and previous config saved to /var/cache/conftool/dbconfig/20220111-145947-marostegui.json [14:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:09] !log aokoth@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM etherpad1002.eqiad.wmnet [15:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T297191)', diff saved to https://phabricator.wikimedia.org/P18576 and previous config saved to /var/cache/conftool/dbconfig/20220111-150054-marostegui.json [15:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:41] RECOVERY - PHP7 jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.307 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [15:02:04] !log aokoth@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM otrs1001.eqiad.wmnet [15:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:11] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM otrs1001.eqiad.wmnet rebooted by aokoth@cumin1001 with reason: Ganeti Migration [15:04:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM rpki1001.eqiad.wmnet [15:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:35] PROBLEM - Disk space on prometheus2004 is CRITICAL: DISK CRITICAL - free space: /boot 7 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [15:05:53] (03PS1) 10Hnowlan: maps: add cassandra toggle, 
disable cassandra on maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) [15:06:17] PROBLEM - PHP7 jobrunner on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [15:07:01] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33192/console" [puppet] - 10https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [15:07:41] (03CR) 10Hnowlan: maps: add cassandra toggle, disable cassandra on maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [15:08:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM rpki1001.eqiad.wmnet [15:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:05] !log aokoth@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM otrs1001.eqiad.wmnet [15:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:15] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:11:41] (03CR) 10David Caro: [C: 03+2] wmcs::db: remove used roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/753010 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:12:08] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10Arnoldokoth) [15:12:15] PROBLEM - Disk space 
on prometheus2003 is CRITICAL: DISK CRITICAL - free space: /boot 7 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [15:12:35] RECOVERY - PHP7 jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [15:12:55] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [15:13:17] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: /boot 7 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [15:14:37] 10SRE, 10SRE-Access-Requests: Add bking as icinga user - https://phabricator.wikimedia.org/T298738 (10cmooney) p:05Triage→03Medium a:03cmooney [15:14:50] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10cmooney) p:05Triage→03Low [15:15:05] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [15:15:21] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [15:15:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P18577 and previous config saved to /var/cache/conftool/dbconfig/20220111-151558-marostegui.json [15:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:59] PROBLEM - PHP7 jobrunner on mw1303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Jobrunner [15:17:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:19:45] (03PS1) 10Cparle: Updated maint script to use fewer queries [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753060 (https://phabricator.wikimedia.org/T297484) [15:19:58] (03PS1) 10Cathal Mooney: Adding user Antoine Qu'hen to analytics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/753061 (https://phabricator.wikimedia.org/T298657) [15:20:47] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:11] (03CR) 10Cathal Mooney: [C: 03+2] Adding user Antoine Qu'hen to analytics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/753061 (https://phabricator.wikimedia.org/T298657) (owner: 10Cathal Mooney) [15:22:45] !log systemctl reset-failed ifup@ens5.service on otrs1001 T273026 [15:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:48] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [15:22:53] (03PS2) 10Giuseppe Lavagetto: shellbox: rationalize version handling, promote to 1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/753020 [15:22:55] (03PS2) 10Giuseppe Lavagetto: shellbox-*: promote to new build [deployment-charts] - 10https://gerrit.wikimedia.org/r/753021 (https://phabricator.wikimedia.org/T292322) [15:22:57] (03PS1) 10Giuseppe Lavagetto: shellbox: remove useless files/stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/753062 [15:23:25] (03CR) 10jerkins-bot: [V: 04-1] shellbox: remove useless files/stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/753062 (owner: 10Giuseppe 
Lavagetto) [15:24:26] (03CR) 10jerkins-bot: [V: 04-1] shellbox: rationalize version handling, promote to 1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/753020 (owner: 10Giuseppe Lavagetto) [15:24:36] (03CR) 10jerkins-bot: [V: 04-1] shellbox-*: promote to new build [deployment-charts] - 10https://gerrit.wikimedia.org/r/753021 (https://phabricator.wikimedia.org/T292322) (owner: 10Giuseppe Lavagetto) [15:24:55] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban, 10Patch-For-Review: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (10cmooney) a:05BTullis→03cmooney On the back of Olja's explicit approval I've added the username to the '... [15:30:10] !log Decommissioning cassandra instance restbase2009-a via nodetool [15:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P18578 and previous config saved to /var/cache/conftool/dbconfig/20220111-153103-marostegui.json [15:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [15:33:29] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) I can assist with this. I believe once SRE removes the aliases from their side, ITS can ad... [15:39:32] (03PS1) 10Cathal Mooney: Add Elliot Eggleston (ejegg) to fr-tech-ops Icinga contact group. 
[puppet] - 10https://gerrit.wikimedia.org/r/753065 (https://phabricator.wikimedia.org/T298649) [15:44:30] 10SRE, 10SRE-Access-Requests, 10Fundraising-Backlog, 10observability, and 2 others: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10cmooney) a:03cmooney @jglesson hey just following up on this as I am on Clinic Duty this week.... [15:44:34] (03CR) 10Cathal Mooney: [C: 03+2] Add Elliot Eggleston (ejegg) to fr-tech-ops Icinga contact group. [puppet] - 10https://gerrit.wikimedia.org/r/753065 (https://phabricator.wikimedia.org/T298649) (owner: 10Cathal Mooney) [15:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T297191)', diff saved to https://phabricator.wikimedia.org/P18579 and previous config saved to /var/cache/conftool/dbconfig/20220111-154608-marostegui.json [15:46:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:46:12] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [15:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T297191)', diff saved to https://phabricator.wikimedia.org/P18580 and previous config saved to /var/cache/conftool/dbconfig/20220111-154615-marostegui.json [15:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:26] 10SRE, 10ops-eqiad, 10DC-Ops: Rack msw2-eqiad in new cage - 
https://phabricator.wikimedia.org/T298980 (10ayounsi) p:05Triage→03Medium [15:47:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T297191)', diff saved to https://phabricator.wikimedia.org/P18582 and previous config saved to /var/cache/conftool/dbconfig/20220111-154722-marostegui.json [15:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:37] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (10cmooney) p:05Triage→03Medium a:03cmooney [15:47:47] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM acmechief-test1001.eqiad.wmnet [15:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:52] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM acmechief-test1001.eqiad.wmnet rebooted by vgutierrez@cumin1001 with reason: None [15:48:06] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10ayounsi) [15:51:05] !log restart elasticsearch_6@production-search-psi-eqiad on elastic1049 to resolve issue with full heap [15:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:50] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM acmechief-test1001.eqiad.wmnet [15:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [15:55:52] !log disable puppet on acme-chief clients for acmechief1001 reboot - T294120 [15:55:54] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:55] T294120: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 [15:56:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2009.codfw.wmnet with reason: Decommissioning - hnowlan [15:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2009.codfw.wmnet with reason: Decommissioning - hnowlan [15:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:31] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM acmechief1001.eqiad.wmnet [15:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:37] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM acmechief1001.eqiad.wmnet rebooted by vgutierrez@cumin1001 with reason: None [15:58:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM acmechief1001.eqiad.wmnet [15:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:11] !log re-enable puppet on acme-chief clients after acmechief1001 reboot - T294120 [15:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:19] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10Vgutierrez) [16:02:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P18583 and previous config saved to /var/cache/conftool/dbconfig/20220111-160227-marostegui.json [16:02:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:05] !log begin rolling restart of opensearch in codfw - jvm upgrade [16:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:08] PROBLEM - PHP7 jobrunner on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [16:05:26] RECOVERY - PHP7 jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.784 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [16:06:08] PROBLEM - PHP7 rendering on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:38] RECOVERY - PHP7 rendering on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:02] RECOVERY - PHP7 jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [16:09:34] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) a:03JAllemandou [16:10:10] PROBLEM - PHP7 jobrunner on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [16:10:58] 10SRE, 10SRE-Access-Requests, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10jgleeson) Thanks @cmooney and yeah it makes sense not to give us permissions we don't need... 
[16:12:37] (03CR) 10Jgiannelos: [C: 04-1] maps: add cassandra toggle, disable cassandra on maps hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [16:13:30] 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Create Kerberos login for Brian King (bking) - https://phabricator.wikimedia.org/T298981 (10bking) [16:13:55] 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Create Kerberos login for Brian King (bking) - https://phabricator.wikimedia.org/T298981 (10bking) [16:14:38] (03CR) 10Jgiannelos: [C: 04-1] maps: add cassandra toggle, disable cassandra on maps hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [16:14:48] 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Create Kerberos login for Brian King (bking) - https://phabricator.wikimedia.org/T298981 (10bking) [16:16:44] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:17:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P18584 and previous config saved to /var/cache/conftool/dbconfig/20220111-161732-marostegui.json [16:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:44] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:19:01] (03CR) 10Ottomata: [C: 03+2] Import commons mediainfo json dumps to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/738874 (https://phabricator.wikimedia.org/T258834) (owner: 10Joal) [16:19:42] 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Create Kerberos login for Brian King (bking) - https://phabricator.wikimedia.org/T298981 (10Ottomata) Approved. [16:20:10] 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Create Kerberos login for Brian King (bking) - https://phabricator.wikimedia.org/T298981 (10Gehel) Approved [16:22:34] 10SRE, 10SRE-Access-Requests, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) @cmooney Perfect, I wanted to add exactly that but you already got it :) thanks [16:23:50] !log aborrero@apt1001:~ $ sudo -i reprepro --noskipold --component thirdparty/kubeadm-k8s-1-21 update buster-wikimedia [16:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: raise default to 1.20 [puppet] - 10https://gerrit.wikimedia.org/r/739402 (owner: 10Majavah) [16:25:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: drop k8s 1.19 repos [puppet] - 10https://gerrit.wikimedia.org/r/739403 (owner: 10Majavah) [16:26:09] (03PS3) 10Arturo Borrero Gonzalez: aptrepo: drop k8s 1.19 repos [puppet] - 10https://gerrit.wikimedia.org/r/739403 (owner: 10Majavah) [16:26:40] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] aptrepo: drop k8s 1.19 repos [puppet] - 10https://gerrit.wikimedia.org/r/739403 (owner: 10Majavah) [16:29:20] !log aborrero@apt1001:~ $ sudo -i reprepro clearvanished [16:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:20] 
RECOVERY - PHP7 rendering on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:32:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T297191)', diff saved to https://phabricator.wikimedia.org/P18585 and previous config saved to /var/cache/conftool/dbconfig/20220111-163237-marostegui.json [16:32:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [16:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [16:32:40] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [16:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T297191)', diff saved to https://phabricator.wikimedia.org/P18586 and previous config saved to /var/cache/conftool/dbconfig/20220111-163244-marostegui.json [16:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T297191)', diff saved to https://phabricator.wikimedia.org/P18587 and previous config saved to /var/cache/conftool/dbconfig/20220111-163351-marostegui.json [16:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:50] PROBLEM - PHP7 rendering on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:37:23] (03PS1) 10Jcrespo: 
mediabackup: Backup s1 (enwiki) media files on codfw [puppet] - 10https://gerrit.wikimedia.org/r/753095 (https://phabricator.wikimedia.org/T262668) [16:38:15] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Backup s1 (enwiki) media files on codfw [puppet] - 10https://gerrit.wikimedia.org/r/753095 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:40:32] RECOVERY - PHP7 jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [16:42:12] RECOVERY - PHP7 rendering on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:44:07] (03PS1) 10Ladsgroup: export: Remove ignoring rev_page_id index [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753069 (https://phabricator.wikimedia.org/T163532) [16:44:25] (03CR) 10JHathaway: [C: 03+1] "looks good, one question" [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [16:44:34] jouncebot: nowandnext [16:44:34] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [16:44:35] In 0 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220111T1700) [16:44:46] (03CR) 10Ladsgroup: [C: 03+2] "Catching the train" [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753069 (https://phabricator.wikimedia.org/T163532) (owner: 10Ladsgroup) [16:45:33] 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Create Kerberos login for Brian King (bking) - https://phabricator.wikimedia.org/T298981 (10bking) This is confirmed working, feel free to close this ticket. 
[16:46:44] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10Michael) 05Resolved→03Open Nothing was compromised, but I was stupid when playing around with my password manager and am no longer able to unlock the ssh key added for m... [16:47:00] (03PS7) 10Ayounsi: profile::installserver::proxy: update squid template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [16:47:48] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (10MarkTraceur) Approved! Re: Specific access, this is part of our onboarding checklist. It says: "Create a Phabricator task to request access to the group ldap/wmf for your Gerrit account[.]... [16:48:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P18588 and previous config saved to /var/cache/conftool/dbconfig/20220111-164856-marostegui.json [16:48:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Cmjohnson) a:05Cmjohnson→03elukey @elukey when you have a moment can you look at the partman recipe for this and let me know if it's cor... [16:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:06] 10SRE, 10ops-eqiad: Degraded RAID on dumpsdata1004 - https://phabricator.wikimedia.org/T298582 (10Cmjohnson) @ArielGlenn The part is arriving today, can I do this tomorrow 1530 UTC? [16:54:29] (03CR) 10JHathaway: [C: 03+1] "looks good, one question?" 
[puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: 10Jbond) [16:54:31] (03PS1) 10Jcrespo: mediabackup: Update mediawiki replica for s1 backup on codfw [puppet] - 10https://gerrit.wikimedia.org/r/753099 (https://phabricator.wikimedia.org/T262668) [16:54:50] (03PS2) 10Jcrespo: mediabackup: Update mediawiki replica for s1 backup on codfw [puppet] - 10https://gerrit.wikimedia.org/r/753099 (https://phabricator.wikimedia.org/T262668) [16:54:53] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 Kafka main eqiad servers with buster / firmware update needed - https://phabricator.wikimedia.org/T298867 (10Cmjohnson) @elukey Can we plan to do this tomorrow (12 Jan) starting around 1500UTC? [16:55:17] (03PS1) 10Andrew Bogott: profile::wmcs::nfs::standalone: bind service IP to VM [puppet] - 10https://gerrit.wikimedia.org/r/753100 [16:56:05] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) I've tried a different partman recipe. I do not know what is wrong or why the raid fails. [16:56:17] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfs::standalone: bind service IP to VM [puppet] - 10https://gerrit.wikimedia.org/r/753100 (owner: 10Andrew Bogott) [16:57:08] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Update mediawiki replica for s1 backup on codfw [puppet] - 10https://gerrit.wikimedia.org/r/753099 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:57:21] 10SRE, 10ops-eqiad: Degraded RAID on dumpsdata1004 - https://phabricator.wikimedia.org/T298582 (10ArielGlenn) >>! In T298582#7613348, @Cmjohnson wrote: > @ArielGlenn The part is arriving today, can I do this tomorrow 1530 UTC? Yes please! 
[16:58:26] PROBLEM - PHP7 jobrunner on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [16:59:56] (03PS2) 10Andrew Bogott: profile::wmcs::nfs::standalone: bind service IP to VM [puppet] - 10https://gerrit.wikimedia.org/r/753100 (https://phabricator.wikimedia.org/T291405) [17:00:05] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220111T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:13] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [17:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:34] PROBLEM - PHP7 rendering on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:00:36] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfs::standalone: bind service IP to VM [puppet] - 10https://gerrit.wikimedia.org/r/753100 (https://phabricator.wikimedia.org/T291405) (owner: 10Andrew Bogott) [17:01:52] (03PS3) 10Andrew Bogott: profile::wmcs::nfs::standalone: bind service IP to VM [puppet] - 10https://gerrit.wikimedia.org/r/753100 (https://phabricator.wikimedia.org/T291405) [17:03:10] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [17:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:31] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ncredir1001.eqiad.wmnet [17:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:38] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM ncredir1001.eqiad.wmnet rebooted by vgutierrez@cumin1001 with reason: None [17:03:43] !log bking@cumin1001 START - 
Cookbook sre.wdqs.data-reload [17:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P18589 and previous config saved to /var/cache/conftool/dbconfig/20220111-170400-marostegui.json [17:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:34] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [17:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:46] (03CR) 10jerkins-bot: [V: 04-1] export: Remove ignoring rev_page_id index [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753069 (https://phabricator.wikimedia.org/T163532) (owner: 10Ladsgroup) [17:06:02] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [17:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:06] RECOVERY - PHP7 rendering on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.536 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:06:16] RECOVERY - PHP7 jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [17:06:19] (03CR) 10Ladsgroup: [C: 03+2] "." 
[core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753069 (https://phabricator.wikimedia.org/T163532) (owner: 10Ladsgroup) [17:06:38] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [17:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:14] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@65895c0]: Remove cassandra from kartotherian sources [17:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:26] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir1001.eqiad.wmnet [17:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:55] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 Kafka main eqiad servers with buster / firmware update needed - https://phabricator.wikimedia.org/T298867 (10elukey) @Cmjohnson perfect thanks! [17:08:50] !log vgutierrez@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ncredir1002.eqiad.wmnet [17:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:56] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM ncredir1002.eqiad.wmnet rebooted by vgutierrez@cumin1001 with reason: None [17:10:48] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@65895c0]: Remove cassandra from kartotherian sources (duration: 03m 33s) [17:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:48] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@65895c0]: Remove cassandra from kartotherian sources [17:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:43] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir1002.eqiad.wmnet [17:12:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:51] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10Vgutierrez) [17:13:51] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@65895c0]: Remove cassandra from kartotherian sources (duration: 02m 04s) [17:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:07] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::nfs::standalone: bind service IP to VM [puppet] - 10https://gerrit.wikimedia.org/r/753100 (https://phabricator.wikimedia.org/T291405) (owner: 10Andrew Bogott) [17:15:16] (03CR) 10JMeybohm: [C: 03+1] deployment_server,::helm: remove helm2 support [puppet] - 10https://gerrit.wikimedia.org/r/753026 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [17:17:32] (03CR) 10JHathaway: P:installserver::proxy: Add domain whitelist to proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [17:19:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T297191)', diff saved to https://phabricator.wikimedia.org/P18590 and previous config saved to /var/cache/conftool/dbconfig/20220111-171905-marostegui.json [17:19:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:19:09] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [17:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T297191)', diff saved to https://phabricator.wikimedia.org/P18591 and previous config saved to /var/cache/conftool/dbconfig/20220111-171912-marostegui.json [17:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T297191)', diff saved to https://phabricator.wikimedia.org/P18592 and previous config saved to /var/cache/conftool/dbconfig/20220111-172019-marostegui.json [17:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:25] (03PS1) 10Jbond: bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) [17:22:02] (03CR) 10jerkins-bot: [V: 04-1] bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [17:24:26] PROBLEM - PHP7 jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [17:25:11] (03Merged) 10jenkins-bot: export: Remove ignoring rev_page_id index [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753069 (https://phabricator.wikimedia.org/T163532) (owner: 10Ladsgroup) [17:28:34] RECOVERY - PHP7 jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [17:28:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:08] PROBLEM - PHP7 jobrunner on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [17:29:32] PROBLEM - 
PHP7 rendering on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:31:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:31:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:36] PROBLEM - PHP7 rendering on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:33:04] PROBLEM - PHP7 jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [17:34:04] (03PS3) 10Giuseppe Lavagetto: shellbox: rationalize version handling, promote to 1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/753020 [17:34:06] (03PS1) 10Giuseppe Lavagetto: Rakefile: when copying over helmfile directories, resolve symlinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/753104 [17:35:18] RECOVERY - PHP7 jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.630 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [17:35:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P18593 and previous config saved to /var/cache/conftool/dbconfig/20220111-173524-marostegui.json [17:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:04] RECOVERY - PHP7 rendering on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK 
- 326 bytes in 8.610 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:36:06] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10Dzahn) a:05ssingh→03None [17:43:58] PROBLEM - PHP7 jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [17:44:18] RECOVERY - PHP7 jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.507 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [17:44:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2009.codfw.wmnet [17:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:36] RECOVERY - PHP7 rendering on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:46:04] RECOVERY - PHP7 jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.012 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [17:47:55] (03CR) 10Btullis: [V: 03+1 C: 03+2] Exclude log4j_extras from the classpath for coordinators [puppet] - 10https://gerrit.wikimedia.org/r/752673 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [17:48:46] PROBLEM - PHP7 jobrunner on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [17:50:06] PROBLEM - PHP7 rendering on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:50:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P18594 and previous config saved to /var/cache/conftool/dbconfig/20220111-175029-marostegui.json [17:50:31] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:14] PROBLEM - PHP7 jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [17:59:20] RECOVERY - PHP7 jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [18:00:04] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220111T1800). [18:01:57] (03PS1) 10Elukey: install_server: fix netboot settings for an-test-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/753110 (https://phabricator.wikimedia.org/T293938) [18:01:58] RECOVERY - PHP7 jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [18:03:22] RECOVERY - PHP7 rendering on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:03:24] (03CR) 10Elukey: [C: 03+2] install_server: fix netboot settings for an-test-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/753110 (https://phabricator.wikimedia.org/T293938) (owner: 10Elukey) [18:05:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T297191)', diff saved to https://phabricator.wikimedia.org/P18595 and previous config saved to /var/cache/conftool/dbconfig/20220111-180534-marostegui.json [18:05:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [18:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:38] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [18:05:38] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [18:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T297191)', diff saved to https://phabricator.wikimedia.org/P18596 and previous config saved to /var/cache/conftool/dbconfig/20220111-180547-marostegui.json [18:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T297191)', diff saved to https://phabricator.wikimedia.org/P18597 and previous config saved to /var/cache/conftool/dbconfig/20220111-180653-marostegui.json [18:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:46] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS buster [18:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:56] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host 
an-test-coord1002.eqiad.wmnet with O... [18:11:54] 10SRE, 10LDAP-Access-Requests: Grant Access to cn=wmf and cn=ops for Nmaphophe - https://phabricator.wikimedia.org/T298868 (10cmooney) @CMacholan can you confirm and approve this request if appropriate? thanks. [18:12:31] (03PS1) 10Jgiannelos: Disable triggering tile pregeneration on OSM syncs [puppet] - 10https://gerrit.wikimedia.org/r/753111 [18:13:33] (03PS2) 10Jgiannelos: Disable triggering tile pregeneration on OSM syncs [puppet] - 10https://gerrit.wikimedia.org/r/753111 (https://phabricator.wikimedia.org/T298246) [18:14:41] 10SRE, 10LDAP-Access-Requests: Grant Access to cn=wmf and cn=ops for Nmaphophe - https://phabricator.wikimedia.org/T298868 (10CMacholan) @cmooney approved [18:18:16] (03PS2) 10Giuseppe Lavagetto: Rakefile: when copying over helmfile directories, resolve symlinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/753104 [18:18:18] (03PS4) 10Giuseppe Lavagetto: shellbox: rationalize version handling, promote to 1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/753020 [18:21:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P18598 and previous config saved to /var/cache/conftool/dbconfig/20220111-182158-marostegui.json [18:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rakefile: when copying over helmfile directories, resolve symlinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/753104 (owner: 10Giuseppe Lavagetto) [18:28:38] (03Merged) 10jenkins-bot: Rakefile: when copying over helmfile directories, resolve symlinks [deployment-charts] - 10https://gerrit.wikimedia.org/r/753104 (owner: 10Giuseppe Lavagetto) [18:28:59] <_joe_> !log uploaded scap 4.1.1-1 to apt T298986 [18:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:03] T298986: Deploy Scap version 4.1.1 - 
https://phabricator.wikimedia.org/T298986 [18:29:54] 10SRE, 10ops-eqiad: msw-a8-eqiad potentially down - https://phabricator.wikimedia.org/T298869 (10RobH) >>! In T298869#7610301, @Cmjohnson wrote: > The mgmt switch power led was amber, tried pulling the power and plugging back in but no change. We had a spare wmf4921, racked it, and moved all the mgmt cables.... [18:29:56] * dancy touches his fingers together and says "eeeexcellent" like Mr Burns. [18:32:30] 10SRE, 10ops-eqiad: msw-a8-eqiad potentially down - https://phabricator.wikimedia.org/T298869 (10Cmjohnson) We had a spare on-site thankfully but we should probably purchase a new one just in case or save a couple of the older models for emergencies. [18:34:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-coord1002.eqiad.wmnet with OS buster [18:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:35] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster completed: - an-t... 
[18:34:41] 10SRE, 10Analytics-Legal: Options for creating internal (NDA-requiring) dashboards based on data from Google and Big search consoles - https://phabricator.wikimedia.org/T298991 (10AndyRussG) [18:34:48] <_joe_> dancy: I will deploy it to the mwdebug servers for now, and try a scap pool [18:34:58] 👍🏾 [18:37:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P18599 and previous config saved to /var/cache/conftool/dbconfig/20220111-183703-marostegui.json [18:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:27] (03CR) 10JMeybohm: [C: 04-1] shellbox: rationalize version handling, promote to 1.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/753020 (owner: 10Giuseppe Lavagetto) [18:39:41] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10elukey) @Cmjohnson an-test-coord1002 done, there was an issue with your partman patch (it was targeting an-test-worker1002 instead of an-test-coord1002), bu... [18:41:04] <_joe_> !log installed scap 4.1.1 on mwdebug1002 T298986, ran scap pull successfully [18:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:07] T298986: Deploy Scap version 4.1.1 - https://phabricator.wikimedia.org/T298986 [18:41:21] <_joe_> !log also ran apt-get autoremove on mwdebug1002 [18:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:17] thx _joe_ [18:46:03] Stepping out for a bit. 
[18:46:31] (03PS5) 10Giuseppe Lavagetto: shellbox: rationalize version handling, promote to 1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/753020 [18:46:54] (03CR) 10Giuseppe Lavagetto: shellbox: rationalize version handling, promote to 1.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/753020 (owner: 10Giuseppe Lavagetto) [18:49:06] 10SRE, 10serviceops, 10Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (10Dzahn) 05Resolved→03Open reopening While we have purged all the font packages that were specifically in the config, these had also pulled in more font packag... [18:50:11] 10SRE, 10serviceops, 10Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (10Dzahn) example: a font that was in our list, fonts-alee is properly gone: https://debmonitor.wikimedia.org/packages/fonts-alee (except on thumbor, expected) a f... [18:51:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:13] (03CR) 10JMeybohm: [C: 03+1] shellbox: rationalize version handling, promote to 1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/753020 (owner: 10Giuseppe Lavagetto) [18:52:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:52:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T297191)', diff saved to https://phabricator.wikimedia.org/P18600 and previous config saved to /var/cache/conftool/dbconfig/20220111-185208-marostegui.json [18:52:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:09] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [18:52:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [18:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:13] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [18:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T297191)', diff saved to https://phabricator.wikimedia.org/P18601 and previous config saved to /var/cache/conftool/dbconfig/20220111-185215-marostegui.json [18:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:14] (03PS2) 10Jbond: bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) [18:53:16] (03PS1) 10Jbond: hieradata - cloud: add config for prefies [puppet] - 10https://gerrit.wikimedia.org/r/753117 [18:53:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T297191)', diff saved to https://phabricator.wikimedia.org/P18602 and previous config saved to /var/cache/conftool/dbconfig/20220111-185322-marostegui.json [18:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:52] (03CR) 10jerkins-bot: [V: 04-1] bgpalerter: add new class to configure bgpalerter [puppet] - 
10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [18:55:27] (03PS3) 10Jbond: bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) [18:56:04] (03CR) 10jerkins-bot: [V: 04-1] bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [18:56:13] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (10RobH) [18:57:30] !log clear wcqs.jnl and aliases.map for all wcqs instances T296470 [18:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:34] T296470: Initialize WCQS production servers - https://phabricator.wikimedia.org/T296470 [18:58:38] !log sukhe@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM doh1001.wikimedia.org [18:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:47] 10SRE, 10Analytics-Legal: Options for creating internal (NDA-requiring) dashboards based on data from Google and Big search consoles - https://phabricator.wikimedia.org/T298991 (10RhinosF1) #Analytics-Legal says "Public project for the Analytics and Techops team for reviewing incoming requests from WMF-Legal.... 
[19:00:02] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220111T1900) [19:00:57] !log sukhe@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh1001.wikimedia.org [19:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:53] !log sukhe@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM doh1002.wikimedia.org [19:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:00] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM doh1002.wikimedia.org rebooted by sukhe@cumin1001 with reason: rebooting for T294120 [19:02:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:04:09] (03CR) 10Herron: [C: 03+1] site: reprovision eqiad logging cluster to opensearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/752756 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [19:04:27] 10SRE, 10WMF-Legal: Options for creating internal (NDA-requiring) dashboards based on data from Google and Big search consoles - https://phabricator.wikimedia.org/T298991 (10AndyRussG) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that p... 
[19:04:43] !log dduvall@deploy1002 Pruned MediaWiki: 1.38.0-wmf.9 (duration: 15m 51s) [19:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:02] !log sukhe@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh1002.wikimedia.org [19:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:08] (03CR) 10Herron: [C: 03+1] hiera: add opensearch production configuration (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/752755 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [19:05:34] !log sukhe@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM durum1001.eqiad.wmnet [19:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:41] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM durum1001.eqiad.wmnet rebooted by sukhe@cumin1001 with reason: rebooting for T294120 [19:06:06] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:06:06] 10SRE: Options for creating internal (NDA-requiring) dashboards based on data from Google and Big search consoles - https://phabricator.wikimedia.org/T298991 (10AndyRussG) [19:08:12] 10SRE: Options for creating internal (NDA-requiring) dashboards based on data from Google and Bing search consoles - https://phabricator.wikimedia.org/T298991 (10AndyRussG) [19:08:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P18603 and previous config saved to /var/cache/conftool/dbconfig/20220111-190827-marostegui.json [19:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:15] 10SRE: Options for creating internal (NDA-requiring) dashboards based on data from 
Google and Bing search consoles - https://phabricator.wikimedia.org/T298991 (10AndyRussG) Thanks much @RhinosF1! I'll reach out directly to Legal about this as specified. [19:12:09] (03PS4) 10Jbond: bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) [19:12:47] (03CR) 10jerkins-bot: [V: 04-1] bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [19:13:22] !log sukhe@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum1001.eqiad.wmnet [19:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:36] !log sukhe@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM durum1002.eqiad.wmnet [19:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:44] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM durum1002.eqiad.wmnet rebooted by sukhe@cumin1001 with reason: rebooting for T294120 [19:15:14] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:15:36] (03PS5) 10Jbond: bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) [19:16:47] (03CR) 10jerkins-bot: [V: 04-1] bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [19:17:05] !log sukhe@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum1002.eqiad.wmnet [19:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:58] (03PS1) 10Urbanecm: 
wgGEMentorDashboardDeploymentMode should be alpha in all of beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753119 (https://phabricator.wikimedia.org/T298993) [19:20:16] (03PS1) 10Dduvall: testwikis wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753120 [19:20:18] (03CR) 10Dduvall: [C: 03+2] testwikis wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753120 (owner: 10Dduvall) [19:20:22] (03PS6) 10Jbond: bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) [19:21:40] (03CR) 10jerkins-bot: [V: 04-1] bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [19:21:42] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753120 (owner: 10Dduvall) [19:21:46] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.17 refs T293958 [19:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:49] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [19:22:59] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ssingh) [19:23:01] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) [19:23:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P18604 and previous 
config saved to /var/cache/conftool/dbconfig/20220111-192331-marostegui.json [19:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:44] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ssingh) `doh1001` was also restarted but I forgot to add the `-t` switch and that's why you ops-bot didn't catch it :) Updated the hosts. [19:24:24] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): DX App Synthetic Monitoring App - watchmouse alert flapping due to CA expiration - https://phabricator.wikimedia.org/T292603 (10herron) [19:24:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:24:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:03] (03PS7) 10Jbond: bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) [19:27:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:39] (03CR) 10Jbond: [C: 03+2] "merging this is currently not used but will test in cloud" [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [19:29:06] (03PS8) 10Jbond: bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) [19:30:10] !log upload pdns-recursor_4.6.0-1wm1 to apt.wm.o (buster) - T252132 [19:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:30:16] T252132: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 [19:30:17] (03CR) 10Jbond: [C: 03+2] bgpalerter: add new class to configure bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [19:30:29] (03CR) 10Jbond: [C: 03+2] hieradata - cloud: add config for prefies [puppet] - 10https://gerrit.wikimedia.org/r/753117 (owner: 10Jbond) [19:30:46] (03PS2) 10Jbond: hieradata - cloud: add config for prefies [puppet] - 10https://gerrit.wikimedia.org/r/753117 [19:34:49] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10LDAP-Access-Requests: Create Kerberos login for Brian King (bking) - https://phabricator.wikimedia.org/T298981 (10cmooney) 05Open→03Resolved a:03cmooney Ok no problem if there is anything not working just drop me a line on irc :) [19:35:20] (03PS1) 10Jbond: hieradata: fix profix_options hash [puppet] - 10https://gerrit.wikimedia.org/r/753121 [19:36:22] (03CR) 10Jbond: [C: 03+2] hieradata: fix profix_options hash [puppet] - 10https://gerrit.wikimedia.org/r/753121 (owner: 10Jbond) [19:38:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T297191)', diff saved to https://phabricator.wikimedia.org/P18605 and previous config saved to /var/cache/conftool/dbconfig/20220111-193836-marostegui.json [19:38:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [19:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [19:38:40] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [19:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T297191)', diff saved to https://phabricator.wikimedia.org/P18606 and previous config saved to /var/cache/conftool/dbconfig/20220111-193844-marostegui.json [19:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:07] (03PS1) 10Ssingh: dnsrecursor: remove redundant setting delegation-only [puppet] - 10https://gerrit.wikimedia.org/r/753122 [19:39:49] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33198/console" [puppet] - 10https://gerrit.wikimedia.org/r/753122 (owner: 10Ssingh) [19:39:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T297191)', diff saved to https://phabricator.wikimedia.org/P18607 and previous config saved to /var/cache/conftool/dbconfig/20220111-193951-marostegui.json [19:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:54] (03PS1) 10Jbond: bgpalerter: fix type definition [puppet] - 10https://gerrit.wikimedia.org/r/753123 [19:40:47] (03CR) 10Jbond: [C: 03+2] bgpalerter: fix type definition [puppet] - 10https://gerrit.wikimedia.org/r/753123 (owner: 10Jbond) [19:41:38] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsrecursor: remove redundant setting delegation-only [puppet] - 10https://gerrit.wikimedia.org/r/753122 (owner: 10Ssingh) [19:43:13] (03PS1) 10Jbond: hieradata: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/753124 [19:44:03] (03CR) 10Jbond: [C: 03+2] hieradata: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/753124 (owner: 10Jbond) [19:51:13] (03PS1) 10Jbond: bgpalerter: fix prefixes content [puppet] - 10https://gerrit.wikimedia.org/r/753125 [19:52:21] (03CR) 10Jbond: [C: 03+2] bgpalerter: fix prefixes content [puppet] - 10https://gerrit.wikimedia.org/r/753125 (owner: 10Jbond) 
[19:53:49] !log cwhite@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash1023.eqiad.wmnet [19:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P18608 and previous config saved to /var/cache/conftool/dbconfig/20220111-195456-marostegui.json [19:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:26] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:59:29] ^^ is me. kibana7 in eqiad is not the active DC [19:59:31] !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash1023.eqiad.wmnet [19:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] dduvall and twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220111T2000). [20:00:40] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:01:24] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.17 refs T293958 (duration: 39m 38s) [20:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:28] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [20:03:00] taavi: temp global rights are fully shipped with wmf.17? 
[20:06:23] hauskatze: yes, but behind a config flag [20:06:37] I expect they'll be enabled like next Monday [20:06:58] taavi: awesome [20:08:24] !log cwhite@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash1024.eqiad.wmnet [20:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:30] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM logstash1024.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None [20:09:53] !log cwhite@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash1025.eqiad.wmnet [20:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P18609 and previous config saved to /var/cache/conftool/dbconfig/20220111-201000-marostegui.json [20:10:01] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM logstash1025.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None [20:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:10] (03PS4) 10Ssingh: dnsrecursor: add support for DoT to auth servers [puppet] - 10https://gerrit.wikimedia.org/r/752706 [20:11:49] (03PS1) 10Jbond: bgpalerter: use absolute path for prefixes and log directory [puppet] - 10https://gerrit.wikimedia.org/r/753128 [20:12:23] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33200/console" [puppet] - 10https://gerrit.wikimedia.org/r/752706 (owner: 10Ssingh) [20:12:56] (03CR) 10Jbond: [C: 03+2] bgpalerter: use absolute path for prefixes and log directory [puppet] - 10https://gerrit.wikimedia.org/r/753128 (owner: 
10Jbond) [20:16:36] (03CR) 10Ssingh: [V: 03+1] "Merging since there is no change for the internal recursor configuration and PCC looks OK. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/752706 (owner: 10Ssingh) [20:16:42] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsrecursor: add support for DoT to auth servers [puppet] - 10https://gerrit.wikimedia.org/r/752706 (owner: 10Ssingh) [20:17:37] !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash1025.eqiad.wmnet [20:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:45] !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash1024.eqiad.wmnet [20:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:58] PROBLEM - Check systemd state on logstash1025 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:06] PROBLEM - Check systemd state on logstash1024 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:59] (03PS2) 10Ssingh: O:wikidough: enable DoT to auth servers [puppet] - 10https://gerrit.wikimedia.org/r/752726 [20:20:05] (03PS11) 10Dzahn: phabricator: move vcs firewall rules to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) [20:20:14] (03CR) 10Dzahn: phabricator: move vcs firewall rules to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [20:21:45] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33201/console" [puppet] - 10https://gerrit.wikimedia.org/r/752726 (owner: 10Ssingh) [20:22:14] (03CR) 10Ssingh: [V: 03+1 C: 03+2] O:wikidough: enable DoT to auth 
servers [puppet] - 10https://gerrit.wikimedia.org/r/752726 (owner: 10Ssingh) [20:22:20] RECOVERY - Check systemd state on logstash1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:28] RECOVERY - Check systemd state on logstash1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:08] !log cwhite@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash1030.eqiad.wmnet [20:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:12] !log cwhite@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash1031.eqiad.wmnet [20:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:14] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM logstash1030.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None [20:23:19] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM logstash1031.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None [20:25:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T297191)', diff saved to https://phabricator.wikimedia.org/P18610 and previous config saved to /var/cache/conftool/dbconfig/20220111-202505-marostegui.json [20:25:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [20:25:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [20:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:10] T297191: Schema change for dropping 
page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [20:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T297191)', diff saved to https://phabricator.wikimedia.org/P18611 and previous config saved to /var/cache/conftool/dbconfig/20220111-202513-marostegui.json [20:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:40] (03PS1) 10Jbond: hieradata - clod: add pullapi endpoint [puppet] - 10https://gerrit.wikimedia.org/r/753132 [20:26:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T297191)', diff saved to https://phabricator.wikimedia.org/P18612 and previous config saved to /var/cache/conftool/dbconfig/20220111-202620-marostegui.json [20:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:49] !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash1030.eqiad.wmnet [20:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:52] !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash1031.eqiad.wmnet [20:26:53] (03CR) 10Jbond: [C: 03+2] hieradata - clod: add pullapi endpoint [puppet] - 10https://gerrit.wikimedia.org/r/753132 (owner: 10Jbond) [20:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:31] !log cwhite@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash1032.eqiad.wmnet [20:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:37] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM logstash1032.eqiad.wmnet 
rebooted by cwhite@cumin1001 with reason: None [20:27:52] !log cwhite@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash1007.eqiad.wmnet [20:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:59] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM logstash1007.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None [20:31:03] (03PS1) 10Jbond: bgpalerter - hierdata: use standard port [puppet] - 10https://gerrit.wikimedia.org/r/753135 [20:31:12] !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash1032.eqiad.wmnet [20:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:52] !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash1007.eqiad.wmnet [20:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:08] !log cwhite@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash1008.eqiad.wmnet [20:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:14] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM logstash1008.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None [20:32:28] (03CR) 10Jbond: [C: 03+2] bgpalerter - hierdata: use standard port [puppet] - 10https://gerrit.wikimedia.org/r/753135 (owner: 10Jbond) [20:34:14] 10SRE, 10ops-eqiad: msw-a8-eqiad potentially down - https://phabricator.wikimedia.org/T298869 (10wiki_willy) For sure, agreed @Cmjohnson. Once the new Netgear switches arrive for the expansion cage in April, we can hold onto some of the temp msw's we're currently using as future spares. @ayounsi - are we goo... 
[20:35:07] (03PS1) 10Dduvall: group0 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753138 [20:35:09] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753138 (owner: 10Dduvall) [20:36:03] !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash1008.eqiad.wmnet [20:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:47] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753138 (owner: 10Dduvall) [20:38:30] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.17 refs T293958 [20:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:33] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [20:38:47] !log cwhite@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM logstash1009.eqiad.wmnet [20:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:53] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10ops-monitoring-bot) VM logstash1009.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None [20:41:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P18613 and previous config saved to /var/cache/conftool/dbconfig/20220111-204124-marostegui.json [20:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:42] !log cwhite@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM logstash1009.eqiad.wmnet [20:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:48] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:43:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:35] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10colewhite) [20:45:06] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10RobH) [20:45:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:29] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10RobH) [20:49:59] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1001/33202/" [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [20:50:19] (03CR) 10Dzahn: [C: 03+1] phabricator: move vcs firewall rules to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [20:54:22] (03PS1) 10Jbond: hieradata: add ASN name comments [puppet] - 10https://gerrit.wikimedia.org/r/753147 [20:56:00] !log mw1418 (lowest numbered canary appserver that we use for httpbb hourly tests on cumin1001) - apt-get autoremove - removed font* and python3* 
packages - reason: T294378 [20:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:03] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [20:56:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P18614 and previous config saved to /var/cache/conftool/dbconfig/20220111-205629-marostegui.json [20:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:29] 10SRE, 10serviceops, 10Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (10Dzahn) Doing the `apt-get autoremove` and accepting what it suggests also removes python packages in addition to font packages. When running puppet afterwards th... [21:04:40] (03CR) 10Jbond: "This just adds some comments to the as list generated bu bgpalerter. Im guessing its so big and random due to the route servers." 
[puppet] - 10https://gerrit.wikimedia.org/r/753147 (owner: 10Jbond) [21:05:35] (03PS2) 10Jdlrobson: Skip vector-2022 skin in config, not Vector skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) [21:05:45] (03PS3) 10Jdlrobson: Skip vector-2022 skin in config, not inside Vector skin codebase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) [21:11:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T297191)', diff saved to https://phabricator.wikimedia.org/P18615 and previous config saved to /var/cache/conftool/dbconfig/20220111-211134-marostegui.json [21:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:38] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [21:15:31] (03CR) 10Jbond: hieradata: add ASN name comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753147 (owner: 10Jbond) [21:18:45] 10SRE, 10ops-eqiad: msw-a8-eqiad potentially down - https://phabricator.wikimedia.org/T298869 (10Cmjohnson) 05Open→03Resolved netbox updated with msw1 connection changed the broken name to msw-a8-eqiad-broken and placed as failed for the time being. 
[21:20:23] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) 05Open→03Resolved Thanks @elukey resolving the task [21:29:50] !log mw1418 - apt-get remove --purge fonts*; apt-get remove --purge xfonts*; running puppet - nothing gets reinstalled and with --purge it means 'dpkg -l | grep fonts' is actually empty, not full of "rc" still - T294378 [21:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:54] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [22:13:48] 10SRE, 10ops-eqiad, 10DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (10wiki_willy) a:03Jclark-ctr [22:15:33] dduvall: T298999 probably merits a rollback [22:15:34] T298999: [regression-wmf.16] testwiki - cannot publish an edit - https://phabricator.wikimedia.org/T298999 [22:15:48] (can repro outside testwiki) [22:16:02] tgr: will do. thanks for the report [22:16:21] dduvall: tgr: oops, i think we have a fix already [22:16:41] i'll have to review the risky changes to make sure there isn't something blocking rollback [22:16:45] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/753113 [22:16:50] MatmaRex: ok. what's the eta? [22:17:03] but we haven't checked if the wmf branch was affected, oops [22:17:07] ah, merged [22:17:42] dduvall: yeah, just need to backport [22:18:02] was caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/734332 [22:18:39] (03PS1) 10Bartosz Dziewoński: Watchlist API update: Call correct method [extensions/VisualEditor] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753071 [22:18:49] dduvall: backport is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/753071 [22:19:02] ah, you beat me to it :) [22:19:08] can you merge/deploy it? 
sorry about the problem [22:19:24] sure [22:19:44] jouncebot: now [22:19:44] No deployments scheduled for the next 1 hour(s) and 40 minute(s) [22:20:30] (03PS2) 10Bartosz Dziewoński: Watchlist API update: Call correct method [extensions/VisualEditor] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753071 (https://phabricator.wikimedia.org/T298999) [22:22:11] * urbanecm waves and is around to help if needed [22:38:08] (03CR) 10Dduvall: [C: 03+2] Watchlist API update: Call correct method [extensions/VisualEditor] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753071 (https://phabricator.wikimedia.org/T298999) (owner: 10Bartosz Dziewoński) [22:39:47] * dduvall waves to urbanecm in appreciation [22:40:09] no problem :). Ping if you need me. [22:40:21] will do. just waiting on jenkins to deploy [22:40:33] waiting on jenkins *before* i deploy [22:40:53] in case i confuse someone that we suddenly have continuous deployment :) [22:41:12] well, we sort of have :) [22:41:18] at master branch and beta [22:41:32] that is true [22:48:13] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) The output of 'run-puppet-agent' : https://phabricator.wikimedia.org/P18581 [22:56:03] (03Merged) 10jenkins-bot: Watchlist API update: Call correct method [extensions/VisualEditor] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753071 (https://phabricator.wikimedia.org/T298999) (owner: 10Bartosz Dziewoński) [23:03:12] (03PS1) 10Samwilson: Enable Disambiguator notifications for French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753175 (https://phabricator.wikimedia.org/T293319) [23:04:14] !log syncing backport to fix VE regression that followed testwiki/group0 deployment (cc T293958) [23:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:18] T293958: 1.38.0-wmf.17 deployment blockers - 
https://phabricator.wikimedia.org/T293958 [23:05:39] !log dduvall@deploy1002 Synchronized php-1.38.0-wmf.17/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.js: Backport: [[gerrit:753071|Watchlist API update: Call correct method (T298999)]] (duration: 02m 40s) [23:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:42] T298999: [regression-wmf.17] testwiki - cannot publish an edit - https://phabricator.wikimedia.org/T298999 [23:06:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [23:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:36] tgr or urbanecm, the fix is deployed. would you mind verifying? [23:06:46] certainly [23:07:14] i confirm i can edit at testwiki [23:07:31] yay. thank you :) [23:12:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [23:12:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [23:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [23:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:16] 10SRE, 10Data-Engineering, 10Research-Backlog, 10WMF-Legal, 10User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10odimitrijevic) [23:24:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [23:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:32] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:28:59] (03CR) 10Cwhite: [C: 03+2] hiera: add opensearch production configuration (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/752755 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [23:30:14] (03CR) 10Cwhite: [C: 03+2] role: add apifeatureusage role [puppet] - 10https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [23:30:22] (03PS11) 10Cwhite: role: add apifeatureusage role [puppet] - 10https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) [23:30:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [23:30:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [23:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [23:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:58] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [23:38:21] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) 
- https://phabricator.wikimedia.org/T245183 (10Krinkle) [23:46:47] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy) [23:48:58] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [23:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:20] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Papaul) [23:52:05] 10SRE, 10ops-eqiad, 10DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (10Jclark-ctr) Relocated msw2-eqiad and completed cross connect in new and old cage Please confirm link before I close ticket @ayounsi. All Pdu's and Switches are connected to msw2-eqiad but netbox is not... [23:52:54] 10SRE, 10serviceops, 10User-Ladsgroup, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Krinkle) 05Open→03Resolved a:03Ladsgroup The immediate issue appears resolved, as ev... [23:55:48] (03PS1) 10Clare Ming: Add new vector skin key to RelatedArticlesFooterAllowedSkins. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753187 (https://phabricator.wikimedia.org/T298916) [23:56:12] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [23:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:53] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) From {T297316} ` from /srv/mediawiki/php-1.38.0-wmf.9/includes/libs/objectcache/MemcachedPeclBagOStuff.php(341) #0 /srv/mediawiki/php-1... [23:59:58] (03CR) 10Jdlrobson: [C: 03+1] Add new vector skin key to RelatedArticlesFooterAllowedSkins. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/753187 (https://phabricator.wikimedia.org/T298916) (owner: 10Clare Ming)