[00:00:07] RoanKattouw and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T0000).
[00:00:07] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:16] o/
[00:00:31] (PS4) Cwhite: role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239)
[00:05:16] (PS1) Legoktm: Delete unused role::mediawiki::irc_events [puppet] - https://gerrit.wikimedia.org/r/747639 (https://phabricator.wikimedia.org/T272559)
[00:05:36] thcipriani: is today the day we struggle to find deployers for the backport window?
[00:05:51] !log published new versions of php7.{2,4}-fpm-multiversion-base image with php-yaml extension (T296331)
[00:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:56] T296331: Install php-yaml for use by SettingsLoader - https://phabricator.wikimedia.org/T296331
[00:10:21] (CR) Dzahn: miscweb: Set up static_tendril microsite (2 comments) [puppet] - https://gerrit.wikimedia.org/r/747600 (https://phabricator.wikimedia.org/T297605) (owner: Ladsgroup)
[00:12:18] (PS1) Cwhite: add and enable subset filters [software/ecs] - https://gerrit.wikimedia.org/r/747641 (https://phabricator.wikimedia.org/T294581)
[00:12:42] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33027/miscweb1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/747600 (https://phabricator.wikimedia.org/T297605) (owner: Ladsgroup)
[00:14:03] urbanecm: are you still around?
[00:14:21] (CR) Dzahn: "I have some comments / follow-up but I'll just merge this and then create another change."
[puppet] - https://gerrit.wikimedia.org/r/747600 (https://phabricator.wikimedia.org/T297605) (owner: Ladsgroup)
[00:14:27] Jdlrobson: well, yes, but no.
[00:14:33] tired enough not to trust myself to deploy
[00:14:39] gotcha
[00:14:41] sorry!
[00:14:51] RoanKattouw: around or should I pick another day?
[00:16:18] !log miscweb1002 - disable puppet, deploying gerrit:747600 on miscweb2002 first, indeed puppet problem detected T297605
[00:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:23] T297605: Shutdown Tendril and dbtree - https://phabricator.wikimedia.org/T297605
[00:17:40] (CR) Dzahn: "Error: Could not set 'file' on ensure: No such file or directory - A directory component in /srv/org/wikimedia/static-tendril/index.html20" [puppet] - https://gerrit.wikimedia.org/r/747600 (https://phabricator.wikimedia.org/T297605) (owner: Ladsgroup)
[00:17:52] (PS1) Ahmon Dancy: scap.cfg: Enable rsync_cdbs in beta [puppet] - https://gerrit.wikimedia.org/r/747643 (https://phabricator.wikimedia.org/T297326)
[00:18:26] Jdlrobson: I'm on the east coast so it's late, sorry
[00:18:33] np. I'll reschedule for tomorrow.
[00:19:27] !log uploaded 7.2.34-18+0~20210223.60+debian10~1.gbpb21322+wmf5 to buster-wikimedia for T297667
[00:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:32] T297667: mysqli/mysqlnd memory leak - https://phabricator.wikimedia.org/T297667
[00:19:52] (CR) Dzahn: [C: +2] scap.cfg: Enable rsync_cdbs in beta [puppet] - https://gerrit.wikimedia.org/r/747643 (https://phabricator.wikimedia.org/T297326) (owner: Ahmon Dancy)
[00:22:16] (CR) Ahmon Dancy: "I was in the middle of writing a note when I saw that the change was already merged.
Anyway, I tested beforehand directly on deployment-d" [puppet] - https://gerrit.wikimedia.org/r/747643 (https://phabricator.wikimedia.org/T297326) (owner: Ahmon Dancy)
[00:22:55] !log upgraded php7.2 on mw1414 for mysqlnd memory leak fix part 2 (T297667)
[00:22:56] (CR) Dzahn: scap.cfg: Enable rsync_cdbs in beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747643 (https://phabricator.wikimedia.org/T297326) (owner: Ahmon Dancy)
[00:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:32] Stepping away for 20-30 mins.
[00:28:18] (PS1) Dzahn: static_tendril: use wmflib::dir::mkdir_p to ensure docroot exists [puppet] - https://gerrit.wikimedia.org/r/747646
[00:28:31] (PS1) Dzahn: static_tendril: use wmflib::dir::mkdir_p to ensure docroot exists [puppet] - https://gerrit.wikimedia.org/r/747647 (https://phabricator.wikimedia.org/T297605)
[00:29:16] (Abandoned) Dzahn: static_tendril: use wmflib::dir::mkdir_p to ensure docroot exists [puppet] - https://gerrit.wikimedia.org/r/747646 (owner: Dzahn)
[00:30:29] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33028/miscweb2002.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/747647 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[00:30:31] (CR) jerkins-bot: [V: -1] static_tendril: use wmflib::dir::mkdir_p to ensure docroot exists [puppet] - https://gerrit.wikimedia.org/r/747647 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[00:30:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:32] (PS2) Dzahn: static_tendril: use wmflib::dir::mkdir_p to ensure docroot exists [puppet] - https://gerrit.wikimedia.org/r/747647 (https://phabricator.wikimedia.org/T297605)
[00:31:44] SRE, Infrastructure-Foundations, Mail, Znuny: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (HMarcus) Hi all - fundraising@wikimedia.org is a delegated inbox in our domain. Meaning it acts like a normal user account, but we have granted delegate access to...
[00:35:19] SRE, Infrastructure-Foundations, Mail, Znuny: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) Thank you for the detailed and quick response @HMarcus I'll leave the personal access part to fundraising but I can confirm that where it says James Alexand...
[00:36:03] SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn)
[00:36:27] SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) added fundraising-tech-ops
[00:37:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:33] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:19] (CR) Dzahn: "Info: Applying configuration version '(70a2db0a43) Dzahn - static_tendril: use wmflib::dir::mkdir_p to ensure docroot exists'" [puppet] - https://gerrit.wikimedia.org/r/747647 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[00:43:35] (CR) Dzahn: "Info: Applying configuration version '(70a2db0a43) Dzahn - static_tendril: use wmflib::dir::mkdir_p to ensure docroot exists'" [puppet] - https://gerrit.wikimedia.org/r/747600 (https://phabricator.wikimedia.org/T297605) (owner: Ladsgroup)
[00:48:46] (PS1) Dzahn: static-tendril: replace page title, remove Security Team reference [puppet] - https://gerrit.wikimedia.org/r/747650 (https://phabricator.wikimedia.org/T297605)
[00:50:35] (PS2) Dzahn: static-tendril: replace page title, remove Security Team reference [puppet] - https://gerrit.wikimedia.org/r/747650 (https://phabricator.wikimedia.org/T297605)
[00:51:01] (CR) Dzahn: [C: +2] static-tendril: replace page title, remove Security Team reference [puppet] - https://gerrit.wikimedia.org/r/747650 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[00:51:32] (CR) Dzahn: [V: +2 C: +2] static-tendril: replace page title, remove Security Team reference [puppet] - https://gerrit.wikimedia.org/r/747650 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[00:51:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:54:53] (CR) Dzahn: "follow-up to I8f61053bfe16fb" [puppet] - https://gerrit.wikimedia.org/r/747650 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[00:55:27] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:56:13] (PS1) Dzahn: httpbb: add test for static-tendril to test_miscweb [puppet] - https://gerrit.wikimedia.org/r/747653 (https://phabricator.wikimedia.org/T297605)
[00:58:17] (CR) Dzahn: [C: +2] httpbb: add test for static-tendril to test_miscweb [puppet] - https://gerrit.wikimedia.org/r/747653 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[00:59:28] (CR) Dzahn: ""+ assert_body_contains: tendril and dbtree" should work after I adjusted the page title in Ib8730dba72ff1b" [puppet] - https://gerrit.wikimedia.org/r/747653 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[01:00:04] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T0100).
[01:03:24] !log removing current dump from static-codereview to replace it with a new one
[01:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:48] (CR) Dzahn: "2 observations here. first is I can't use httpbb from deployment_server anymore due to firewalling. this seems new.
it used to work from e" [puppet] - https://gerrit.wikimedia.org/r/747653 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[01:05:45] (CR) Dzahn: "Body: expected to contain 'wikiworkshop.org/2021', got '
PROBLEM - Check systemd state on ms-be2065 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:09:51] (CR) Dzahn: "After fixing wikiworkshop.org test we get the actual issue with static-tendril. It's missing on the TLS cert. But we terminate TLS at envo" [puppet] - https://gerrit.wikimedia.org/r/747653 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[01:10:40] (PS1) Dzahn: Revert "httpbb: add test for static-tendril to test_miscweb" [puppet] - https://gerrit.wikimedia.org/r/747485
[01:13:37] (CR) Dzahn: [C: +2] Revert "httpbb: add test for static-tendril to test_miscweb" [puppet] - https://gerrit.wikimedia.org/r/747485 (owner: Dzahn)
[01:19:43] (PS1) Dzahn: httpbb: miscweb: fix tests for wikiworkshop.org, update 2021 to 2022 [puppet] - https://gerrit.wikimedia.org/r/747658 (https://phabricator.wikimedia.org/T297605)
[01:21:38] (CR) Dzahn: [C: +2] httpbb: miscweb: fix tests for wikiworkshop.org, update 2021 to 2022 [puppet] - https://gerrit.wikimedia.org/r/747658 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[01:21:51] SRE, MediaWiki-extensions-CodeReview, Platform Engineering, serviceops-radar, Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (Legoktm) OK, I copied over the new dump to miscweb, the issues in T205361#6...
[01:23:23] (CR) Dzahn: "[cumin1001:~] $ httpbb /srv/deployment/httpbb-tests/miscweb/test_miscweb.yaml --hosts miscweb1002.eqiad.wmnet" [puppet] - https://gerrit.wikimedia.org/r/747658 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[01:36:19] (PS1) Dzahn: miscweb/static_tendril: add dbtree.wikimedia.org as ServerAlias [puppet] - https://gerrit.wikimedia.org/r/747662 (https://phabricator.wikimedia.org/T297605)
[01:38:44] (PS2) Dzahn: miscweb/static_tendril: add dbtree.wikimedia.org as ServerAlias [puppet] - https://gerrit.wikimedia.org/r/747662 (https://phabricator.wikimedia.org/T297605)
[01:39:30] (CR) Dzahn: "I'll continue on this tomorrow, this is more fyi." [puppet] - https://gerrit.wikimedia.org/r/747662 (https://phabricator.wikimedia.org/T297605) (owner: Dzahn)
[01:50:43] !log miscweb1002 - re-enabling puppet after deployment for T297605
[01:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:49] T297605: Shutdown Tendril and dbtree - https://phabricator.wikimedia.org/T297605
[01:52:37] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.35% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[01:53:04] !log miscweb1002 / miscweb2002 - both backends 'PASS: 26 requests sent to miscweb1002.eqiad.wmnet. All assertions passed.'
again after fixing httpbb tests and T297605
[01:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:35] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:02:41] RECOVERY - Check systemd state on ms-be2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:49] (PS1) Dzahn: miscweb/static-tendril: have separate apache access and error log [puppet] - https://gerrit.wikimedia.org/r/747665 (https://phabricator.wikimedia.org/T297605)
[02:21:25] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:32:15] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:45:35] (PS1) Andrew Bogott: Revert "Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004" [puppet] - https://gerrit.wikimedia.org/r/747667 (https://phabricator.wikimedia.org/T297814)
[02:46:22] (PS1) Andrew Bogott: Revert "make cloudmetrics1004 the primary cloudmetrics endpoint" [dns] - https://gerrit.wikimedia.org/r/747668 (https://phabricator.wikimedia.org/T297814)
[02:47:34] (CR) Andrew Bogott: [C: +2] Revert "make cloudmetrics1004 the primary cloudmetrics endpoint" [dns] - https://gerrit.wikimedia.org/r/747668 (https://phabricator.wikimedia.org/T297814) (owner: Andrew Bogott)
[02:47:42] (CR) Andrew Bogott: [C: +2] Revert "Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004" [puppet] - https://gerrit.wikimedia.org/r/747667 (https://phabricator.wikimedia.org/T297814) (owner: Andrew Bogott)
[03:27:15] !log Stopped rebuildItemsPerSite on mwmaint1002 (was slightly beyond item Q72056756), as it has a memory leak
(and would OOM in a few days)
[03:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:53:01] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:06:13] PROBLEM - Check systemd state on ms-be2065 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:49] (PS1) Gergő Tisza: Enable WelcomeSurvey Interaction schema [mediawiki-config] - https://gerrit.wikimedia.org/r/747677 (https://phabricator.wikimedia.org/T267273)
[04:38:44] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/747687 (owner: Juan90264)
[05:34:47] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:07:59] (PS1) Marostegui: report_users: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747681 (https://phabricator.wikimedia.org/T297618)
[06:08:32] (CR) Marostegui: "This was tested already" [software] - https://gerrit.wikimedia.org/r/747681 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[06:08:34] (CR) Marostegui: [C: +2] report_users: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747681 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[06:09:05] (Merged) jenkins-bot: report_users: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747681 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[06:16:15] (PS1) Marostegui: section: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747682 (https://phabricator.wikimedia.org/T297618)
[06:16:45] (CR) Marostegui: "This has been tested" [software] -
https://gerrit.wikimedia.org/r/747682 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[06:16:57] (CR) Marostegui: [C: +2] section: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747682 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[06:17:31] (Merged) jenkins-bot: section: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747682 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[06:20:19] (PS1) RLazarus: Use the Kubernetes config API as it was in v7.0.0 (buster) [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/747683 (https://phabricator.wikimedia.org/T287130)
[06:23:43] (PS1) Marostegui: check_flags_per_dc.sh: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747684 (https://phabricator.wikimedia.org/T297618)
[06:24:02] (CR) Marostegui: "This has been tested" [software] - https://gerrit.wikimedia.org/r/747684 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[06:24:27] (CR) Marostegui: [C: +2] check_flags_per_dc.sh: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747684 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[06:24:39] (PS2) RLazarus: Use the Kubernetes config API as it was in v7.0.0 (buster) [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/747683 (https://phabricator.wikimedia.org/T287130)
[06:24:57] (Merged) jenkins-bot: check_flags_per_dc.sh: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747684 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[06:26:09] (PS3) RLazarus: Use the Kubernetes config API as it was in v7.0.0 (buster) [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/747683 (https://phabricator.wikimedia.org/T287130)
[06:36:52] (PS1) RLazarus: imagecatalog: Pass cluster names along with config paths [puppet] -
https://gerrit.wikimedia.org/r/747685 (https://phabricator.wikimedia.org/T287130)
[06:38:38] (CR) jerkins-bot: [V: -1] imagecatalog: Pass cluster names along with config paths [puppet] - https://gerrit.wikimedia.org/r/747685 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[06:39:49] (PS1) RLazarus: imagecatalog: Pass cluster names along with config paths [puppet] - https://gerrit.wikimedia.org/r/747706 (https://phabricator.wikimedia.org/T287130)
[06:41:28] (CR) jerkins-bot: [V: -1] imagecatalog: Pass cluster names along with config paths [puppet] - https://gerrit.wikimedia.org/r/747706 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[06:43:03] (Abandoned) RLazarus: imagecatalog: Pass cluster names along with config paths [puppet] - https://gerrit.wikimedia.org/r/747706 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[06:43:49] (PS2) RLazarus: imagecatalog: Pass cluster names along with config paths [puppet] - https://gerrit.wikimedia.org/r/747685 (https://phabricator.wikimedia.org/T287130)
[06:45:30] (CR) jerkins-bot: [V: -1] imagecatalog: Pass cluster names along with config paths [puppet] - https://gerrit.wikimedia.org/r/747685 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[06:48:25] SRE, SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (Arrbee) This is an approved request for Mary.
[06:50:17] (PS3) RLazarus: imagecatalog: Pass cluster names along with config paths [puppet] - https://gerrit.wikimedia.org/r/747685 (https://phabricator.wikimedia.org/T287130)
[06:52:09] * elukey waves to rzl :)
[06:54:06] elukey: good morning!
I'm off to bed
[06:55:24] rzl: I imagined, have a good night :)
[07:02:19] RECOVERY - Check systemd state on ms-be2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:06:17] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation={get,listWithCount} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:08:29] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[07:18:39] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:49:34] (PS1) Marostegui: master-pos: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747790 (https://phabricator.wikimedia.org/T297618)
[07:49:49] (CR) Marostegui: "This has been tested" [software] - https://gerrit.wikimedia.org/r/747790 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[07:50:10] (CR) Marostegui: [C: +2] master-pos: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747790 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[07:50:39] (Merged) jenkins-bot: master-pos: Replace mysql.py with db-mysql [software] - https://gerrit.wikimedia.org/r/747790 (https://phabricator.wikimedia.org/T297618) (owner: Marostegui)
[07:51:17] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:53:19] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:03:37] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:05:10] jouncebot: nowandnext
[08:05:11] No deployments scheduled for the next 2 hour(s) and 54 minute(s)
[08:05:11] In 2 hour(s) and 54 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1100)
[08:05:28] (PS2) Urbanecm: MentorPageMentorManager: Do not fail hard with no mentor list configured [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747484 (https://phabricator.wikimedia.org/T297827)
[08:05:34] (CR) Urbanecm: [C: +2] "UBN" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747484 (https://phabricator.wikimedia.org/T297827) (owner: Urbanecm)
[08:07:38] !log restart blazegraph on wdqs1013 (jvm stuck for 4hours)
[08:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:47] PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[08:10:49] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[08:13:25] (CR) Filippo Giunchedi: [C: +1] logstash: add optional document_type parameter to es output config [puppet] - https://gerrit.wikimedia.org/r/747634 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[08:17:32] (PS3) Hashar: Provide current $PATH to the verify script [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/692995 (owner: Ppchelko)
[08:17:53] (CR) Hashar: [C: +1] "I have adjusted a few things in the commit message :)" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/692995 (owner: Ppchelko)
[08:19:43] (CR) jerkins-bot: [V: -1] Provide current $PATH to the verify
script [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/692995 (owner: Ppchelko)
[08:20:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2024.codfw.wmnet with OS buster
[08:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:11] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS buster
[08:20:35] SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (akosiaris) The problem described by the task (that is forwards from VRTS to donate@ failing) has been resolved. There was a configuration...
[08:23:15] (CR) Filippo Giunchedi: "LGTM overall!" [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[08:24:31] (CR) Hashar: [C: +1] "CI fails due to mypy which is addressed by https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/747104" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/692995 (owner: Ppchelko)
[08:24:41] (CR) Filippo Giunchedi: [C: +1] add and enable subset filters [software/ecs] - https://gerrit.wikimedia.org/r/747641 (https://phabricator.wikimedia.org/T294581) (owner: Cwhite)
[08:25:05] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:29:19] (Merged) jenkins-bot: MentorPageMentorManager: Do not fail hard with no mentor list configured [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747484 (https://phabricator.wikimedia.org/T297827) (owner: Urbanecm)
[08:29:34] finally
[08:31:56] !log
urbanecm@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/GrowthExperiments/: 35c055cead3d240625b76d21aa4e685525ca0d4b: MentorPageMentorManager: Do not fail hard with no mentor list configured (T297827) (duration: 01m 09s)
[08:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:01] T297827: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T297827
[08:32:02] * urbanecm done, for now
[08:32:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:13] (CR) JMeybohm: Use the Kubernetes config API as it was in v7.0.0 (buster) (1 comment) [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/747683 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[08:34:14] (CR) JMeybohm: [C: -1] imagecatalog: Pass cluster names along with config paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747685 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[08:39:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:18] !log switch ml-etcd2003 to DRBD-based storage to allow migration for reimages
[08:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:50] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (MoritzMuehlenhoff)
[08:48:51] (PS1) KartikMistry: Set ContentTranslationContentImportForSectionTranslation for SX [mediawiki-config] - https://gerrit.wikimedia.org/r/747794 (https://phabricator.wikimedia.org/T294642)
[08:49:59] (CR) JMeybohm: [C: +1] Rakefile/rake_modules: remove unused function helm_version() and cleanup [deployment-charts] - https://gerrit.wikimedia.org/r/747487 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[08:53:29] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (jcrespo)
[08:55:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:55:59] !log drain primary/secondary instances off ganeti2015 T296622
[08:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:05] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622
[08:57:49] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:04:07] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:09:54] (CR) Kosta Harlan: [C: +1] "Oops. Sorry I missed that."
[mediawiki-config] - https://gerrit.wikimedia.org/r/747677 (https://phabricator.wikimedia.org/T267273) (owner: Gergő Tisza)
[09:11:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2024.codfw.wmnet with OS buster
[09:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:21] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS buster completed: - ganeti2024 (**PASS**) - Downtimed on Icinga...
[09:14:13] (CR) Kosta Harlan: [C: +1] "I've scheduled it for the UTC morning backport window." [mediawiki-config] - https://gerrit.wikimedia.org/r/747677 (https://phabricator.wikimedia.org/T267273) (owner: Gergő Tisza)
[09:26:09] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:32:02] (CR) Kormat: [C: -1] auto_schema: Move away from mysql.py (1 comment) [software] - https://gerrit.wikimedia.org/r/747624 (https://phabricator.wikimedia.org/T288235) (owner: Ladsgroup)
[09:38:44] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson)
[09:38:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2015.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
[09:38:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2015.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
[09:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:39] SRE,
10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [09:40:52] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2015. Ready to be powered off any time. [09:46:18] !log added ganeti2028 to ganeti codfw cluster T294139 [09:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:24] T294139: Q2:(Need By: TBD) rack/setup/install ganeti202[78].codfw.wmnet - https://phabricator.wikimedia.org/T294139 [09:50:37] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage [09:50:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2004.codfw.wmnet with reason: switch to drbd storage [09:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:39] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add mini-textfile-exporter [puppet] - 10https://gerrit.wikimedia.org/r/747139 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:55:45] (03PS4) 10Filippo Giunchedi: prometheus: add mini-textfile-exporter [puppet] - 10https://gerrit.wikimedia.org/r/747139 (https://phabricator.wikimedia.org/T291946) [09:57:41] (03PS1) 10Marostegui: Revert "section: Replace mysql.py with db-mysql" [software] - 10https://gerrit.wikimedia.org/r/747689 [09:59:12] (03CR) 10Marostegui: [C: 03+2] Revert "section: Replace mysql.py with db-mysql" [software] - 10https://gerrit.wikimedia.org/r/747689 (owner: 10Marostegui) [10:00:49] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: export service catalog metrics [puppet] - 10https://gerrit.wikimedia.org/r/747140 
(https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:00:56] (03PS4) 10Filippo Giunchedi: prometheus: export service catalog metrics [puppet] - 10https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) [10:02:01] (03PS1) 10Marostegui: section: Replace mysql.py with db-mysql [software] - 10https://gerrit.wikimedia.org/r/747799 (https://phabricator.wikimedia.org/T297618) [10:03:22] (03CR) 10Kormat: "LGTM, with one minor (optional) comment." [software] - 10https://gerrit.wikimedia.org/r/747799 (https://phabricator.wikimedia.org/T297618) (owner: 10Marostegui) [10:04:04] (03PS2) 10Marostegui: section: Replace mysql.py with db-mysql [software] - 10https://gerrit.wikimedia.org/r/747799 (https://phabricator.wikimedia.org/T297618) [10:04:15] (03PS2) 10Filippo Giunchedi: prometheus: consider non-discovery case when sending SNI in blackbox [puppet] - 10https://gerrit.wikimedia.org/r/747493 (https://phabricator.wikimedia.org/T291946) [10:04:24] (03CR) 10Kormat: [C: 03+1] section: Replace mysql.py with db-mysql [software] - 10https://gerrit.wikimedia.org/r/747799 (https://phabricator.wikimedia.org/T297618) (owner: 10Marostegui) [10:04:41] !log switched kubetcd2004 to DRBD-based storage to allow migration for reimages [10:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:55] (03CR) 10Marostegui: [C: 03+2] section: Replace mysql.py with db-mysql [software] - 10https://gerrit.wikimedia.org/r/747799 (https://phabricator.wikimedia.org/T297618) (owner: 10Marostegui) [10:05:20] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [10:06:02] (03Merged) 10jenkins-bot: section: Replace mysql.py with db-mysql [software] - 10https://gerrit.wikimedia.org/r/747799 (https://phabricator.wikimedia.org/T297618) (owner: 10Marostegui) [10:09:27] !log drain primary/secondary instances off ganeti2007 T296622 [10:09:31] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:32] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [10:10:07] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: consider non-discovery case when sending SNI in blackbox [puppet] - 10https://gerrit.wikimedia.org/r/747493 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:10:49] (03PS1) 10Kormat: master-pos: Simplify db-mysql params. [software] - 10https://gerrit.wikimedia.org/r/747803 (https://phabricator.wikimedia.org/T297618) [10:12:36] (03CR) 10Kormat: [C: 03+2] master-pos: Simplify db-mysql params. [software] - 10https://gerrit.wikimedia.org/r/747803 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [10:14:42] (03PS1) 10Kormat: host-to-instance: Switch to db-mysql. [software] - 10https://gerrit.wikimedia.org/r/747804 (https://phabricator.wikimedia.org/T297618) [10:15:15] (03PS1) 10Filippo Giunchedi: hieradata: add more network probes for internal services [puppet] - 10https://gerrit.wikimedia.org/r/747805 (https://phabricator.wikimedia.org/T291946) [10:15:38] (03CR) 10Kormat: [C: 03+2] "Testing works:" [software] - 10https://gerrit.wikimedia.org/r/747804 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [10:17:51] PROBLEM - Host urldownloader2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:56] (03PS1) 10Lucas Werkmeister (WMDE): Filter out non-string keys/values from query string before using [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747690 (https://phabricator.wikimedia.org/T297828) [10:19:43] (03PS1) 10Kormat: dbtools/sys/apply: Switch to db-mysql. [software] - 10https://gerrit.wikimedia.org/r/747807 (https://phabricator.wikimedia.org/T297618) [10:20:18] (03CR) 10Kormat: [C: 03+2] dbtools/sys/apply: Switch to db-mysql. 
[software] - 10https://gerrit.wikimedia.org/r/747807 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [10:23:16] (03CR) 10Jbond: [C: 03+1] "LGTM <3" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [10:24:23] (03PS2) 10Filippo Giunchedi: hieradata: add more network probes for internal services [puppet] - 10https://gerrit.wikimedia.org/r/747805 (https://phabricator.wikimedia.org/T291946) [10:24:53] (03PS1) 10Kormat: switchover-tmpl.sh: Switch to db-mysql [software] - 10https://gerrit.wikimedia.org/r/747808 (https://phabricator.wikimedia.org/T297618) [10:27:05] (03CR) 10Kormat: [C: 03+2] switchover-tmpl.sh: Switch to db-mysql [software] - 10https://gerrit.wikimedia.org/r/747808 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [10:28:41] !log second attempt to reimage kafka-main2003 to buster [10:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:26] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main2003.codfw.wmnet with OS buster [10:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:24] 10SRE, 10Data-Services, 10Discovery-Search, 10Wikidata, and 3 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10Volans) 05Resolved→03Open I'm re-opening this as a follow up from a chat in [[ https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/74562... 
[10:33:29] (03CR) 10Jbond: [C: 03+1] "LGTM but please also collect one from Tyler to make sure they are happy" [puppet] - 10https://gerrit.wikimedia.org/r/747463 (owner: 10MVernon) [10:41:27] 10SRE, 10Data-Services, 10Discovery-Search, 10Wikidata, and 3 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10ArielGlenn) Note that the checksum files for those dumps are available for download as well, since they are provided along with the main dump ou... [10:42:24] (03PS1) 10Kormat: check-master-heartbeat.sh: Switch to db-mysql [software] - 10https://gerrit.wikimedia.org/r/747810 (https://phabricator.wikimedia.org/T297618) [10:45:19] (03CR) 10Jbond: sre.hosts.provision: add new cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [10:45:35] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [10:46:12] (03CR) 10Kormat: [C: 03+2] "Tested on cumin1001." 
[software] - 10https://gerrit.wikimedia.org/r/747810 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [10:46:54] (03CR) 10Volans: sre.hosts.provision: add new cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [10:47:20] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:49:15] (03CR) 10Jbond: [C: 03+1] "thx LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [10:50:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [10:51:17] (03CR) 10Volans: [C: 03+2] spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [10:55:24] (03PS1) 10Kormat: change_mw_mysql_pass: Switch to db-mysql [software] - 10https://gerrit.wikimedia.org/r/747811 (https://phabricator.wikimedia.org/T297618) [10:55:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [10:57:11] (03Merged) 10jenkins-bot: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [10:58:26] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[10:59:10] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 269 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [10:59:28] !log pushed new packages for druid version 0.19.0-2 on buster using reprepro [10:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:46] kafka-main2002 is expected, I missed to downtime it, no problem (2003 under reimage) [11:00:04] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1100). [11:00:20] RECOVERY - Host urldownloader2001 is UP: PING OK - Packet loss = 0%, RTA = 31.86 ms [11:02:22] PROBLEM - Check systemd state on urldownloader2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:22] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [11:04:21] (03PS1) 10Jelto: helmfile.d/admin_ng: change ci deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/747814 (https://phabricator.wikimedia.org/T297809) [11:07:27] (03PS2) 10Ladsgroup: auto_schema: Move away from mysql.py [software] - 10https://gerrit.wikimedia.org/r/747624 (https://phabricator.wikimedia.org/T288235) [11:07:29] (03PS3) 10Ladsgroup: auto_schema: Add a timeout for depooling + downtime replicas for longer [software] - 10https://gerrit.wikimedia.org/r/747627 (https://phabricator.wikimedia.org/T288235) [11:08:08] 
(03CR) 10Ladsgroup: auto_schema: Move away from mysql.py (031 comment) [software] - 10https://gerrit.wikimedia.org/r/747624 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [11:08:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2003.codfw.wmnet with OS buster [11:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:45] (03CR) 10Kormat: [C: 03+1] "LGTM" [software] - 10https://gerrit.wikimedia.org/r/747624 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [11:12:10] (03CR) 10Ladsgroup: [C: 03+1] Delete unused role::mediawiki::irc_events [puppet] - 10https://gerrit.wikimedia.org/r/747639 (https://phabricator.wikimedia.org/T272559) (owner: 10Legoktm) [11:14:02] PROBLEM - Check whether ferm is active by checking the default input chain on urldownloader2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:14:44] (03PS11) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [11:14:57] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [11:15:21] (03CR) 10Lucas Werkmeister (WMDE): wdqs: switch GUI deployment from latest to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [11:15:23] (03CR) 10Kormat: [C: 03+1] mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [11:17:20] 10SRE, 10ops-codfw, 10serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - 
https://phabricator.wikimedia.org/T297422 (10elukey) 05Resolved→03Open @Papaul the upgrade worked, I reimaged kafka-main2003 this morning! I'd need to upgrade kafka-mai... [11:18:15] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [11:18:39] (03CR) 10Marostegui: [C: 03+1] change_mw_mysql_pass: Switch to db-mysql [software] - 10https://gerrit.wikimedia.org/r/747811 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [11:18:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "As it is, this patch would break the wikimedia clusters, so we can't accept it as-is." [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [11:18:55] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Move away from mysql.py [software] - 10https://gerrit.wikimedia.org/r/747624 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [11:19:28] (03Merged) 10jenkins-bot: auto_schema: Move away from mysql.py [software] - 10https://gerrit.wikimedia.org/r/747624 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [11:20:07] (03CR) 10Kormat: [C: 03+2] change_mw_mysql_pass: Switch to db-mysql [software] - 10https://gerrit.wikimedia.org/r/747811 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [11:20:53] (03CR) 10Marostegui: "We should probably log this on the log, with something like:" [software] - 10https://gerrit.wikimedia.org/r/747627 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [11:21:53] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10LSobanski) I see codfw is configured with 4 different racks so I don't see why we wouldn't do the same thing here. cc @fgiunchedi in case there's something... 
[11:23:24] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:25:28] RECOVERY - Check systemd state on urldownloader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:41] (03CR) 10JMeybohm: Kubernetes 1.22 support, update chart version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [11:25:53] (03PS12) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [11:27:03] (03PS1) 10Kormat: depool-and-wait: Switch to db-mysql [software] - 10https://gerrit.wikimedia.org/r/747817 (https://phabricator.wikimedia.org/T297618) [11:27:50] (03CR) 10Ladsgroup: auto_schema: Add a timeout for depooling + downtime replicas for longer (031 comment) [software] - 10https://gerrit.wikimedia.org/r/747627 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [11:28:05] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [11:29:25] (03CR) 10Marostegui: [C: 03+1] auto_schema: Add a timeout for depooling + downtime replicas for longer [software] - 10https://gerrit.wikimedia.org/r/747627 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [11:29:55] (03CR) 10Jbond: [C: 03+1] P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [11:30:10] (03CR) 10Jbond: [C: 03+2] pcc: replace compiler1001 with pcc-worker1003 [puppet] - 10https://gerrit.wikimedia.org/r/746893 (https://phabricator.wikimedia.org/T297356) (owner: 10David Caro) [11:30:51] (03CR) 10Kormat: [C: 03+2] 
depool-and-wait: Switch to db-mysql [software] - 10https://gerrit.wikimedia.org/r/747817 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [11:31:32] (03PS1) 10Jelto: helmfile.d/admin_ng: fix subjects of rolebinding in namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/747819 (https://phabricator.wikimedia.org/T251305) [11:38:04] (03CR) 10JMeybohm: [C: 03+1] helmfile.d/admin_ng: fix subjects of rolebinding in namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/747819 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [11:41:52] (03PS11) 10Arturo Borrero Gonzalez: toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [11:43:09] (03PS1) 10Muehlenhoff: Make ganeti2026 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/747822 [11:44:00] (03CR) 10Volans: sre.hosts.provision: add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [11:44:40] (03CR) 10jerkins-bot: [V: 04-1] toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [11:45:06] RECOVERY - Check whether ferm is active by checking the default input chain on urldownloader2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:49:24] (03PS1) 10Ladsgroup: mariadb: Move grant files from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/747823 [11:50:43] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Move grant files from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/747823 (owner: 10Ladsgroup) [11:53:54] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.0111 ge 0.01 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:54:42] (03PS1) 10Muehlenhoff: Update Hiera setting for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/747824 [11:55:18] (03PS1) 10Ladsgroup: mariadb: Fix path to another file [puppet] - 10https://gerrit.wikimedia.org/r/747825 [11:56:12] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Fix path to another file [puppet] - 10https://gerrit.wikimedia.org/r/747825 (owner: 10Ladsgroup) [12:00:05] Amir1, Lucas_WMDE, and apergos: Time to snap out of that daydream and deploy UTC morning backport and config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1200). [12:00:05] kostajh: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:12] I'm here [12:00:16] o/ [12:00:26] (in a meeting rn but could deploy in ~15mins) [12:00:32] "Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL" <- That's me [12:00:33] (03CR) 10Muehlenhoff: [C: 03+2] Update Hiera setting for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/747824 (owner: 10Muehlenhoff) [12:00:37] fix is rolling out [12:00:44] I am here [12:00:50] there are no trainees for this window [12:00:59] there is one patch only in the window [12:01:07] I can deploy my patch [12:01:22] I cannot assess its impact in any reasonable way but at least it is a one file - one line change :-D [12:01:22] feel free to (unless Amir1 objects) [12:01:38] self-deployment is fine but let's wait for Amir's thing to go around first [12:01:41] no issue on my side [12:01:56] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/747826 [12:02:06] actually how long will your one take to merge, kostajh? 
if it's a bit, might as well start on it now [12:02:19] apergos: it's a config change, so just a few minutes [12:02:36] Amir1: let us know when everything's back to normal [12:03:13] cumin says 6 minutes but don't wait for it [12:03:17] it's not a big deal [12:03:49] well the merge can go in now and we can wait the 6 minutes in case it merges before yours is done [12:03:51] no biggie [12:04:00] kostajh: ^ [12:04:14] ok, thx [12:04:39] (03PS12) 10Arturo Borrero Gonzalez: toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [12:05:19] (03CR) 10Kosta Harlan: [C: 03+2] "Backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747677 (https://phabricator.wikimedia.org/T267273) (owner: 10Gergő Tisza) [12:06:03] (03Merged) 10jenkins-bot: Enable WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747677 (https://phabricator.wikimedia.org/T267273) (owner: 10Gergő Tisza) [12:08:23] (03PS4) 10Jbond: reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 [12:09:00] (03CR) 10jerkins-bot: [V: 04-1] reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [12:09:34] (testing on mwdebug1002 now) [12:10:01] awesome [12:10:13] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Add a timeout for depooling + downtime replicas for longer [software] - 10https://gerrit.wikimedia.org/r/747627 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [12:10:42] (03Merged) 10jenkins-bot: auto_schema: Add a timeout for depooling + downtime replicas for longer [software] - 10https://gerrit.wikimedia.org/r/747627 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [12:10:53] 10SRE, 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - 
https://phabricator.wikimedia.org/T297842 (10MatthewVernon) [12:11:22] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [12:11:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [12:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:48] How does it look? [12:12:17] good [12:12:19] syncing [12:12:28] or should I wait? [12:12:44] ^ Amir1 apergos [12:12:52] nah, go ahead [12:12:55] lol [12:12:56] it's mostly clean now [12:13:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:41] 10SRE, 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10MatthewVernon) As well as `L3` signature, this request will also need approval from one of the approvers for the analytics-privatedata-users LDAP group - @odimitri... [12:14:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:33] !log kharlan@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:747677|Enable WelcomeSurvey Interaction schema (T267273 T297858)]] (duration: 01m 07s) [12:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:39] T297858: Event submitted for unregistered stream name "mediawiki.welcomesurvey.interaction". 
- https://phabricator.wikimedia.org/T297858 [12:14:39] T267273: [arwiki] Submitting a POST on a form redirected to immediately after account creation sometimes logs user out - https://phabricator.wikimedia.org/T267273 [12:14:48] \o/ [12:15:12] all done :) [12:16:47] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/747826 (owner: 10Volans) [12:17:33] I like how cumin's 5 minutes is basically the way we use "5 minutes" in Iran. It took 20 minutes. [12:17:40] :-D [12:17:58] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003331 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:18:15] here we go [12:20:05] (03PS1) 10Volans: Upstream release v1.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/747832 [12:20:42] Amir1: lol, but cumin doesn't have a predicted time to completion... from where did you get it? :D [12:21:55] | 100% (220/220) [21:55<00:00, 5.52s/hosts] [12:24:37] welp, given that the graphs look reasonable and that was the only patch, I guess that's the end of today's UTC morning backport window [12:28:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Kubernetes 1.22 support, update chart version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [12:31:02] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 28 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [12:33:14] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [12:33:56] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:10] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:25] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:59] (03CR) 10Elukey: [C: 03+1] Rakefile/rake_modules: remove unused function helm_version() and cleanup [deployment-charts] - 10https://gerrit.wikimedia.org/r/747487 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [12:45:37] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:01] (03CR) 10Elukey: "Due to the broad scope of this code change, I'll probably deploy it (with observability's supervision) after the holidays :)" [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [12:47:26] (03CR) 10Volans: [C: 03+2] Upstream release v1.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/747832 (owner: 10Volans) [12:49:17] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10fgiunchedi) yes +1 to spread around rows as much as we can [12:53:12] (03Merged) 10jenkins-bot: Upstream release v1.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/747832 (owner: 10Volans) [12:56:00] (03PS1) 10Filippo Giunchedi: prometheus: extend blackbox probes options [puppet] - 10https://gerrit.wikimedia.org/r/747835 (https://phabricator.wikimedia.org/T291946) [12:56:02] (03PS1) 10Filippo Giunchedi: hieradata: add zotero and helm-charts 
probes [puppet] - 10https://gerrit.wikimedia.org/r/747836 (https://phabricator.wikimedia.org/T291946) [12:58:49] (03PS5) 10Jbond: reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 [12:59:25] (03CR) 10jerkins-bot: [V: 04-1] reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [13:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1300) [13:08:42] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:10:50] (03CR) 10Btullis: [C: 03+2] Add a single node from the aqs_next cluster to the pool [puppet] - 10https://gerrit.wikimedia.org/r/747540 (https://phabricator.wikimedia.org/T297803) (owner: 10Btullis) [13:12:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [13:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:46] (03CR) 10Jelto: [C: 03+2] helmfile.d/admin_ng: fix subjects of rolebinding in namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/747819 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:17:14] (03PS1) 10Giuseppe Lavagetto: mediawiki: inject x-client-ip from envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/747838 (https://phabricator.wikimedia.org/T297613) [13:17:33] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: inject x-client-ip from envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/747838 (https://phabricator.wikimedia.org/T297613) (owner: 10Giuseppe Lavagetto) [13:17:51] !log uploaded spicerack_1.1.0 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [13:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host ganeti2024.codfw.wmnet [13:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:27] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@e36c241] (codfw): (no justification provided) [13:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:20] (03PS2) 10Giuseppe Lavagetto: mediawiki: inject x-client-ip from envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/747838 (https://phabricator.wikimedia.org/T297613) [13:24:40] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@e36c241] (codfw): (no justification provided) (duration: 03m 12s) [13:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:22] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@e36c241] (eqiad): Change osm-intl and osm source to get MVT from Tegola (Full production for Tegola) [13:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:00] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@e36c241] (eqiad): Change osm-intl and osm source to get MVT from Tegola (Full production for Tegola) (duration: 01m 39s) [13:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:57] (03PS2) 10Muehlenhoff: Failover idp CNAME to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/745850 [13:32:12] (03CR) 10Jelto: [C: 03+2] Rakefile/rake_modules: remove unused function helm_version() and cleanup [deployment-charts] - 10https://gerrit.wikimedia.org/r/747487 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:33:52] (03PS1) 10Ladsgroup: beta: Set wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747840 (https://phabricator.wikimedia.org/T297708) [13:33:57] !log upgraded spicerack to v1.1.0 on cumin[1001,2001] [13:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:49] (03CR) 10Ladsgroup: 
[C: 03+2] beta: Set wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747840 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [13:35:29] (03Merged) 10jenkins-bot: beta: Set wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747840 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [13:36:14] (03Merged) 10jenkins-bot: Rakefile/rake_modules: remove unused function helm_version() and cleanup [deployment-charts] - 10https://gerrit.wikimedia.org/r/747487 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:36:17] rebased [13:36:25] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [13:38:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:33] (03PS1) 10Ladsgroup: microsites: Fix typo in description [puppet] - 10https://gerrit.wikimedia.org/r/747842 (https://phabricator.wikimedia.org/T297605) [13:41:26] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] microsites: Fix typo in description [puppet] - 10https://gerrit.wikimedia.org/r/747842 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [13:43:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:23] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (10krobinson) Thanks @akosiaris - this part is indeed solved! I also agree that it makes sense to hand these all over to ITS, if that is... 
[13:46:22] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp CNAME to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/745850 (owner: 10Muehlenhoff) [13:47:19] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [13:48:28] (03CR) 10Michael Große: wdqs: switch GUI deployment from latest to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [13:48:36] (03CR) 10Jbond: "LGTM left some comments but mostly around style, nothing blocking" [puppet] - 10https://gerrit.wikimedia.org/r/747835 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:49:54] (03Merged) 10jenkins-bot: sre.hosts.provision: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [13:50:07] jouncebot: nowandnext [13:50:07] For the next 0 hour(s) and 9 minute(s): Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1300) [13:50:07] In 0 hour(s) and 9 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1400) [13:51:26] hashar: dancy: I intend to stash at a debug srv to verify T297827 is resolved. Does that sound like a good idea to do now? [13:51:26] T297827: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T297827 [13:54:37] (03CR) 10David Caro: [C: 03+1] "LGTM" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [13:59:55] If train is blocked and it's not happening right now, I have a big set of backports I'm planning to do, let me know [14:00:04] hashar and dancy: That opportune time is upon us again. Time for a MediaWiki train - Utc-0+Utc-7 Version deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1400). [14:00:09] (03PS6) 10Jbond: reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 [14:01:53] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:02:17] 10SRE, 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10Ottomata) Approved! [14:02:37] (03CR) 10jerkins-bot: [V: 04-1] reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [14:02:46] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configuration: relocate hosts updater function [puppet] - 10https://gerrit.wikimedia.org/r/747849 [14:02:49] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configuration: cache openstack query [puppet] - 10https://gerrit.wikimedia.org/r/747850 [14:02:51] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configuratir: introduce some code to detect dead config [puppet] - 10https://gerrit.wikimedia.org/r/747851 [14:04:05] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid_configuration: relocate hosts updater function [puppet] - 10https://gerrit.wikimedia.org/r/747849 (owner: 10Arturo Borrero Gonzalez) [14:04:28] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid_configuration: cache openstack query [puppet] - 10https://gerrit.wikimedia.org/r/747850 (owner: 10Arturo Borrero Gonzalez) [14:04:38] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configuratir: introduce some code to detect dead config [puppet] - 10https://gerrit.wikimedia.org/r/747851 (owner: 10Arturo Borrero Gonzalez) [14:06:35] PROBLEM - Check systemd state on ms-be2065 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:57] (03PS1) 10Volans: sre.ganeti: fix get_locations() to support drmrs [cookbooks] - 10https://gerrit.wikimedia.org/r/747852 [14:09:49] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:10:11] Amir1: first, let me test i indeed fixed a train blocker [14:10:20] sure [14:10:30] I'm still testing in beta cluster [14:10:38] ack [14:13:19] (03CR) 10BBlack: [C: 03+1] sre.ganeti: fix get_locations() to support drmrs [cookbooks] - 10https://gerrit.wikimedia.org/r/747852 (owner: 10Volans) [14:13:37] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:47] (03CR) 10Volans: [C: 03+2] sre.ganeti: fix get_locations() to support drmrs [cookbooks] - 10https://gerrit.wikimedia.org/r/747852 (owner: 10Volans) [14:14:47] promoting cawiki to wmf.13 at a debug srv [14:15:23] and things...work! [14:15:37] declaring T297827 resolved [14:15:38] T297827: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T297827 [14:16:01] let me know once you're done [14:16:20] (03Merged) 10jenkins-bot: sre.ganeti: fix get_locations() to support drmrs [cookbooks] - 10https://gerrit.wikimedia.org/r/747852 (owner: 10Volans) [14:16:25] Amir1: that just happened :)) [14:16:55] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/747791 should likely get backported at some point. 
The invalidating method should not be called by anything ATM, but just in case [14:16:57] (03PS1) 10Ladsgroup: rdbms: add query timeout support to Database::select() [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747692 (https://phabricator.wikimedia.org/T129093) [14:17:13] but, that can wait (and I need to go now anyway :D) [14:17:29] (03PS1) 10Ladsgroup: Add a config to pass the test [extensions/Wikibase] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747693 (https://phabricator.wikimedia.org/T297708) [14:17:36] (03CR) 10Ladsgroup: [C: 03+2] Add a config to pass the test [extensions/Wikibase] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747693 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [14:17:43] (03PS5) 10Juan90264: Fix wordmark to outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746919 (https://phabricator.wikimedia.org/T297580) [14:17:45] (03PS1) 10Ladsgroup: Add a config to pass the test [extensions/Wikibase] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747694 (https://phabricator.wikimedia.org/T297708) [14:17:52] thanks. 
[14:18:07] (03CR) 10Ladsgroup: [C: 03+2] Add a config to pass the test [extensions/Wikibase] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747694 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [14:18:24] (03PS1) 10Ladsgroup: Allow setting max execution time to several special pages [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747695 (https://phabricator.wikimedia.org/T297708) [14:18:36] (03PS1) 10Ladsgroup: Allow setting max execution time to several special pages [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747696 (https://phabricator.wikimedia.org/T297708) [14:20:27] (03PS2) 10Muehlenhoff: Make ganeti2026 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/747822 [14:22:17] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configuration: relocate hosts updater function [puppet] - 10https://gerrit.wikimedia.org/r/747849 [14:23:32] (03PS1) 10Muehlenhoff: Fix hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/747854 [14:24:08] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configuration: cache openstack query [puppet] - 10https://gerrit.wikimedia.org/r/747850 [14:24:10] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configuratir: introduce some code to detect dead config [puppet] - 10https://gerrit.wikimedia.org/r/747851 [14:26:01] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configuratir: introduce some code to detect dead config [puppet] - 10https://gerrit.wikimedia.org/r/747851 (owner: 10Arturo Borrero Gonzalez) [14:26:05] (03PS6) 10Juan90264: Fix wordmark to outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746919 (https://phabricator.wikimedia.org/T297580) [14:29:33] (03CR) 10Muehlenhoff: [C: 03+2] Fix hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/747854 (owner: 10Muehlenhoff) [14:33:59] 10SRE, 10ops-codfw, 10serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - 
https://phabricator.wikimedia.org/T297422 (10elukey) [14:36:44] (03PS1) 10Ladsgroup: Set a maximum allowed time for db queries [extensions/intersection] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747697 (https://phabricator.wikimedia.org/T297708) [14:37:12] (03PS1) 10Ladsgroup: Set a maximum allowed time for db queries [extensions/intersection] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747698 (https://phabricator.wikimedia.org/T297708) [14:37:31] (03CR) 10Ladsgroup: [C: 03+2] rdbms: add query timeout support to Database::select() [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747692 (https://phabricator.wikimedia.org/T129093) (owner: 10Ladsgroup) [14:38:10] (03PS7) 10Jbond: reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 [14:38:48] (03PS1) 10Ladsgroup: Revision: Add two caching layers to loadSlotRecords for template pages [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747699 (https://phabricator.wikimedia.org/T297147) [14:39:08] (03PS1) 10Ladsgroup: Revision: Add two caching layers to loadSlotRecords for template pages [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747700 (https://phabricator.wikimedia.org/T297147) [14:39:37] (03PS3) 10Muehlenhoff: Make ganeti2026 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/747822 [14:41:37] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [14:44:24] (03Merged) 10jenkins-bot: Add a config to pass the test [extensions/Wikibase] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747693 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [14:44:27] (03Merged) 10jenkins-bot: Add a config to pass the test [extensions/Wikibase] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747694 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [14:44:32] !log drain primary/secondary 
instances off ganeti2007 T296622 [14:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:38] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [14:46:53] (03PS1) 10Jbond: O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) [14:46:55] (03PS1) 10MMandere: site: Add drmrs bastion host [puppet] - 10https://gerrit.wikimedia.org/r/747856 (https://phabricator.wikimedia.org/T282787) [14:48:43] (03PS1) 10Ladsgroup: Gradual roll out of $wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747857 (https://phabricator.wikimedia.org/T297708) [14:49:39] (03PS2) 10Filippo Giunchedi: prometheus: extend blackbox probes options [puppet] - 10https://gerrit.wikimedia.org/r/747835 (https://phabricator.wikimedia.org/T291946) [14:49:41] (03PS2) 10Filippo Giunchedi: hieradata: add zotero and helm-charts probes [puppet] - 10https://gerrit.wikimedia.org/r/747836 (https://phabricator.wikimedia.org/T291946) [14:50:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:56] (03PS1) 10Cmjohnson: add new prometheus servers to site.pp in setup role [puppet] - 10https://gerrit.wikimedia.org/r/747858 (https://phabricator.wikimedia.org/T294967) [14:51:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:38] (03CR) 10Filippo Giunchedi: "Thank you for the review" [puppet] - 10https://gerrit.wikimedia.org/r/747835 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:51:55] (03CR) 10Cmjohnson: [C: 03+2] add new prometheus servers to site.pp in setup role [puppet] - 10https://gerrit.wikimedia.org/r/747858 (https://phabricator.wikimedia.org/T294967) (owner: 10Cmjohnson) [14:53:08] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2005 is CRITICAL: 69 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2005 [14:53:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10Cmjohnson) [14:53:38] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 335 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [14:54:39] (03CR) 10Ssingh: [C: 03+1] site: Add drmrs bastion host [puppet] - 10https://gerrit.wikimedia.org/r/747856 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [14:55:15] downtimed the kafka alerts, it is me [14:55:35] !log shutdown kafka-main2001 for BIOS+NIC firmware upgrades [14:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:03] jouncebot: nowandnext [14:56:03] For the next 1 hour(s) and 3 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1400) [14:56:03] In 2 hour(s) and 3 minute(s): 
Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1700) [14:56:19] (03PS4) 10Muehlenhoff: Make ganeti2026 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/747822 [14:56:21] (03PS1) 10Cmjohnson: Adding new ganeti hosts to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/747860 (https://phabricator.wikimedia.org/T293909) [14:57:14] (03CR) 10Ladsgroup: [C: 03+2] Allow setting max execution time to several special pages [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747696 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [14:57:23] (03CR) 10Cmjohnson: [C: 03+2] Adding new ganeti hosts to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/747860 (https://phabricator.wikimedia.org/T293909) (owner: 10Cmjohnson) [14:57:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:57:25] (03CR) 10Ladsgroup: [C: 03+2] Allow setting max execution time to several special pages [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747695 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [14:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:58:27] 10SRE, 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10MatthewVernon) [14:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1001.wikimedia.org [14:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:57] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:41] 10SRE, 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10MatthewVernon) @MaryMunyoki this request is good to go once we have confirmation you've signed the L3 document. [15:00:18] (03PS4) 10Herron: prometheus: add blackbox generic http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [15:00:46] (03Merged) 10jenkins-bot: rdbms: add query timeout support to Database::select() [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747692 (https://phabricator.wikimedia.org/T129093) (owner: 10Ladsgroup) [15:01:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1001.wikimedia.org [15:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:50] (03CR) 10Lucas Werkmeister (WMDE): wdqs: switch GUI deployment from latest to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [15:02:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:25] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/libs/rdbms/database/: Backport: 
[[gerrit:747692|rdbms: add query timeout support to Database::select() (T129093 T195792)]] (duration: 01m 11s) [15:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:32] T195792: Add support for setting individual query timeout in wikimedia/rdbms - https://phabricator.wikimedia.org/T195792 [15:03:33] T129093: SHOW SLAVE STATUS as a health check should have a low timeout - https://phabricator.wikimedia.org/T129093 [15:04:06] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=PUT https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:04:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Papaul) [15:05:22] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:32] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:06:19] (03PS1) 10Btullis: Merge branch 'master' into debian [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/747863 (https://phabricator.wikimedia.org/T297468) [15:06:48] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:02] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:48] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:11:01] papaul: did you make other changes on netbox not yet committed by any chance? ^^^ [15:12:12] volans: no, just elastic1084 [15:12:29] (03CR) 10Ottomata: [C: 03+1] Merge branch 'master' into debian [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/747863 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [15:12:45] ack, thx [15:12:47] (03CR) 10Btullis: [V: 03+2 C: 03+2] Merge branch 'master' into debian [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/747863 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [15:15:21] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [15:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:52] RECOVERY - DPKG on maps2005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:16:02] (03PS1) 10Lucas Werkmeister (WMDE): bridge: fix terms of service and copyright missing [extensions/Wikibase] (wmf/1.38.0-wmf.13) - 
10https://gerrit.wikimedia.org/r/747701 [15:18:47] (03PS2) 10MMandere: site: Add drmrs bastion host [puppet] - 10https://gerrit.wikimedia.org/r/747856 (https://phabricator.wikimedia.org/T282787) [15:18:51] (03Merged) 10jenkins-bot: Allow setting max execution time to several special pages [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747696 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [15:19:00] (03Merged) 10jenkins-bot: Allow setting max execution time to several special pages [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747695 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [15:19:13] jouncebot: nowandnext [15:19:14] For the next 0 hour(s) and 40 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1400) [15:19:14] In 1 hour(s) and 40 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1700) [15:19:33] is the train happening? [15:19:46] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [15:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:49] I don't think so [15:19:55] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [15:19:55] I'm deploying though [15:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:29] (03CR) 10Jbond: [C: 03+1] "this look much better 😊. 
left a very minor nit and would be nice to have a spec test for the function but not blocking" [puppet] - 10https://gerrit.wikimedia.org/r/747835 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:20:30] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:40] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:21:51] (03CR) 10BBlack: [C: 03+1] site: Add drmrs bastion host [puppet] - 10https://gerrit.wikimedia.org/r/747856 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [15:22:04] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2005 [15:22:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:22:32] (03CR) 10MMandere: [C: 03+2] site: Add drmrs bastion host [puppet] - 10https://gerrit.wikimedia.org/r/747856 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [15:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:42] Hey folks [15:23:00] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [15:23:21] (03PS5) 10Muehlenhoff: Make ganeti2026 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/747822 [15:24:09] would it be okay for me to deploy a wmf.13 backport? (should only affect cawiki) [15:24:17] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [15:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:25] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [15:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:36] jouncebot now [15:24:36] For the next 0 hour(s) and 35 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1400) [15:24:47] (03PS2) 10Jbond: O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) [15:25:17] Lucas_WMDE: Seems like you should be ok [15:25:30] alright, thanks [15:25:33] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [15:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 
03+2] "backporting" [extensions/Wikibase] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747701 (owner: 10Lucas Werkmeister (WMDE)) [15:26:03] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/747690 should also be done at some point but I’d like to have Eric or Roan around for that [15:26:25] you will see some syntax errors for mwdebug, ignore those [15:26:44] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [15:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:09] Lucas_WMDE: hit the +2 [15:27:23] I'm deploying but it's almost done [15:27:29] I did [15:27:37] it’ll take some 20 minutes in gate-and-submit anyways ^^ [15:28:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, though please note that the URLs here will be checked by Prometheus hosts in all sites" [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [15:28:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:49] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.13/includes/DefaultSettings.php: Backport: [[gerrit:747695|Allow setting max execution time to several special pages (T297708)], Part I (duration: 01m 06s) [15:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:54] T297708: Set max execution time for several expensive mediawiki actions - https://phabricator.wikimedia.org/T297708 [15:32:09] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.13/includes/: Backport: [[gerrit:747695|Allow setting max execution time to several special pages (T297708)], Part II (duration: 01m 12s) [15:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:54] (03CR) 10Ladsgroup: [C: 03+2] Gradual roll out of $wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747857 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [15:34:12] (03Merged) 10jenkins-bot: Gradual roll out of $wgMaxExecutionTimeForExpensiveQueries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747857 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [15:34:19] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/DefaultSettings.php: Backport: [[gerrit:747696|Allow setting max execution time to several special pages (T297708)], Part I (duration: 01m 05s) [15:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:40] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/: Backport: [[gerrit:747696|Allow setting max execution time to several special pages (T297708)], Part II (duration: 01m 11s) [15:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:09] the one third of max exec time is going out. cc. 
marostegui [15:36:17] sweet [15:36:46] 10SRE, 10ops-codfw, 10serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (10Papaul) [15:36:57] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:747857|Gradual roll out of $wgMaxExecutionTimeForExpensiveQueries (T297708)]] (duration: 01m 06s) [15:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:02] T297708: Set max execution time for several expensive mediawiki actions - https://phabricator.wikimedia.org/T297708 [15:37:50] Lucas_WMDE: I'm done for now, I have two more deployments I'm planning to move but it can wait for now [15:38:12] ack [15:38:17] (03PS1) 10Matthias Mullie: Add MediaSearch profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747868 (https://phabricator.wikimedia.org/T297863) [15:38:45] maybe I should backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/747690 as well… I think I know how to test it, at least [15:39:05] (03CR) 10Matthias Mullie: [C: 04-1] "Patch that depends on this has not yet been properly tested & reviewed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747868 (https://phabricator.wikimedia.org/T297863) (owner: 10Matthias Mullie) [15:39:56] (03CR) 10JMeybohm: [C: 03+1] helmfile.d/admin_ng: change ci deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/747814 (https://phabricator.wikimedia.org/T297809) (owner: 10Jelto) [15:40:20] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) This is a prototype version of the (trivial/non-massive) recovery script, interactive version: {F34886587} I got ins... 
[15:40:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:05] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [15:41:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10Cmjohnson) [15:42:01] !log shutdown kafka-main2002 for BIOS+NIC firmware upgrades [15:42:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:52] ok, I can reproduce that MediaSearch array-to-string-conversion error on mwdebug, so unless someone objects I’ll backport the fix for that after the Wikidata Bridge thing [15:42:59] and that should take care of that train blocker [15:43:19] Please do! 
[15:45:43] (03PS1) 10Cmjohnson: Adding kubernetes1022 to site.pp and netboot.cfg setup role [puppet] - 10https://gerrit.wikimedia.org/r/747871 (https://phabricator.wikimedia.org/T294301) [15:46:04] (03PS2) 10Cmjohnson: Adding kubernetes1022 to site.pp and netboot.cfg setup role [puppet] - 10https://gerrit.wikimedia.org/r/747871 (https://phabricator.wikimedia.org/T294301) [15:48:10] (03CR) 10Cmjohnson: [C: 03+2] Adding kubernetes1022 to site.pp and netboot.cfg setup role [puppet] - 10https://gerrit.wikimedia.org/r/747871 (https://phabricator.wikimedia.org/T294301) (owner: 10Cmjohnson) [15:51:46] (03Merged) 10jenkins-bot: bridge: fix terms of service and copyright missing [extensions/Wikibase] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747701 (owner: 10Lucas Werkmeister (WMDE)) [15:52:26] testing ^ on mwdebug1001 [15:52:51] seems to work [15:52:59] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "backporting" [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747690 (https://phabricator.wikimedia.org/T297828) (owner: 10Lucas Werkmeister (WMDE)) [15:54:37] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/Wikibase/client/data-bridge/: Backport: [[gerrit:747701|bridge: fix terms of service and copyright missing]] (duration: 01m 06s) [15:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host prometheus1005.eqiad.wmnet with OS bullseye [15:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host prometheus1005.eqiad.wmnet with OS bull... 
[15:58:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:47] (03PS1) 10Muehlenhoff: Remove LDAP entry which is already present for shell access [puppet] - 10https://gerrit.wikimedia.org/r/747872 [15:59:20] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [15:59:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:23] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host prometheus1006.eqiad.wmnet with OS bullseye [16:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host prometheus1006.eqiad.wmnet with OS bull... 
[16:03:31] Lucas_WMDE: can I hit +2 on my patches [16:03:44] it'll take twenty minutes-ish [16:03:47] sure [16:03:56] the MediaSearch backport should be done soon and it shouldn’t take me that long to test it [16:04:00] (03CR) 10Ladsgroup: [C: 03+2] Set a maximum allowed time for db queries [extensions/intersection] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747698 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [16:04:05] (03CR) 10Ladsgroup: [C: 03+2] Set a maximum allowed time for db queries [extensions/intersection] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747697 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [16:05:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1022.eqiad.wmnet with OS bullseye [16:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1022.eqiad.wmnet with OS bull... 
[16:07:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS buster [16:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster [16:07:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1026.eqiad.wmnet with OS buster [16:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster [16:09:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1027.eqiad.wmnet with OS buster [16:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster [16:09:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1028.eqiad.wmnet with OS buster [16:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster [16:10:09] hello [16:10:34] dancy: hashar you might see new errors like "Error 1969: Query execution was interrupted" (I'm seeing them already) but that's intentional. Just remove them from logs. [16:10:45] s/remove/hidee [16:10:55] ah we have just started the train log triage meeting [16:11:03] "(max_statement_time exceeded)" [16:12:15] when are you planning to push wmf.13 to group1 and 2? [16:12:36] no idea [16:12:43] I am catching up with the blocker task [16:12:51] I guess we will do group 1 then group2 immediately after [16:13:27] urbanecm: thank you for the cawiki login fix! :] [16:13:42] hashar: no problem! [16:15:01] oh and Lucas_WMDE is doing the mediasearch one ;) [16:15:07] :) [16:18:52] (03Merged) 10jenkins-bot: Filter out non-string keys/values from query string before using [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747690 (https://phabricator.wikimedia.org/T297828) (owner: 10Lucas Werkmeister (WMDE)) [16:18:54] (03Merged) 10jenkins-bot: Set a maximum allowed time for db queries [extensions/intersection] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747698 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [16:18:56] (03Merged) 10jenkins-bot: Set a maximum allowed time for db queries [extensions/intersection] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747697 (https://phabricator.wikimedia.org/T297708) (owner: 10Ladsgroup) [16:19:19] lol it merged all together [16:19:21] Amir1: I’ll rebase wmf.13 on @{u}^ to pull in only the MediaSearch change [16:19:29] so you’ll still have to rebase for that intersection thing [16:19:33] (whatever that is… it contains a copy of DPL?) 
[16:19:33] sure [16:19:40] it is DPL [16:19:41] (03PS5) 10Herron: prometheus: add blackbox generic http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [16:19:44] just weird name [16:19:47] ah [16:20:01] “intersection” as in, a list of pages within an intersection of categories, as the original use case? [16:20:03] or whatever [16:20:20] (03PS6) 10Herron: prometheus: add blackbox generic http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [16:20:47] yup [16:20:59] https://www.mediawiki.org/wiki/Extension:DynamicPageList [16:21:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus1005.eqiad.wmnet with OS bullseye [16:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host prometheus1005.eqiad.wmnet with OS bullseye... 
[16:21:10] oh good my WikimediaDebug isn’t working [16:21:11] DynamicPageList (Wikimedia), also known as Intersection [16:21:12] * Lucas_WMDE digs up curl [16:21:20] oh right I forgot there’s several of them [16:22:41] welp, I just realized I backported to the wrong branch [16:22:46] well, not exactly [16:22:46] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add blackbox generic http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [16:22:48] it’s nice to have it fixed on wmf.13 [16:22:56] but I can’t test it there while Commons is on wmf.12, which also had the bug ^^ [16:23:09] let’s try test-commons [16:23:38] ok fix seems to work on test-commons, I’ll sync it [16:25:14] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/MediaSearch/: Backport: [[gerrit:747690|Filter out non-string keys/values from query string before using (T297828)]] (duration: 01m 06s) [16:25:17] (03PS7) 10Herron: prometheus: add blackbox generic http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [16:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:20] T297828: PHP Notice: Array to string conversion - https://phabricator.wikimedia.org/T297828 [16:25:43] alright, I’m done for now [16:25:47] Amir1: go ahead [16:25:53] awesome [16:26:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:26:10] I’ll create a wmf.12 cherry-pick of the MediaSearch backport but not sure if I’ll have time to deploy it today [16:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus1006.eqiad.wmnet with OS bullseye [16:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host prometheus1006.eqiad.wmnet with OS bullseye... [16:26:18] (03PS1) 10Lucas Werkmeister (WMDE): Filter out non-string keys/values from query string before using [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747702 (https://phabricator.wikimedia.org/T297828) [16:27:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:37] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add blackbox generic http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [16:27:55] 10SRE, 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10Aklapper) (Rather unrelated: Is there a particular reason the [Phab account](https://phabricator.wikimedia.org/p/MaryMunyoki/) is linked to a personal SUL account...
[16:28:47] looks fine, moving forward [16:28:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1022.eqiad.wmnet with OS bullseye [16:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1022.eqiad.wmnet with OS bullseye... [16:29:22] (03PS13) 10Arturo Borrero Gonzalez: toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [16:30:08] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/intersection: Backport: [[gerrit:747697|Set a maximum allowed time for db queries (T297708)]] (duration: 01m 05s) [16:30:12] (03CR) 10Ladsgroup: [C: 03+2] Revision: Add two caching layers to loadSlotRecords for template pages [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747699 (https://phabricator.wikimedia.org/T297147) (owner: 10Ladsgroup) [16:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:13] T297708: Set max execution time for several expensive mediawiki actions - https://phabricator.wikimedia.org/T297708 [16:30:15] (03CR) 10Ladsgroup: [C: 03+2] Revision: Add two caching layers to loadSlotRecords for template pages [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747700 (https://phabricator.wikimedia.org/T297147) (owner: 10Ladsgroup) [16:32:30] (03PS8) 10Herron: prometheus: add blackbox generic http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [16:32:41] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/intersection: 
Backport: [[gerrit:747698|Set a maximum allowed time for db queries (T297708)]] (duration: 01m 06s) [16:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:33:17] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) I wrote a small script to grep operations-puppet and cloud-instance-puppet with the class names pending above, and got this: {F34886637} The ones tha... [16:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:00] (03PS9) 10Herron: prometheus: add blackbox generic http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [16:38:47] (03CR) 10Herron: "It occurred to me that some urls will need connectivity outward, so have updated this to include a module to check via the local http prox" [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [16:39:55] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1027.eqiad.wmnet with OS buster [16:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster executed with... 
[16:40:05] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1028.eqiad.wmnet with OS buster [16:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster executed with... [16:43:15] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Ladsgroup) Update: From now on, any db queries made by DPL has timeout of ten seconds. Th... [16:44:25] (03PS1) 10Lucas Werkmeister (WMDE): logspam: Consolidate max_statement_time errors [puppet] - 10https://gerrit.wikimedia.org/r/747876 (https://phabricator.wikimedia.org/T297708) [16:46:58] (03Abandoned) 10Ahmon Dancy: WIP: fix logspam script [puppet] - 10https://gerrit.wikimedia.org/r/577657 (owner: 10C. Scott Ananian) [16:47:03] (03CR) 10Ladsgroup: [C: 03+1] "LGTM, I don't know how to deploy this though. I add Cole and Filippo" [puppet] - 10https://gerrit.wikimedia.org/r/747876 (https://phabricator.wikimedia.org/T297708) (owner: 10Lucas Werkmeister (WMDE)) [16:47:25] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) > I wrote a small script to grep operations-puppet and cloud-instance-puppet Did you also see `utils/audit.py` in the puppet repo would be good to m... 
[16:47:32] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1025.eqiad.wmnet with OS buster [16:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster executed with... [16:48:00] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1026.eqiad.wmnet with OS buster [16:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster executed with... 
[16:48:10] (03CR) 10Ahmon Dancy: [C: 03+1] WIP: logspam: discard upper-cased UTF-8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/726122 (https://phabricator.wikimedia.org/T292246) (owner: 10Brennen Bearnes) [16:48:24] !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host bast6001.wikimedia.org [16:48:24] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host bast6001.wikimedia.org [16:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10Cmjohnson) [16:48:50] (03CR) 10Ahmon Dancy: [C: 03+1] logspam: Consolidate max_statement_time errors [puppet] - 10https://gerrit.wikimedia.org/r/747876 (https://phabricator.wikimedia.org/T297708) (owner: 10Lucas Werkmeister (WMDE)) [16:48:54] (03CR) 10Ahmon Dancy: [C: 03+1] logspam: Consolidate another kind of OOM messages [puppet] - 10https://gerrit.wikimedia.org/r/747102 (owner: 10Lucas Werkmeister (WMDE)) [16:49:19] !log pruned jndilookup.class from log4j-core on logstash 5 instances T297468 [16:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10Cmjohnson) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1022.eqiad.wmnet with OS bullseye completed: kubernetes1022 (PASS... 
[16:49:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10Cmjohnson) 05Open→03Resolved ready to turn over [16:49:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10Cmjohnson) [16:49:46] (03CR) 10Lucas Werkmeister (WMDE): "(These already went away again AFAICT, so personally I wouldn’t mind if this was abandoned, FWIW. But I suppose it wouldn’t hurt either.)" [puppet] - 10https://gerrit.wikimedia.org/r/747102 (owner: 10Lucas Werkmeister (WMDE)) [16:49:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10Cmjohnson) 05Open→03Resolved ready to turnover [16:51:15] I will promote wikis to group 1 in 10 minutes [16:51:29] hashar: can it wait for a while? [16:51:36] I'm at middle of a deploy [16:51:41] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) >>! In T272559#7575754, @jbond wrote: >> I wrote a small script to grep operations-puppet and cloud-instance-puppet > > Did you also see `utils/audi... [16:52:26] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) a:05Cmjohnson→03Ottomata @Ottomata Can you verify that this is using the correct partman recipe, the installer fails during the install at th... 
[16:52:54] (03Merged) 10jenkins-bot: Revision: Add two caching layers to loadSlotRecords for template pages [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747699 (https://phabricator.wikimedia.org/T297147) (owner: 10Ladsgroup) [16:53:11] dcausse: re unused puppet resources, its while since i looked but it also parses https://openstack-browser.toolforge.org/puppetclass/ [16:53:23] sorry that was intended for dcaro ^^ [16:54:18] (03Merged) 10jenkins-bot: Revision: Add two caching layers to loadSlotRecords for template pages [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747700 (https://phabricator.wikimedia.org/T297147) (owner: 10Ladsgroup) [16:55:02] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.13/includes/Revision/RevisionStore.php: Backport: [[gerrit:747699|Revision: Add two caching layers to loadSlotRecords for template pages (T297147)]] (duration: 01m 06s) [16:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:08] T297147: RevisionStore::newRevisionSlots() needs a cache - https://phabricator.wikimedia.org/T297147 [16:55:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:52] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/Revision/RevisionStore.php: Backport: [[gerrit:747700|Revision: Add two caching layers to loadSlotRecords for template pages (T297147)]] (duration: 01m 06s) [16:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:26] hashar: so I need around an hour to measure its impact and then it's definitely fine to move forward [16:57:27] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10jbond) > Will take a look, I thought that one only checked puppetdb it also parses https://openstack-browser.toolforge.org/puppetclass/ although dose so with... [16:58:03] 10SRE, 10ops-codfw, 10serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (10Papaul) [17:00:04] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:20] Amir1: agh I missed your message :D [17:00:43] we wanted to get group1 as soon as possible then do group 2 during the normal window [17:01:03] (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: relocate hosts updater function [puppet] - 10https://gerrit.wikimedia.org/r/747849 [17:01:05] (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: cache openstack query [puppet] - 10https://gerrit.wikimedia.org/r/747850 [17:01:07] (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce some code to detect dead config [puppet] - 10https://gerrit.wikimedia.org/r/747851 [17:02:27] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: introduce some code to detect dead config [puppet] - 10https://gerrit.wikimedia.org/r/747851 (owner: 10Arturo Borrero Gonzalez) [17:02:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:02:33] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=PUT https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:57] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:04:33] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Ottomata) a:05Ottomata→03BTullis I'm not familiar with what is going on with this node atm, pinging @btullis! 
[17:05:21] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2003 is CRITICAL: 350 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [17:05:21] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2005 is CRITICAL: 59 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2005 [17:06:00] this is me --^ [17:06:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:35] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10BTullis) I'm happy to look at it. It's likely that I've set the wrong partman recipe, so sincere apologies if I've wasted your time. I'll look at it asap. 
[17:09:55] (03PS1) 10Michael Große: bridge: Reenable scrolling by mounting into parent [extensions/Wikibase] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747704 [17:10:39] (03PS4) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce some code to detect dead config [puppet] - 10https://gerrit.wikimedia.org/r/747851 [17:11:37] 10SRE, 10ops-codfw, 10serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (10Papaul) 05Open→03Resolved [17:11:41] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [17:11:45] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: introduce some code to detect dead config [puppet] - 10https://gerrit.wikimedia.org/r/747851 (owner: 10Arturo Borrero Gonzalez) [17:11:49] (03CR) 10Ejegg: [C: 03+1] "+1 I approve this update. Majavah, want to remove your -1 now that the commit message has been updated?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649942 (https://phabricator.wikimedia.org/T270308) (owner: 10Ejegg) [17:12:50] Amir1: still around? ;D [17:13:00] I'm always around [17:13:08] heh [17:13:23] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2005 [17:13:29] I'm checking the graphs, so far it's good. 
Just need to wait for a bit [17:13:30] so the short story is wmf.13 had to be rolled back yesterday and I wanted to move it to group 1 now to start collecting errors [17:13:52] I guess if you don't need hours of data, that is fine :] [17:14:16] does the extra caching saves db queries? [17:14:27] yeah [17:14:40] I need it to be around half an hour now [17:14:48] then feel free to move the train [17:15:00] (03CR) 10Ahmon Dancy: [C: 03+1] logspam: Consolidate another kind of OOM messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747102 (owner: 10Lucas Werkmeister (WMDE)) [17:15:04] (03CR) 10Ahmon Dancy: logspam: Consolidate another kind of OOM messages [puppet] - 10https://gerrit.wikimedia.org/r/747102 (owner: 10Lucas Werkmeister (WMDE)) [17:15:29] (03PS1) 10Ebernhardson: dumps: Move cirrus dumps to friday [puppet] - 10https://gerrit.wikimedia.org/r/747879 (https://phabricator.wikimedia.org/T265056) [17:17:33] (03PS2) 10Ebernhardson: dumps: Move cirrus dumps to friday [puppet] - 10https://gerrit.wikimedia.org/r/747879 (https://phabricator.wikimedia.org/T265056) [17:18:47] ok I am taking a short break :] [17:19:40] (03PS1) 10Elukey: kserve-inference: allow the definition of tranformers [deployment-charts] - 10https://gerrit.wikimedia.org/r/747880 [17:19:53] (03PS1) 10RLazarus: Add a pod_name column to ActiveContainerImage [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/747881 (https://phabricator.wikimedia.org/T287130) [17:20:33] (03PS1) 10Volans: sre.ganeti.makevm: fix vlan selection for drmrs [cookbooks] - 10https://gerrit.wikimedia.org/r/747882 [17:22:04] (03CR) 10BBlack: [C: 03+1] sre.ganeti.makevm: fix vlan selection for drmrs [cookbooks] - 10https://gerrit.wikimedia.org/r/747882 (owner: 10Volans) [17:25:05] (03CR) 10Volans: [C: 03+2] sre.ganeti.makevm: fix vlan selection for drmrs [cookbooks] - 10https://gerrit.wikimedia.org/r/747882 (owner: 10Volans) [17:26:25] 10SRE, 10Foundational Technology Requests, 10Traffic, 10Wikimedia 
Enterprise, 10Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (10DAbad) 05Open→03In progress p:05Triage→03High [17:29:52] !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host bast6001.wikimedia.org [17:29:55] !log pruned jndilookup.class from log4j-core on logstash 7 instances T297468 [17:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:59] (03PS2) 10Brennen Bearnes: WIP: logspam: discard upper-cased UTF-8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/726122 (https://phabricator.wikimedia.org/T292246) [17:33:38] (03PS2) 10Elukey: kserve-inference: allow the definition of tranformers [deployment-charts] - 10https://gerrit.wikimedia.org/r/747880 [17:38:59] (03CR) 10Herron: [C: 03+2] mx: make exim queue alert paging [puppet] - 10https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144) (owner: 10Herron) [17:42:23] (03PS5) 10Cwhite: role: add apifeatureusage role [puppet] - 10https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) [17:43:08] (03CR) 10Cwhite: role: add apifeatureusage role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [17:45:31] (03CR) 10RLazarus: Use the Kubernetes config API as it was in v7.0.0 (buster) (031 comment) [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/747683 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [17:45:38] (03PS4) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: relocate hosts updater function [puppet] - 10https://gerrit.wikimedia.org/r/747849 [17:45:40] (03PS4) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: cache openstack query [puppet] - 10https://gerrit.wikimedia.org/r/747850 [17:45:42] (03PS5) 10Arturo Borrero Gonzalez: 
sonofgridengine: grid-configurator: introduce some code to detect dead config [puppet] - 10https://gerrit.wikimedia.org/r/747851 [17:47:45] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid-configurator: introduce some code to detect dead config [puppet] - 10https://gerrit.wikimedia.org/r/747851 (owner: 10Arturo Borrero Gonzalez) [17:48:58] (03PS4) 10RLazarus: imagecatalog: Pass cluster names along with config paths [puppet] - 10https://gerrit.wikimedia.org/r/747685 (https://phabricator.wikimedia.org/T287130) [17:50:13] (03CR) 10RLazarus: imagecatalog: Pass cluster names along with config paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747685 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [17:50:16] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10herron) [17:50:20] 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, and 2 others: large MX queues should page - https://phabricator.wikimedia.org/T297144 (10herron) 05Open→03Resolved a:03herron I know the task description says "threshold to be determined" but calling more attention to the current check wou... [17:57:17] back [17:57:30] (03CR) 10Accraze: [C: 03+1] kserve-inference: allow the definition of tranformers [deployment-charts] - 10https://gerrit.wikimedia.org/r/747880 (owner: 10Elukey) [17:57:57] Amir1: good to go ? :] [17:58:06] yes, thanks! [17:58:35] :] [18:00:04] chrisalbon and accraze: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1800). 
[18:01:37] promoting [18:01:43] (03PS1) 10Hashar: group1 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747887 [18:01:45] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747887 (owner: 10Hashar) [18:02:28] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747887 (owner: 10Hashar) [18:03:58] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.13 refs T293954 [18:04:01] oh I should have looked at logstash before [18:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:05] T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954 [18:04:13] Amir1: there are bunch of issues showing up [18:04:14] [{reqId}] {exception_url} MediaWiki\Storage\NameTableAccessException: Failed to access name from slot_roles using id = 0 [18:04:18] [{reqId}] {exception_url} PHP Notice: Undefined property: MediaWiki\Revision\SlotRecord::$slot_role_id [18:04:38] and 4 [{reqId}] {exception_url} LogicException: Instances of MediaWiki\Revision\SlotRecord are not serializable! [18:05:05] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.13 refs T293954 (duration: 01m 05s) [18:05:09] these are related to mine, I'll fix them [18:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:47] it is not that many errors like ~ 500 over one hour [18:08:04] why it's all enwiki [18:08:14] confused [18:08:20] I will take a look soon [18:08:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:39] (03PS1) 10Volans: sre.ganeti.makevm: support drmrs netbox sync [cookbooks] - 10https://gerrit.wikimedia.org/r/747890 [18:09:21] seems to start with PHP Notice: Undefined property: MediaWiki\Revision\SlotRecord::$slot_role_id [18:09:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:02] My hypothesis is that there is something in wmf.13 that I haven't backported and this patch somehow depended on [18:11:12] possibly [18:11:24] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10Legoktm) Added to https://meta.wikimedia.org/w/index.php?title=Tech%2FNews%2F2021%2F51&ty... [18:11:31] if has little to no impact, I will push wmf.13 to the other wikis in a couple hours [18:11:33] it can be some sort of cache corruption and lack of proper handling of it [18:11:39] possibly [18:11:44] it's weird it's only enwiki [18:12:11] without having looked at the patch, enwiki also has lots of old revisions that might have been serialized differently [18:16:17] this is just slot information, it should be fully updated but it's possible [18:20:12] (03PS1) 10Jbond: P:environment: Add a simple zshrc file to the home dir [puppet] - 10https://gerrit.wikimedia.org/r/747891 [18:20:52] found a fix that's needed, I don't know if it solves this issue or not [18:22:16] (03PS7) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) [18:22:41] (03CR) 10Jbond: "added cdanis as (to my surprise) the only other zsh user" [puppet] - 10https://gerrit.wikimedia.org/r/747891 (owner: 10Jbond) 
[18:23:28] legoktm: this fix an edge case: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/747892 can't say for sure, if it's the same or different [18:23:54] +2'd [18:25:17] (03PS1) 10Ladsgroup: Revision: Bypass checking the cache if it's not found [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747906 [18:25:27] (03PS1) 10Ladsgroup: Revision: Bypass checking the cache if it's not found [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747907 [18:25:32] (03CR) 10Ladsgroup: [C: 03+2] Revision: Bypass checking the cache if it's not found [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747906 (owner: 10Ladsgroup) [18:25:35] (03CR) 10Ladsgroup: [C: 03+2] Revision: Bypass checking the cache if it's not found [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747907 (owner: 10Ladsgroup) [18:25:45] Thanks. Cherry-picked to be deployed [18:29:13] (03PS2) 10Jbond: P:environment: Add a simple zshrc file to the home dir [puppet] - 10https://gerrit.wikimedia.org/r/747891 [18:31:16] (03PS3) 10Jbond: P:environment: Add a simple zshrc file to the home dir [puppet] - 10https://gerrit.wikimedia.org/r/747891 [18:31:23] (03CR) 10AOkoth: [C: 03+2] gitlab: restore script keep_config options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth) [18:32:45] (03PS4) 10Jbond: P:environment: Add a simple zshrc file to the home dir [puppet] - 10https://gerrit.wikimedia.org/r/747891 [18:33:14] (03PS5) 10Jbond: P:environment: Add a simple zshrc file to the home dir [puppet] - 10https://gerrit.wikimedia.org/r/747891 [18:38:21] (03PS6) 10Jbond: P:environment: Add a simple zshrc file to the home dir [puppet] - 10https://gerrit.wikimedia.org/r/747891 [18:38:22] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:40:43] (03CR) 10Jbond: 
P:environment: Add a simple zshrc file to the home dir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747891 (owner: 10Jbond) [18:41:40] 10SRE, 10Traffic-Icebox, 10Performance-Team (Radar), 10User-CDanis: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Krinkle) [18:41:44] train looks good so far [18:46:38] (03Merged) 10jenkins-bot: Revision: Bypass checking the cache if it's not found [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747906 (owner: 10Ladsgroup) [18:47:31] (03Merged) 10jenkins-bot: Revision: Bypass checking the cache if it's not found [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747907 (owner: 10Ladsgroup) [18:48:32] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/Revision/RevisionStore.php: Backport: [[gerrit:747906|Revision: Bypass checking the cache if it's not found]] (duration: 01m 06s) [18:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:55] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.13/includes/Revision/RevisionStore.php: Backport: [[gerrit:747907|Revision: Bypass checking the cache if it's not found]] (duration: 01m 06s) [18:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:01] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host bast6001.wikimedia.org [18:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] RoanKattouw and Urbanecm: Dear deployers, time to do the UTC evening backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T1900). [19:00:05] wugapodes, Jdlrobson, and MichaelG_WMDE: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:19] hi [19:00:32] present [19:00:56] I can deploy [19:01:07] here [19:01:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:25] i'm here too [19:01:29] but I'll leave it to RoanKattouw :)) [19:01:30] (03CR) 10Catrope: [C: 03+2] bridge: Reenable scrolling by mounting into parent [extensions/Wikibase] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747704 (owner: 10Michael Große) [19:01:58] urbanecm: I have to leave in 20 mins so you might have to do the last patch (the backport) if CI is slow [19:02:07] (03PS4) 10Catrope: Enable VectorLanguageInMainPageHeader on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745335 (https://phabricator.wikimedia.org/T293470) (owner: 10Jdlrobson) [19:02:08] RoanKattouw: sounds good [19:02:10] ping when I'm needed [19:02:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:28] (03CR) 10Catrope: [C: 03+2] Enable VectorLanguageInMainPageHeader on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745335 (https://phabricator.wikimedia.org/T293470) (owner: 10Jdlrobson) [19:04:04] (03Merged) 10jenkins-bot: Enable VectorLanguageInMainPageHeader on main page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745335 (https://phabricator.wikimedia.org/T293470) (owner: 10Jdlrobson) [19:05:38] Jdlrobson: Your patch is on mwdebug1002 for tesing [19:05:41] testing [19:06:30] Does X-Debug work with safemode RoanKattouw ? [19:06:40] hashar: error cleaned up [19:06:43] I think so? Not sure [19:06:56] Amir1: great [19:06:57] thanks legoktm ! [19:07:06] :D [19:07:09] * legoktm did nothing [19:07:10] there are no other concerning errors apparently [19:07:28] I found a couple swift page I/O errors but that looks like one off ones [19:08:00] it looks good overall [19:08:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:08:20] RoanKattouw: I can't see the change for some reason [19:08:21] (03PS2) 10Catrope: Enwiki config: remove autopatrol from sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743646 (https://phabricator.wikimedia.org/T297058) (owner: 10Wugapodes) [19:08:22] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:32] for the javascript client side errors I am afraid I am not qualified / unfamiliar with them unfortunately [19:09:31] RoanKattouw: definitely 1002 ? [19:09:35] Yes 1002 [19:09:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:08] Jdlrobson: And you're testing on eu, not en? [19:10:12] eu yeh [19:10:47] Jdlrobson: Is it trying to switch on functionality that only exists in wmf.13 (this week's train)? [19:10:53] Because eu is still on wmf.12 due to yesterday's train deply [19:10:55] *delay [19:10:57] oh possibly... [19:12:06] yep that's it [19:12:10] feel free to sync [19:12:14] Ok syncing [19:12:17] didn't realize we rolled back. https://phabricator.wikimedia.org/T293470 [19:12:20] says wmf13 [19:12:33] can double check on beta cluster in a bit [19:12:41] (03CR) 10Catrope: [C: 03+2] Enwiki config: remove autopatrol from sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743646 (https://phabricator.wikimedia.org/T297058) (owner: 10Wugapodes) [19:12:58] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:13:34] !log catrope@deploy1002 Synchronized wmf-config: Config: [[gerrit:745335|Enable VectorLanguageInMainPageHeader on main page (T293470)]] (duration: 01m 06s) [19:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:39] T293470: Easier language switching from Main page - https://phabricator.wikimedia.org/T293470 [19:13:41] (03Merged) 10jenkins-bot: Enwiki config: remove autopatrol from sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743646 (https://phabricator.wikimedia.org/T297058) (owner: 10Wugapodes) [19:15:04] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:15:32] wugapodes: Your patch is ready on mwdebug1002, please test [19:15:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:56] will do [19:17:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:24] er, sorry, first deploy having some trouble with the workflow [19:18:09] thanks Roan [19:19:29] wugapodes: let us know if you need any help [19:20:41] got it working with the browser extension :) [19:21:00] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:21:52] wugapodes: outstanding username, btw [19:21:54] Special:ListGroupRights looks correct (sysop no longer has autopatroller listed) [19:21:58] Yay, syncing [19:22:55] urbanecm: I have to go soon, could you deploy MichaelG_WMDE 's patch once it finishes CI? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/747704 [19:23:03] certainly [19:23:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:14] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743646|Enwiki config: remove autopatrol from sysop (T297058)]] (duration: 01m 06s) [19:23:15] RoanKattouw: is everything else done there? 
[19:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:19] Yes [19:23:20] T297058: Remove autopatroller from sysop toolkit on English Wikipedia - https://phabricator.wikimedia.org/T297058 [19:23:23] okay, good to know [19:23:27] wugapodes: Thanks, your change is deployed [19:23:59] 10SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for komla - https://phabricator.wikimedia.org/T297621 (10thcipriani) >>! In T297621#7569114, @MatthewVernon wrote: > @thcipriani I think you're the right person to approve additions to the restricted group; can you confirm you're happy for this to... [19:24:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:30] RoanKAttouw: thank you! and thanks everyone for your help and patience [19:24:43] (03PS1) 10Jbond: P:puppet_compiler: update workers to use shared puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/747899 [19:25:16] just checked and the deployment seems to work [19:25:58] (03Merged) 10jenkins-bot: bridge: Reenable scrolling by mounting into parent [extensions/Wikibase] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747704 (owner: 10Michael Große) [19:26:10] :) [19:26:38] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler: update workers to use shared puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/747899 (owner: 10Jbond) [19:27:48] MichaelG_WMDE: i see it got merged! [19:27:50] let's deploy it too [19:28:28] MichaelG_WMDE: it's at mwdebug1001 -- can you test please? [19:28:37] it has been a while since I did one of these as well. I can test with the extension on test.wikipedia.org, right? 
[19:28:40] * MichaelG_WMDE checks [19:28:55] MichaelG_WMDE: yes, or at any other wiki that currently has wmf.13 [19:29:14] per https://www.wikidata.org/wiki/Special:Version, Wikidata currently uses wmf.13 [19:29:27] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug are the docs if you need to refresh your memory [19:29:28] it works! Thank you 😊 [19:29:32] syncing! [19:30:11] 10SRE, 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10MaryMunyoki) I have signed L3 >>! In T297842#7575369, @MatthewVernon wrote: > @MaryMunyoki this request is good to go once we have confirmation you've signed the... [19:30:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:54] (03PS1) 10Legoktm: Pretend mw1456 is a parsoid appserver for benchmarking [puppet] - 10https://gerrit.wikimedia.org/r/747900 (https://phabricator.wikimedia.org/T297259) [19:31:10] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/Wikibase/: 779938386e32cda075a5790ec90e0fedef0ade9d: bridge: Reenable scrolling by mounting into parent (duration: 01m 12s) [19:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:17] MichaelG_WMDE: and, live [19:31:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:31:24] anything else i can do for you today? [19:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:47] No, I'm happy. Thanks a lot and have a nice evening :) [19:31:56] you too! 
[19:32:27] (03PS1) 10Urbanecm: MentorManager: Only invalidate cache when mentor list exists [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747908 (https://phabricator.wikimedia.org/T297827) [19:32:31] (03CR) 10Urbanecm: [C: 03+2] MentorManager: Only invalidate cache when mentor list exists [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747908 (https://phabricator.wikimedia.org/T297827) (owner: 10Urbanecm) [19:32:35] going to push this too [19:32:35] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33035/console" [puppet] - 10https://gerrit.wikimedia.org/r/747900 (https://phabricator.wikimedia.org/T297259) (owner: 10Legoktm) [19:37:51] (03PS2) 10Jbond: P:puppet_compiler: update workers to use shared puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/747899 [19:46:18] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [19:49:36] (03PS8) 10Jbond: reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 [19:51:28] !log depooling mw1456 for benchmarking (T297259) [19:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:34] T297259: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 [19:52:24] 10SRE, 10DC-Ops: Change physical label from copernicum.wikimedia.org to mirror1001.wikimedia.org - https://phabricator.wikimedia.org/T297906 (10jhathaway) [19:52:30] (03CR) 10Legoktm: [V: 03+1 C: 03+2] Pretend mw1456 is a parsoid appserver for benchmarking [puppet] - 10https://gerrit.wikimedia.org/r/747900 (https://phabricator.wikimedia.org/T297259) (owner: 10Legoktm) [19:54:32] (03Merged) 10jenkins-bot: MentorManager: Only invalidate cache when mentor list exists [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - 
10https://gerrit.wikimedia.org/r/747908 (https://phabricator.wikimedia.org/T297827) (owner: 10Urbanecm) [19:57:11] (03CR) 10Jbond: reposync: add initial repo sync class and profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [19:57:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:21] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/GrowthExperiments/includes/Mentorship/MentorPageMentorManager.php: b8e64fe189a6a447e68e342ce23cedad6f542df0: MentorManager: Only invalidate cache when mentor list exists (T297827) (duration: 01m 06s) [19:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:26] T297827: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T297827 [19:58:33] !log UTC evening B&C window done [19:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] hashar and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211216T2000). 
[20:00:09] (03PS9) 10Jbond: reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 [20:00:25] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: reimage physical host with new hostname mirror1001 - https://phabricator.wikimedia.org/T297508 (10jhathaway) a:05MoritzMuehlenhoff→03jhathaway [20:00:27] 10SRE, 10Infrastructure-Foundations: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10jhathaway) a:05MoritzMuehlenhoff→03jhathaway [20:00:47] 10SRE, 10Infrastructure-Foundations: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10jhathaway) [20:00:49] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: reimage physical host with new hostname mirror1001 - https://phabricator.wikimedia.org/T297508 (10jhathaway) 05Open→03Resolved [20:01:11] (03CR) 10jerkins-bot: [V: 04-1] reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [20:02:26] 10SRE, 10DC-Ops: Change physical label from copernicum.wikimedia.org to mirror1001.wikimedia.org - https://phabricator.wikimedia.org/T297906 (10jhathaway) [20:02:28] 10SRE, 10Infrastructure-Foundations: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10jhathaway) [20:02:54] o/ [20:03:02] RECOVERY - Check systemd state on ms-be2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:42] 10SRE, 10Infrastructure-Foundations: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10jhathaway) [20:05:15] (03PS1) 10Ideophagous: arywiki NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747903 (https://phabricator.wikimedia.org/T291737) [20:08:11] soo hmm good evening [20:08:17] (03Abandoned) 10Ideophagous: arywiki NS [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [20:08:18] since group1 barely raised any error [20:08:22] it is time to promote all wikis! [20:09:25] (03CR) 10Ideophagous: "Hello Urbanecm! I've finally found the time to redo this patch. Hopefully it'll go through this time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747903 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [20:11:39] hashar: fingers crossed [20:12:06] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747903 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [20:15:09] (03CR) 10jerkins-bot: [V: 04-1] arywiki NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747903 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [20:15:45] doing it now [20:15:55] (03PS1) 10Hashar: all wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747905 [20:15:57] (03CR) 10Hashar: [C: 03+2] all wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747905 (owner: 10Hashar) [20:18:06] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747905 (owner: 10Hashar) [20:19:35] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.13 refs T293954 [20:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:40] T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954 [20:22:55] beside some (max_statement_time exceeded) nothing fancy happening [20:24:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:34] (03CR) 10BBlack: [C: 03+2] sre.ganeti.makevm: support drmrs netbox sync [cookbooks] - 10https://gerrit.wikimedia.org/r/747890 (owner: 10Volans) [20:31:50] (03PS1) 10JHathaway: mirrors.wikimedia.org: point to new mirror [dns] - 10https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) [20:35:18] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:36:27] (03PS3) 10BBlack: Define enterprise names for redirects [dns] - 10https://gerrit.wikimedia.org/r/747168 (https://phabricator.wikimedia.org/T296445) [20:37:13] (03PS3) 10BBlack: Add MW and ncredir redirects for WME typo domains [puppet] - 10https://gerrit.wikimedia.org/r/747167 (https://phabricator.wikimedia.org/T296445) [20:39:21] (03PS4) 10Andrew Bogott: Add initial script to manage/automate cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/745917 (https://phabricator.wikimedia.org/T294429) [20:39:23] (03PS3) 10Andrew Bogott: Add simple script to backup cinder volumes according to yaml config [puppet] - 10https://gerrit.wikimedia.org/r/745926 (https://phabricator.wikimedia.org/T294429) [20:39:25] (03PS1) 10Andrew Bogott: Cinder: add backup job timer to one cloudcontrol in each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) [20:39:42] (03CR) 10BBlack: [C: 03+2] Add MW and ncredir redirects for WME typo domains [puppet] - 10https://gerrit.wikimedia.org/r/747167 (https://phabricator.wikimedia.org/T296445) (owner: 10BBlack) [20:40:48] (03CR) 10jerkins-bot: [V: 04-1] Cinder: add backup job timer to one cloudcontrol in 
each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) (owner: 10Andrew Bogott) [20:44:23] (03PS2) 10Andrew Bogott: Cinder: add backup job timer to one cloudcontrol in each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) [20:52:59] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ryankemper - https://phabricator.wikimedia.org/T297908 (10RKemper) [20:53:34] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ryankemper - https://phabricator.wikimedia.org/T297908 (10RKemper) [20:58:25] 10SRE-Access-Requests: Requesting access to RESOURCE for Brian King (bking@wikimedia.org) - https://phabricator.wikimedia.org/T297910 (10bking) [20:58:55] 10SRE-Access-Requests: Requesting access to LDAP groups for Brian King (bking@wikimedia.org) - https://phabricator.wikimedia.org/T297910 (10bking) [21:03:56] 10SRE-Access-Requests: Requesting access to LDAP groups for Brian King (bking@wikimedia.org) - https://phabricator.wikimedia.org/T297910 (10RKemper) [21:08:03] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ryankemper - https://phabricator.wikimedia.org/T297908 (10Ottomata) Approved! 
[21:08:32] (03CR) 10BBlack: [C: 03+2] Define enterprise names for redirects [dns] - 10https://gerrit.wikimedia.org/r/747168 (https://phabricator.wikimedia.org/T296445) (owner: 10BBlack) [21:08:59] I am off cause well train looks fine ;) [21:09:26] 10SRE-Access-Requests: Requesting access to LDAP groups for Brian King (bking@wikimedia.org) - https://phabricator.wikimedia.org/T297910 (10RKemper) [21:09:40] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:19] (03CR) 10SBassett: "Regardless of what is decided here, the Security Team would encourage the enablement of any process with less friction and one that is lik" [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [21:10:53] 10SRE-Access-Requests: Requesting access to LDAP groups for Brian King (bking@wikimedia.org) - https://phabricator.wikimedia.org/T297910 (10Ottomata) Approved! [21:15:13] (03PS3) 10Andrew Bogott: Cinder: add backup job timer to one cloudcontrol in each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) [21:20:04] 10SRE, 10Traffic: Enterprise redirects from .Org sites - https://phabricator.wikimedia.org/T296445 (10BBlack) 05Open→03Resolved These changes should be live now, please let me know if anything's amiss! [21:22:11] Hey dancy hashar: train look good for now? I had a quick revert of a change to PrivateSettings.php I'd like to sync soon, if possible. Thanks. [21:22:32] The train is good. 
[21:24:12] Thanks, dancy [21:26:40] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (10Dzahn) Thank you very much @akosiaris and @krobinson would love to move those over to ITS as its part of an epic task (to move all the al... [21:29:12] !log Reverted previous mitigation for T297416 [21:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:18] T297416: Restrict access to most actions on $wgWhitelistRead pages on private wikis - https://phabricator.wikimedia.org/T297416 [21:30:27] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) p:05Triage→03High [21:31:13] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) @MoritzMuehlenhoff, IIRC you were involved in the driver checks for the older H740 controller, would you still be involved for this or can you advise who might? 
[21:33:55] (03CR) 10Brennen Bearnes: logspam: Consolidate max_statement_time errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747876 (https://phabricator.wikimedia.org/T297708) (owner: 10Lucas Werkmeister (WMDE)) [21:34:05] (03PS4) 10Andrew Bogott: Cinder: add backup job timer to one cloudcontrol in each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) [21:34:39] 10SRE-Access-Requests: Requesting shell access for Brian King (bking@wikimedia.org) - https://phabricator.wikimedia.org/T297910 (10Dzahn) [21:35:32] 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting shell access for Brian King (bking@wikimedia.org) - https://phabricator.wikimedia.org/T297910 (10Dzahn) [21:36:48] (03PS3) 10Brennen Bearnes: logspam: discard upper-cased UTF-8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/726122 (https://phabricator.wikimedia.org/T292246) [21:36:53] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ryankemper - https://phabricator.wikimedia.org/T297908 (10Dzahn) [21:37:45] 10SRE, 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10Dzahn) [21:39:38] 10SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for komla - https://phabricator.wikimedia.org/T297621 (10Dzahn) [21:41:23] 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting shell access for Brian King (bking@wikimedia.org) - https://phabricator.wikimedia.org/T297910 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [21:41:29] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ryankemper - https://phabricator.wikimedia.org/T297908 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [21:41:51] 10SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for komla - https://phabricator.wikimedia.org/T297621 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [21:42:10] 10SRE, 
10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [21:42:28] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [21:44:20] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (10Dzahn) 05Open→03Resolved a:03Dzahn optimistically calling resolved based on previous comments [21:45:57] (03PS5) 10Andrew Bogott: Cinder: add backup job timer to one cloudcontrol in each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) [21:47:01] (03CR) 10jerkins-bot: [V: 04-1] Cinder: add backup job timer to one cloudcontrol in each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) (owner: 10Andrew Bogott) [21:51:53] 10SRE, 10Infrastructure-Foundations, 10Mail, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) [21:52:22] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) [21:53:00] (03PS4) 10Ladsgroup: logspam: discard upper-cased UTF-8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/726122 (https://phabricator.wikimedia.org/T292246) (owner: 10Brennen Bearnes) [21:53:36] (03CR) 10Ladsgroup: [C: 03+2] logspam: discard upper-cased UTF-8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/726122 (https://phabricator.wikimedia.org/T292246) (owner: 10Brennen Bearnes) [21:53:59] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: Forwards from
VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (10Dzahn) >>! In T297307#7574634, @akosiaris wrote: > I am inclined to resolve this task, but I think there might be a followup action item... [21:55:51] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (10Dzahn) P.S. Since you are all here. There is also open ticket T252932 which is called "Forwarding or alias for fundraising@" and you can... [21:57:38] (03PS2) 10Ladsgroup: logspam: Consolidate max_statement_time errors [puppet] - 10https://gerrit.wikimedia.org/r/747876 (https://phabricator.wikimedia.org/T297708) (owner: 10Lucas Werkmeister (WMDE)) [21:58:24] (03CR) 10Ladsgroup: [C: 03+2] logspam: Consolidate max_statement_time errors [puppet] - 10https://gerrit.wikimedia.org/r/747876 (https://phabricator.wikimedia.org/T297708) (owner: 10Lucas Werkmeister (WMDE)) [22:01:20] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti2007.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [22:01:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti2007.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [22:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:45] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [22:03:09] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2007. Ready to be powered off any time. 
It's the last one \o/ [22:04:03] (03PS6) 10Andrew Bogott: Cinder: add backup job timer to one cloudcontrol in each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) [22:04:10] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10MoritzMuehlenhoff) I'll look into this in early January [22:04:27] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (10Dzahn) a:05Dzahn→03None [22:07:07] (03PS7) 10Andrew Bogott: Cinder: add backup job timer to one cloudcontrol in each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) [22:10:02] (03PS8) 10Andrew Bogott: Cinder: add backup job timer to one cloudcontrol in each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) [22:17:28] (03CR) 10Dzahn: wdqs: switch GUI deployment from latest to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [22:19:59] (03CR) 10Andrew Bogott: [C: 03+2] Add initial script to manage/automate cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/745917 (https://phabricator.wikimedia.org/T294429) (owner: 10Andrew Bogott) [22:20:30] (03CR) 10Andrew Bogott: [C: 03+2] Add simple script to backup cinder volumes according to yaml config [puppet] - 10https://gerrit.wikimedia.org/r/745926 (https://phabricator.wikimedia.org/T294429) (owner: 10Andrew Bogott) [22:20:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:10] (03PS4) 10Andrew Bogott: Add simple script to backup cinder volumes according to yaml config [puppet] - 
10https://gerrit.wikimedia.org/r/745926 (https://phabricator.wikimedia.org/T294429) [22:28:05] (03CR) 10Dzahn: wdqs: switch GUI deployment from latest to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [22:29:18] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: add backup job timer to one cloudcontrol in each cluster [puppet] - 10https://gerrit.wikimedia.org/r/747937 (https://phabricator.wikimedia.org/T294429) (owner: 10Andrew Bogott) [22:30:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:25] (03CR) 10Dzahn: wdqs: switch GUI deployment from latest to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [22:31:55] (03Abandoned) 10Dzahn: wdqs: switch GUI deployment from latest to present [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [22:52:06] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@2dc8b8b] (eqiad): Update kartotherian-package to e843e8f [22:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:33] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@2dc8b8b] (eqiad): Update kartotherian-package to e843e8f (duration: 02m 27s) [22:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:27] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@2dc8b8b] (codfw): Update kartotherian-package to e843e8f [22:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:14] (03PS1) 10Ideophagous: arywiki NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) [22:56:51] (03Abandoned) 10Ideophagous: arywiki NS [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/747903 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [22:57:50] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@2dc8b8b] (codfw): Update kartotherian-package to e843e8f (duration: 02m 23s) [22:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:23] (03CR) 10Ideophagous: "Hello Urbanecm! Sorry for dragging this on. I thought the problem could be the lack of space after double slash, and wanted to try it out." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [23:04:10] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@4552dff] (eqiad): Move maxzoom configuration to the proper field [23:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:41] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@4552dff] (eqiad): Move maxzoom configuration to the proper field (duration: 02m 31s) [23:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:53] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@4552dff] (codfw): Move maxzoom configuration to the proper field [23:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:21] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@4552dff] (codfw): Move maxzoom configuration to the proper field (duration: 01m 28s) [23:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:27] (03PS1) 10Legoktm: Use $wgGroupInheritsPermissions for "confirmed" group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747977 (https://phabricator.wikimedia.org/T275334) [23:13:37] (03PS2) 10Legoktm: Use $wgGroupInheritsPermissions for "confirmed" group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747977 (https://phabricator.wikimedia.org/T275334) [23:23:05] :/ [23:23:11] wrong window 
[23:25:24] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:29] Anyone here have buttons on wikitech? My 2FA is broken. [23:27:57] (03PS1) 10Eric Gardner: Don't boot users with title="Special:MediaSearch" back to old search page [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747909 (https://phabricator.wikimedia.org/T297877) [23:30:30] (03CR) 10Dzahn: [C: 03+2] miscweb/static_tendril: add dbtree.wikimedia.org as ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/747662 (https://phabricator.wikimedia.org/T297605) (owner: 10Dzahn) [23:31:54] (03CR) 10Dzahn: "Notice: /Stage[main]/Httpd/File[/etc/apache2/sites-enabled/50-dbtree-wikimedia-org.conf]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/747662 (https://phabricator.wikimedia.org/T297605) (owner: 10Dzahn) [23:32:25] (03CR) 10Dzahn: "[cumin2002:~] $ curl -H "Host: dbtree.wikimedia.org" https://miscweb2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/747662 (https://phabricator.wikimedia.org/T297605) (owner: 10Dzahn) [23:33:23] (03CR) 10Dzahn: "[cumin1001:~] $ httpbb /srv/deployment/httpbb-tests/miscweb/test_miscweb.yaml --hosts miscweb2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/747662 (https://phabricator.wikimedia.org/T297605) (owner: 10Dzahn) [23:35:17] (03CR) 10Dzahn: [C: 03+2] miscweb/static-tendril: have separate apache access and error log [puppet] - 10https://gerrit.wikimedia.org/r/747665 (https://phabricator.wikimedia.org/T297605) (owner: 10Dzahn) [23:35:23] (03PS2) 10Dzahn: miscweb/static-tendril: have separate apache access and error log [puppet] - 10https://gerrit.wikimedia.org/r/747665 (https://phabricator.wikimedia.org/T297605) [23:36:41] (03CR) 10Dzahn: [C: 03+1] "still wanna do this or abandon because "k8s anyways"?" 
[puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [23:43:12] PROBLEM - Device not healthy -SMART- on ms-be2065 is CRITICAL: cluster=swift device=sat+megaraid,14 instance=ms-be2065 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2065&var-datasource=codfw+prometheus/ops [23:47:47] (03CR) 10Legoktm: [C: 03+1] "LGTM, I'd like joe or Daniel to also take a look in case they see any issues." [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [23:49:16] mutante: do the two miscweb servers auto rsync between each other or do I need to do it manually? [23:49:36] I assumed the auto rsync was the case, but I don't actually see any puppet code that would do it [23:55:13] legoktm: either run puppet on both hosts (if it pulls latest for you) or git pull on both hosts (via cumin?) if we set it to just "present" but no rsync between them because the microsites just pull from repo on both backends [23:55:32] this is the static-codereview dump which isn't in git [23:56:09] then we should probably add rsync code in puppet for it if there isn't any yet. .hmm. yea [23:56:45] IIRC we just copied static-bugzilla [23:56:51] I think all other microsites have a deploy repo though [23:57:02] this is just a one-time thing, it'll never need to rsync again [23:58:46] because of this I have "migration" rsync classes around for some things.. that then pop up in "unused puppet classes" ticket [23:59:38] my personal opinion even for one-time it is still easier to copy/paste code than start dealing with "disable iptables" and later "clean up rsyncd fragments"