[00:01:39] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:15:58] (PS1) Krinkle: ext.wikimediamessages.contactpage: Combine two minor modules [extensions/WikimediaMessages] (wmf/1.39.0-wmf.14) - https://gerrit.wikimedia.org/r/802118
[00:16:12] (CR) Krinkle: [C: +2] ext.wikimediamessages.contactpage: Combine two minor modules [extensions/WikimediaMessages] (wmf/1.39.0-wmf.14) - https://gerrit.wikimedia.org/r/802118 (owner: Krinkle)
[00:17:02] (CR) Krinkle: [C: +2] MetaContactPages: Update reference to `ext.wikimediamessages.contactpage` [mediawiki-config] - https://gerrit.wikimedia.org/r/801423 (owner: Krinkle)
[00:17:16] * Krinkle testing on mwdebug1002
[00:18:03] (Merged) jenkins-bot: MetaContactPages: Update reference to `ext.wikimediamessages.contactpage` [mediawiki-config] - https://gerrit.wikimedia.org/r/801423 (owner: Krinkle)
[00:18:23] (PS3) Krinkle: profiler: Turn from functions into class [mediawiki-config] - https://gerrit.wikimedia.org/r/796300 (https://phabricator.wikimedia.org/T308932)
[00:18:28] (PS2) Krinkle: Profiler: Update wmfSetupProfiler() call [mediawiki-config] - https://gerrit.wikimedia.org/r/801831 (https://phabricator.wikimedia.org/T308932)
[00:18:32] (PS2) Krinkle: Profiler: Remove temporary back-compat for wmfSetupProfiler() [mediawiki-config] - https://gerrit.wikimedia.org/r/801832 (https://phabricator.wikimedia.org/T308932)
[00:23:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START
helmfile.d/services/mwdebug: apply
[00:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[00:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:08] (Merged) jenkins-bot: ext.wikimediamessages.contactpage: Combine two minor modules [extensions/WikimediaMessages] (wmf/1.39.0-wmf.14) - https://gerrit.wikimedia.org/r/802118 (owner: Krinkle)
[00:34:15] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:38:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:39:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[00:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:29] !log krinkle@deploy1002 Synchronized php-1.39.0-wmf.14/extensions/WikimediaMessages/: I5a700cd3648 (duration: 03m 01s)
[00:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:04] (CR) Krinkle: [C: +1] "wmf.14 is scheduled to reach group2 by afternoon UTC."
[mediawiki-config] - https://gerrit.wikimedia.org/r/799437 (https://phabricator.wikimedia.org/T134809) (owner: Tim Starling)
[00:50:59] !log krinkle@deploy1002 Synchronized wmf-config/MetaContactPages.php: Ief1368fd959f428 (duration: 02m 56s)
[00:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:09] (PS4) Krinkle: profiler: Turn from functions into class [mediawiki-config] - https://gerrit.wikimedia.org/r/796300 (https://phabricator.wikimedia.org/T308932)
[00:57:52] (PS3) Krinkle: Profiler: Update wmfSetupProfiler() call [mediawiki-config] - https://gerrit.wikimedia.org/r/801831 (https://phabricator.wikimedia.org/T308932)
[00:58:40] (PS3) Krinkle: Profiler: Remove temporary back-compat for wmfSetupProfiler() [mediawiki-config] - https://gerrit.wikimedia.org/r/801832 (https://phabricator.wikimedia.org/T308932)
[01:00:03] (CR) Krinkle: [C: +2] profiler: Turn from functions into class [mediawiki-config] - https://gerrit.wikimedia.org/r/796300 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:00:46] (CR) Krinkle: [C: +2] profiler: Turn from functions into class [mediawiki-config] - https://gerrit.wikimedia.org/r/796300 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:01:01] (CR) Krinkle: [C: +2] Profiler: Update wmfSetupProfiler() call [mediawiki-config] - https://gerrit.wikimedia.org/r/801831 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:01:28] (Merged) jenkins-bot: profiler: Turn from functions into class [mediawiki-config] - https://gerrit.wikimedia.org/r/796300 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:01:46] (Merged) jenkins-bot: Profiler: Update wmfSetupProfiler() call [mediawiki-config] - https://gerrit.wikimedia.org/r/801831 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:01:59] (CR) Krinkle: [C: +2] Profiler: Remove temporary back-compat for wmfSetupProfiler()
[mediawiki-config] - https://gerrit.wikimedia.org/r/801832 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:02:40] (Merged) jenkins-bot: Profiler: Remove temporary back-compat for wmfSetupProfiler() [mediawiki-config] - https://gerrit.wikimedia.org/r/801832 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:02:51] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:05:08] !log krinkle@deploy1002 Synchronized src/Profiler.php: I93b3e43d32 (duration: 03m 16s)
[01:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:00] (CR) Krinkle: [C: -1] Add "db-mainstash" entry to $wgObjectCaches (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[01:06:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:06:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:34] !log krinkle@deploy1002 Synchronized wmf-config/PhpAutoPrepend.php: Iebd29aaa (duration: 02m 57s)
[01:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:11] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:13:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:14:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:41] !log krinkle@deploy1002 Synchronized src/Profiler.php: I257b41a45 (duration: 03m 15s)
[01:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:21] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:23:45] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:23:52] (CR) RLazarus: "The structure looks good!
But a couple of the tests are failing:" [puppet] - https://gerrit.wikimedia.org/r/802079 (owner: Lucas Werkmeister (WMDE))
[01:26:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:27:41] (PS7) Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820)
[01:28:04] (PS2) Krinkle: tests: Assert that wikiversions.json is complete as per all.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/796049 (https://phabricator.wikimedia.org/T308932)
[01:28:08] (CR) Krinkle: [C: +2] tests: Assert that wikiversions.json is complete as per all.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/796049 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:28:51] (Merged) jenkins-bot: tests: Assert that wikiversions.json is complete as per all.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/796049 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:30:32] (CR) CI reject: [V: -1] Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[01:35:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:28] (CR) Krinkle: [C: +2] CommonSettings: Remove redundant array_search and missing.php ref [mediawiki-config] - https://gerrit.wikimedia.org/r/796050 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:36:12] !log mwdebug-deploy@deploy1002 helmfile
[eqiad] DONE helmfile.d/services/mwdebug: apply
[01:36:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:06] (CR) Tim Starling: "I set up ATS locally and tested the module. There were some surprises, which I've documented in the file." [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[01:37:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:33] !log krinkle@deploy1002 Synchronized multiversion/: Id9b34b755230 no-op (duration: 03m 12s)
[01:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:36] (PS3) Krinkle: CommonSettings: Remove redundant array_search and missing.php ref [mediawiki-config] - https://gerrit.wikimedia.org/r/796050 (https://phabricator.wikimedia.org/T308932)
[01:40:58] (CR) Krinkle: [C: +2] CommonSettings: Remove redundant array_search and missing.php ref [mediawiki-config] - https://gerrit.wikimedia.org/r/796050 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:41:48] (Merged) jenkins-bot: CommonSettings: Remove redundant array_search and missing.php ref [mediawiki-config] - https://gerrit.wikimedia.org/r/796050 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[01:42:02] (PS8) Tim Starling: Implement MediaWiki multi-DC traffic component [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820)
[01:46:17] Puppet, SRE, Cloud-Services, Infrastructure-Foundations, Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411 (Krinkle) For the record,
about "cluster", "dc" and "servergroup" - I took a stab at unifying this as outlined at !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:48:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:17] SRE, Beta-Cluster-Infrastructure, Traffic, Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (Krinkle) Open→Resolved Appears resolved. Unless there is a specific common cause recurring, I assume ther...
[01:50:24] Puppet, Beta-Cluster-Infrastructure, Infrastructure-Foundations, Beta-Cluster-reproducible: Beta cluster MediaWiki code not updating - https://phabricator.wikimedia.org/T300591 (Krinkle) Open→Resolved
[01:52:49] SRE, Beta-Cluster-Infrastructure, Technical-Debt, Tracking-Neverending: Minimize infrastructure differences between Beta Cluster and production - https://phabricator.wikimedia.org/T87220 (Krinkle)
[02:00:23] (CR) Krinkle: "Does this/should this affect Beta Cluster appservers?
ref https://phabricator.wikimedia.org/T237033#7975492" [puppet] - https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[02:00:31] SRE, Beta-Cluster-Infrastructure, Scap, serviceops, Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (Krinkle) @thcipriani @dancy I believe the equivalent of the `beta-scap-eqiad` job from back then (which is n...
[02:01:47] (CR) Krinkle: [C: +2] "Test case: https://nds.wikiversity.org/ - LGTM before and after." [mediawiki-config] - https://gerrit.wikimedia.org/r/796050 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[02:04:06] !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: Ic0e134c61d6 (duration: 03m 02s)
[02:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:04:51] RECOVERY - Check systemd state on cloudelastic1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:02] !log krinkle@deploy1002 Synchronized docroot/noc/: Ic0e134c61d6 (duration: 03m 20s)
[02:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:27] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:17:07] (PS1) Andrew Bogott: Cloud VMs: manage resolv.conf with cloud-init [puppet] - https://gerrit.wikimedia.org/r/802220
[02:31:53] (PS2) Andrew Bogott: Cloud VMs: manage resolv.conf with cloud-init [puppet] - https://gerrit.wikimedia.org/r/802220
[02:42:29] (PS3) Andrew Bogott: Cloud VMs: manage resolv.conf with cloud-init [puppet] - https://gerrit.wikimedia.org/r/802220
[02:44:59] (PS4) Andrew Bogott: Cloud VMs: manage resolv.conf with cloud-init [puppet] - https://gerrit.wikimedia.org/r/802220
[02:45:52] (CR) CI reject: [V: -1] Cloud VMs: manage resolv.conf with cloud-init [puppet] - https://gerrit.wikimedia.org/r/802220 (owner: Andrew Bogott)
[02:45:56] (PS5) Andrew Bogott: Cloud VMs: manage resolv.conf with cloud-init [puppet] - https://gerrit.wikimedia.org/r/802220
[02:46:50] (CR) CI reject: [V: -1] Cloud VMs: manage resolv.conf with cloud-init [puppet] - https://gerrit.wikimedia.org/r/802220 (owner: Andrew Bogott)
[02:50:46] (PS6) Andrew Bogott: Cloud VMs: manage resolv.conf with cloud-init [puppet] - https://gerrit.wikimedia.org/r/802220
[02:52:58] (PS11) Tim Starling: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[02:53:00] (PS3) Tim Starling: Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129)
[02:53:02] (PS7) Andrew Bogott: Cloud VMs: manage resolv.conf with cloud-init [puppet] - https://gerrit.wikimedia.org/r/802220
[02:54:04] (CR) Tim Starling: Add "db-mainstash" entry to $wgObjectCaches (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[02:56:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:12:01] (CR) Andrew Bogott: [C: -2] "This doesn't work because the resolv conf module isn't supported by cloudinit on Debian."
[puppet] - https://gerrit.wikimedia.org/r/802220 (owner: Andrew Bogott)
[03:12:06] (PS3) Tim Starling: Enable SSL for master DB connections in the secondary datacenter [mediawiki-config] - https://gerrit.wikimedia.org/r/799437 (https://phabricator.wikimedia.org/T134809)
[03:12:08] (PS4) Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809)
[03:12:10] (PS3) Tim Starling: Clean up scap sequencing workaround [mediawiki-config] - https://gerrit.wikimedia.org/r/801836
[03:12:21] (CR) Tim Starling: Add the master from the primary DC to the secondary DC load arrays (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: Tim Starling)
[03:26:15] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:27:21] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:37:31] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:22:47] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:22:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:23:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:24:21] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (logstash2023), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:24:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:25:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:31:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:32:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[04:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:32:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[04:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:39] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:38:39] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:41:03] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:45:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:45:57] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:52:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.215 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:52:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:05:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s7 T309617
[05:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:25] T309617: Switchover s7 master (db1181 -> db1136) - https://phabricator.wikimedia.org/T309617
[05:05:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s7 T309617
[05:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1136 with weight 0 T309617', diff saved to https://phabricator.wikimedia.org/P29325 and previous config saved to /var/cache/conftool/dbconfig/20220602-050559-ladsgroup.json
[05:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:29] (PS3) Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/801684 (https://phabricator.wikimedia.org/T309617)
[05:08:37] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1136 to s7 master [puppet] -
https://gerrit.wikimedia.org/r/801684 (https://phabricator.wikimedia.org/T309617) (owner: Ladsgroup)
[05:09:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298560)', diff saved to https://phabricator.wikimedia.org/P29326 and previous config saved to /var/cache/conftool/dbconfig/20220602-050937-ladsgroup.json
[05:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:40] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[05:13:36] (PS1) Marostegui: Revert "db2088: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/802119
[05:14:47] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:14:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2088 (s1 and s2) T309485', diff saved to https://phabricator.wikimedia.org/P29327 and previous config saved to /var/cache/conftool/dbconfig/20220602-051451-marostegui.json
[05:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:56] (CR) Marostegui: [C: +2] Revert "db2088: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/802119 (owner: Marostegui)
[05:14:56] T309485: db2088 crashed - https://phabricator.wikimedia.org/T309485
[05:15:20] SRE, ops-codfw, DBA: db2088 crashed - https://phabricator.wikimedia.org/T309485 (Marostegui) Open→Resolved a:Marostegui→Papaul db2088 is back in sync with both s1 and s2 master. I have repooled it. Closing this for now. If it happens again we should probably just decommission it. Tha...
[05:15:46] !log T309720 Finished manual rolling restart of `cloudelastic` cluster to get new S3 plugin operational
[05:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:50] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720
[05:24:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P29328 and previous config saved to /var/cache/conftool/dbconfig/20220602-052442-ladsgroup.json
[05:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:33:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1137 in x1 with minimal weight to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29329 and previous config saved to /var/cache/conftool/dbconfig/20220602-053340-marostegui.json
[05:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:45] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679
[05:39:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P29330 and previous config saved to /var/cache/conftool/dbconfig/20220602-053947-ladsgroup.json
[05:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:55] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:41:11] (CR) Slyngshede: [C:
+2] Add clarification to comment, to help avoid mistakes using httpd::site. [puppet] - https://gerrit.wikimedia.org/r/797110 (owner: Slyngshede)
[05:48:24] (PS1) Slyngshede: WIP: P:aptrepo::wikimedia switch public apt repo to use new define. [puppet] - https://gerrit.wikimedia.org/r/802355
[05:52:51] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35664/console" [puppet] - https://gerrit.wikimedia.org/r/802355 (owner: Slyngshede)
[05:54:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298560)', diff saved to https://phabricator.wikimedia.org/P29331 and previous config saved to /var/cache/conftool/dbconfig/20220602-055452-ladsgroup.json
[05:54:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[05:54:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[05:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:57] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[05:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298560)', diff saved to https://phabricator.wikimedia.org/P29332 and previous config saved to /var/cache/conftool/dbconfig/20220602-055500-ladsgroup.json
[05:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:05] kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover .
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220602T0600).
[06:00:08] !log Starting s7 eqiad failover from db1181 to db1136 - T309617
[06:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:12] T309617: Switchover s7 master (db1181 -> db1136) - https://phabricator.wikimedia.org/T309617
[06:00:15] o/
[06:00:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T309617', diff saved to https://phabricator.wikimedia.org/P29333 and previous config saved to /var/cache/conftool/dbconfig/20220602-060016-ladsgroup.json
[06:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:44] RO confirmed
[06:00:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1136 to s7 primary and set section read-write T309617', diff saved to https://phabricator.wikimedia.org/P29334 and previous config saved to /var/cache/conftool/dbconfig/20220602-060053-ladsgroup.json
[06:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:12] and done
[06:01:34] Recent changes seems to be moving
[06:03:30] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 190 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:03:49] that should be us
[06:04:19] (PS3) Ladsgroup: wmnet: Update s7-master CNAME [dns] - https://gerrit.wikimedia.org/r/801685 (https://phabricator.wikimedia.org/T309617)
[06:04:33] (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s7-master CNAME [dns] - https://gerrit.wikimedia.org/r/801685 (https://phabricator.wikimedia.org/T309617) (owner: Ladsgroup)
[06:04:35] Amir1: orchestrator still showing lag
[06:04:49] hmm
[06:04:56] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50
gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:05:02] Are you following the steps order? [06:05:14] yup [06:05:23] So was heartbeat cleaned? [06:05:53] (03PS2) 10Slyngshede: WIP: P:aptrepo::wikimedia switch public apt repo to use new define. [puppet] - 10https://gerrit.wikimedia.org/r/802355 [06:06:16] I think it's not done yet [06:06:26] ok [06:06:47] that's why I was asking if the steps are being followed in case it was done but didn't work or if it wasn't [06:06:53] I can do it if you want [06:07:11] sure [06:07:15] that'd be amazing [06:07:20] ok [06:08:42] fixed [06:08:55] \o/ [06:10:10] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35665/console" [puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [06:10:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1181 T309617', diff saved to https://phabricator.wikimedia.org/P29335 and previous config saved to /var/cache/conftool/dbconfig/20220602-061039-ladsgroup.json [06:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:43] T309617: Switchover s7 master (db1181 -> db1136) - https://phabricator.wikimedia.org/T309617 [06:11:15] Amir1: Remember to edit db1181 to give the weight db1136 had before (otherwise it will be pooled with 0 weight) [06:11:42] marostegui: we need to depool it for maint [06:11:42] You can do that when it is depooled too [06:11:54] oh, how can I edit it? [06:12:04] Amir1: yes, what I mean is db1181 current weight is 0 and if you pool it back, it will still have weight [06:12:07] Amir1: dbctl instance db1181 edit [06:12:13] awesome [06:12:19] just edit the weight and give the same weight db1136 had before [06:12:47] https://phabricator.wikimedia.org/P29325 [06:12:51] 400 it should be [06:13:20] yep! 
[06:13:26] marostegui: done [06:13:43] Amir1: but it is pooled, isn't it? [06:13:44] (03PS3) 10Slyngshede: WIP: P:aptrepo::wikimedia switch public apt repo to use new define. [puppet] - 10https://gerrit.wikimedia.org/r/802355 [06:13:52] I depooled it [06:14:02] ah good [06:14:16] okay, it seems it's all done [06:14:23] zarcillo checked? [06:14:44] (that item isn't marked) [06:16:30] it's correct [06:17:25] Amir1: mark that step as done, so we know it wasn't missed [06:17:38] done [06:17:48] \o/ [06:17:55] I'm closing the ticket and have some ideas on how to improve it later :D [06:18:01] cool! [06:18:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35666/console" [puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [06:18:36] I need to run some errands, will be back to run the schema changes, feel free to do whatever you like with it in the mean time [06:37:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1137 in x1 to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29336 and previous config saved to /var/cache/conftool/dbconfig/20220602-063710-marostegui.json [06:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:14] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [06:49:18] (03PS1) 10Muehlenhoff: Add Ferran Tufan to contributors [puppet] - 10https://gerrit.wikimedia.org/r/802423 [06:52:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1137 in x1 to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29337 and previous config saved to /var/cache/conftool/dbconfig/20220602-065203-marostegui.json [06:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:09] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 
[06:53:13] (03CR) 10Muehlenhoff: [C: 03+2] Add Ferran Tufan to contributors [puppet] - 10https://gerrit.wikimedia.org/r/802423 (owner: 10Muehlenhoff) [06:59:13] (03PS1) 10Muehlenhoff: Add new bullseye IDPs to acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/802424 [07:00:05] Amir1 and apergos: How many deployers does it take to do UTC morning backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220602T0700). [07:00:50] hello [07:01:01] no trainees have signed up for the window today [07:01:15] there are also no patches scheduled for deployment [07:01:35] anyone who would like to self-deploy last minute can add themselves to the calendar [07:01:42] otherwise in about 15 minutes I will wander off. [07:05:46] !log installing systemd bugfix updates from last bullseye point release, also includes a minor security fix in systemd-tmpfiles [07:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1137 in x1 to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29338 and previous config saved to /var/cache/conftool/dbconfig/20220602-071547-marostegui.json [07:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:52] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [07:22:06] (03PS1) 10David Caro: ceph: remove buster repos and move to croit mirrors for the rest [puppet] - 10https://gerrit.wikimedia.org/r/802425 [07:22:57] (03PS2) 10David Caro: ceph: remove nautilus-buster repos and move to croit [puppet] - 10https://gerrit.wikimedia.org/r/802425 [07:24:06] (03PS4) 10Slyngshede: WIP: P:aptrepo::wikimedia switch public apt repo to use new define. [puppet] - 10https://gerrit.wikimedia.org/r/802355 [07:24:59] (03CR) 10CI reject: [V: 04-1] WIP: P:aptrepo::wikimedia switch public apt repo to use new define. 
[puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [07:25:48] (03PS5) 10Slyngshede: WIP: P:aptrepo::wikimedia switch public apt repo to use new define. [puppet] - 10https://gerrit.wikimedia.org/r/802355 [07:26:35] !log joal@deploy1002 Started deploy [analytics/refinery@ef68481]: Additional analytics weekly train [analytics/refinery@ef68481] [07:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:35] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:30:02] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35667/console" [puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [07:36:19] (03PS4) 10David Caro: Fix spelling errors [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801730 [07:38:28] (03PS5) 10David Caro: wmcs: added missing __init__.py and relted lint fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801732 [07:39:59] (03PS4) 10David Caro: Add readme, configure script and missing modules [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/799379 [07:41:09] (03PS6) 10Slyngshede: WIP: P:aptrepo::wikimedia switch public apt repo to use new define. [puppet] - 10https://gerrit.wikimedia.org/r/802355 [07:41:16] (03CR) 10David Caro: wmcs: added missing __init__.py and relted lint fixes (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801732 (owner: 10David Caro) [07:42:02] (03CR) 10CI reject: [V: 04-1] WIP: P:aptrepo::wikimedia switch public apt repo to use new define. [puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [07:43:09] 10SRE, 10Infrastructure-Foundations, 10Parsoid: Retire the old Parsoid deb repository? 
- https://phabricator.wikimedia.org/T309765 (10MoritzMuehlenhoff) [07:43:31] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 1 (logstash2023), Fresh: 115 jobs Jcrespo known T237224 - The acknowledgement expires at: 2022-06-02 12:42:59. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:45:43] (03CR) 10David Caro: [C: 03+2] wmcs: added missing __init__.py and relted lint fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801732 (owner: 10David Caro) [07:45:45] (03CR) 10David Caro: [C: 03+2] Fix spelling errors [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801730 (owner: 10David Caro) [07:47:02] (03PS7) 10Slyngshede: WIP: P:aptrepo::wikimedia switch public apt repo to use new define. [puppet] - 10https://gerrit.wikimedia.org/r/802355 [07:49:06] (03Merged) 10jenkins-bot: Fix spelling errors [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801730 (owner: 10David Caro) [07:49:41] (03Merged) 10jenkins-bot: wmcs: added missing __init__.py and relted lint fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/801732 (owner: 10David Caro) [07:51:09] !log joal@deploy1002 Finished deploy [analytics/refinery@ef68481]: Additional analytics weekly train [analytics/refinery@ef68481] (duration: 24m 33s) [07:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:57] (03PS8) 10Slyngshede: WIP: P:aptrepo::wikimedia switch public apt repo to use new define. 
[puppet] - 10https://gerrit.wikimedia.org/r/802355 [07:54:11] (03CR) 10Vgutierrez: [C: 03+1] Add new bullseye IDPs to acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/802424 (owner: 10Muehlenhoff) [07:54:48] !log joal@deploy1002 Started deploy [analytics/refinery@ef68481] (thin): Additional analytics weekly train THIN [analytics/refinery@ef68481] [07:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:56] !log joal@deploy1002 Finished deploy [analytics/refinery@ef68481] (thin): Additional analytics weekly train THIN [analytics/refinery@ef68481] (duration: 00m 08s) [07:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:35] !log joal@deploy1002 Started deploy [analytics/refinery@ef68481] (hadoop-test): Additional analytics weekly train TEST [analytics/refinery@ef68481] [07:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35669/console" [puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [07:59:35] (03CR) 10Slyngshede: WIP: P:aptrepo::wikimedia switch public apt repo to use new define. [puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [07:59:53] (03PS9) 10Slyngshede: P:aptrepo::wikimedia switch public apt repo to use new define. 
[puppet] - 10https://gerrit.wikimedia.org/r/802355 [08:01:01] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35670/console" [puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [08:02:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35671/console" [puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [08:03:09] !log joal@deploy1002 Finished deploy [analytics/refinery@ef68481] (hadoop-test): Additional analytics weekly train TEST [analytics/refinery@ef68481] (duration: 07m 33s) [08:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [08:09:45] 10SRE, 10LDAP-Access-Requests: Add Evelien WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T309700 (10MoritzMuehlenhoff) This also needs sign off by either one of @conny-kawohl_WMDE @WMDE-leszek @darthmon_wmde @Tobi_WMDE_SW @Lea_WMDE @karapayneWMDE [08:10:07] 10SRE, 10SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:10:29] 10SRE, 10Infrastructure-Foundations, 10Parsoid: Retire the old Parsoid deb repository? 
- https://phabricator.wikimedia.org/T309765 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:13:34] 10SRE, 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.8.1 - https://phabricator.wikimedia.org/T309116 (10JMeybohm) a:03JMeybohm [08:13:44] 10SRE, 10Infrastructure-Foundations: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:14:06] 10SRE, 10LDAP-Access-Requests: Add Evelien WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T309700 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:14:25] 10SRE, 10Sustainability (Incident Followup): get a legend for haproxy "anomalous session termination states" - https://phabricator.wikimedia.org/T308952 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:16:00] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:aptrepo::wikimedia switch public apt repo to use new define. [puppet] - 10https://gerrit.wikimedia.org/r/802355 (owner: 10Slyngshede) [08:22:13] (03PS1) 10Majavah: P:wmcs::prometheus: set openstack scrape_interval to 5m [puppet] - 10https://gerrit.wikimedia.org/r/802434 [08:26:02] (03PS2) 10Majavah: P:wmcs::prometheus: set openstack scrape_interval to 4m [puppet] - 10https://gerrit.wikimedia.org/r/802434 [08:27:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1137 in x1 to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29339 and previous config saved to /var/cache/conftool/dbconfig/20220602-082700-marostegui.json [08:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:05] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [08:28:15] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:32:01] !log imported scap 4.8.1 to stretch-/buster-/bullseye-wikimedia - T309116 [08:32:04] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:06] T309116: Deploy Scap version 4.8.1 - https://phabricator.wikimedia.org/T309116 [08:33:19] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/802424 (owner: 10Muehlenhoff) [08:34:32] (03PS2) 10Daniel Kinzler: EXPERIMENT: allow DB config reload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801721 (https://phabricator.wikimedia.org/T298485) [08:35:34] (03CR) 10Muehlenhoff: [C: 03+2] Add new bullseye IDPs to acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/802424 (owner: 10Muehlenhoff) [08:36:12] 10SRE, 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.8.1 - https://phabricator.wikimedia.org/T309116 (10JMeybohm) ` mwdebug1002:~$ scap pull Traceback (most recent call last): File "/usr/bin/scap", line 32, in from scap import cli File "/usr/lib/python3/dist-packa... [08:39:49] 10SRE, 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.8.1 - https://phabricator.wikimedia.org/T309116 (10JMeybohm) 05Open→03Stalled [08:43:24] (03CR) 10Jbond: "Just an FYI that i have done a bit of work on systemd-resolvd which i hope to start using in prod as soon as i have some time to prioritis" [puppet] - 10https://gerrit.wikimedia.org/r/802220 (owner: 10Andrew Bogott) [08:45:23] (03PS1) 10Majavah: wmcs: Add alert for Neutron agents being down [alerts] - 10https://gerrit.wikimedia.org/r/802442 (https://phabricator.wikimedia.org/T302377) [08:46:33] (03CR) 10Jbond: "LGTM from a puppet PoV" [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [08:48:59] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-06-01 09:11:21 (3102 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [08:49:06] (03CR) 10Jbond: [C: 03+1] "LGTM adding simon as they have been working on this module" [puppet] - 
10https://gerrit.wikimedia.org/r/802425 (owner: 10David Caro) [08:49:32] (03PS1) 10Physikerwelt: Explicitly set math rendering modes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802443 (https://phabricator.wikimedia.org/T309686) [08:51:58] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/802425 (owner: 10David Caro) [08:53:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1137 in x1 to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29340 and previous config saved to /var/cache/conftool/dbconfig/20220602-085357-marostegui.json [08:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:03] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [08:54:26] !log joal@deploy1002 Started deploy [airflow-dags/analytics@19cd054]: (no justification provided) [08:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:36] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@19cd054]: (no justification provided) (duration: 00m 09s) [08:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. 
Since you are removing a component, please make sure to also take care of https://wikitech.wikimedia.org/wiki/Reprepro#Removin" [puppet] - 10https://gerrit.wikimedia.org/r/802425 (owner: 10David Caro) [08:58:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/799379 (owner: 10David Caro) [08:59:59] (03CR) 10Jbond: [C: 03+2] check_netbox_report: add url to output [puppet] - 10https://gerrit.wikimedia.org/r/802075 (owner: 10Jbond) [09:00:33] (03Abandoned) 10Jbond: P:netbox: Add http proxy support to reports [puppet] - 10https://gerrit.wikimedia.org/r/802095 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [09:03:18] (03CR) 10Jbond: [C: 03+2] P:backup::director: use new sudo_user parameter for nrpe::monitor_service [puppet] - 10https://gerrit.wikimedia.org/r/802165 (owner: 10Jbond) [09:06:57] (03CR) 10Muehlenhoff: [C: 03+2] Remove superflous comment [puppet] - 10https://gerrit.wikimedia.org/r/802174 (owner: 10Muehlenhoff) [09:11:29] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:14:48] (03PS1) 10Muehlenhoff: Apply idp role to idp1002/idp2002 [puppet] - 10https://gerrit.wikimedia.org/r/802444 (https://phabricator.wikimedia.org/T308214) [09:16:31] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:16:41] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:16:57] (03PS1) 10Slyngshede: P::aptrepo::wikimedia install Apache for private repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/802445 [09:19:29] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:23:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35672/console" [puppet] - 10https://gerrit.wikimedia.org/r/802445 (owner: 10Slyngshede) [09:28:01] (03CR) 10Muehlenhoff: P::aptrepo::wikimedia install Apache for private repo. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802445 (owner: 10Slyngshede) [09:28:42] (03CR) 10Jelto: [C: 03+2] wikimedia.org: reduce TTL for gitlab A and AAAA to 5m [dns] - 10https://gerrit.wikimedia.org/r/802090 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:28:46] (03PS2) 10Jelto: wikimedia.org: reduce TTL for gitlab A and AAAA to 5m [dns] - 10https://gerrit.wikimedia.org/r/802090 (https://phabricator.wikimedia.org/T307142) [09:29:26] (03PS1) 10Jcrespo: backup: Cleanup bacula_check, make dependency explicit [puppet] - 10https://gerrit.wikimedia.org/r/802467 [09:33:57] (03CR) 10Jcrespo: [C: 04-1] "How is this compatible with the other patches merged at https://phabricator.wikimedia.org/T274463 ?" [puppet] - 10https://gerrit.wikimedia.org/r/677970 (https://phabricator.wikimedia.org/T274463) (owner: 10Jbond) [09:39:03] (03CR) 10Jcrespo: "tried to fix the issue using your advice. I tried to test-unit it to prevent it in the future." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) (owner: 10Jcrespo) [09:39:07] (03PS4) 10Jcrespo: Use the shlex.quote method to escape hosts and paths [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) [09:39:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:39:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:29] (03CR) 10CI reject: [V: 04-1] Use the shlex.quote method to escape hosts and paths [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) (owner: 10Jcrespo) [09:44:11] (03CR) 10Jcrespo: "Hey, Moritz," [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [09:52:12] (03Abandoned) 10Jcrespo: mariadb: Increase core memory usage to 80% of physical memory [puppet] - 10https://gerrit.wikimedia.org/r/455769 (owner: 10Jcrespo) [09:52:37] (03PS1) 10Majavah: gridengine: default to buster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802470 (https://phabricator.wikimedia.org/T277653) [09:53:41] (03Abandoned) 10Jcrespo: mariadb: Remove m1 references to old database bacula, leave only bacula9 [puppet] - 10https://gerrit.wikimedia.org/r/658970 (https://phabricator.wikimedia.org/T260717) (owner: 10Jcrespo) [09:53:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [09:53:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance 
[09:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:00] (03CR) 10Majavah: [C: 03+2] "https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/CUWV6ML7NBLST2XE57BWYM6MV2FVQYOR/" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802470 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [09:55:02] (03Abandoned) 10Jcrespo: [WIP] Starting to cleanup mariadb templating structure [puppet] - 10https://gerrit.wikimedia.org/r/324915 (https://phabricator.wikimedia.org/T93645) (owner: 10Jcrespo) [09:56:07] (03Abandoned) 10Jcrespo: [WIP] Create scripts for batch sql execution [puppet] - 10https://gerrit.wikimedia.org/r/338809 (owner: 10Jcrespo) [09:56:12] (03Merged) 10jenkins-bot: gridengine: default to buster [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802470 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [09:57:32] (03PS1) 10Majavah: d/changelog: Prepare for 0.84 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802471 [09:58:07] (03CR) 10Majavah: [C: 03+2] d/changelog: Prepare for 0.84 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802471 (owner: 10Majavah) [09:58:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) @Cmjohnson hey are you able to take care of the BIOS / RAID setup for these hosts? All should be ready for normal deploy anyway... [10:00:01] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.84 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802471 (owner: 10Majavah) [10:00:04] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220602T1000). [10:02:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:02:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:20] (03PS1) 10Jelto: wikimedia.org: make gitlab1004 the new gitlab production host [dns] - 10https://gerrit.wikimedia.org/r/802473 (https://phabricator.wikimedia.org/T307142) [10:14:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:14:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:25] (03CR) 10David Caro: [C: 03+2] "Let's play with this yes, though given the current instability we might have to tweak it a few times before finding a good timing." 
[puppet] - 10https://gerrit.wikimedia.org/r/802434 (owner: 10Majavah) [10:17:37] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10zhuyifei1999) [10:17:41] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:18:18] (03CR) 10David Caro: wmcs: Add alert for Neutron agents being down (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/802442 (https://phabricator.wikimedia.org/T302377) (owner: 10Majavah) [10:19:33] (03PS2) 10Majavah: wmcs: Add alert for Neutron agents being down [alerts] - 10https://gerrit.wikimedia.org/r/802442 (https://phabricator.wikimedia.org/T302377) [10:20:01] (03CR) 10Majavah: wmcs: Add alert for Neutron agents being down (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/802442 (https://phabricator.wikimedia.org/T302377) (owner: 10Majavah) [10:22:57] (03CR) 10David Caro: [C: 03+2] wmcs: Add alert for Neutron agents being down (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/802442 (https://phabricator.wikimedia.org/T302377) (owner: 10Majavah) [10:23:13] (03CR) 10David Caro: [C: 03+2] wmcs: Add alert for Neutron agents being down (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/802442 (https://phabricator.wikimedia.org/T302377) (owner: 10Majavah) [10:25:03] (03Merged) 10jenkins-bot: wmcs: Add alert for Neutron agents being down [alerts] - 10https://gerrit.wikimedia.org/r/802442 (https://phabricator.wikimedia.org/T302377) (owner: 10Majavah) [10:25:48] (03Abandoned) 10Jbond: O:gitlab: add config for backup sets [puppet] - 10https://gerrit.wikimedia.org/r/677970 (https://phabricator.wikimedia.org/T274463) (owner: 10Jbond) [10:28:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:28:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:08] (03Abandoned) 10Jcrespo: acct: Add 2 line cron patch to mitigate cronspam [puppet] - 10https://gerrit.wikimedia.org/r/569532 (https://phabricator.wikimedia.org/T167035) (owner: 10Jcrespo) [10:32:23] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:32:41] (03CR) 10Jbond: "See note i don't think we need this" [puppet] - 10https://gerrit.wikimedia.org/r/802467 (owner: 10Jcrespo) [10:36:41] (03PS1) 10Jbond: CONTRIBUTORS: Add YiFei Zhu [puppet] - 10https://gerrit.wikimedia.org/r/802474 [10:37:15] (03PS2) 10Jbond: CONTRIBUTORS: Add YiFei Zhu [puppet] - 10https://gerrit.wikimedia.org/r/802474 (https://phabricator.wikimedia.org/T308013) [10:37:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] CONTRIBUTORS: Add YiFei Zhu [puppet] - 10https://gerrit.wikimedia.org/r/802474 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [10:39:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:40:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:14] (03PS5) 10David Caro: Add readme, configure script and missing modules [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/799379 [10:40:16] (03CR) 10David Caro: Add readme, configure script and missing modules (034 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/799379 (owner: 10David Caro) 
[10:40:19] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:40:22] (03CR) 10Jcrespo: "Thank you for the input. I am doing a bit of cleanup, hopefully getting rid of unneeded old patches :-)" [puppet] - 10https://gerrit.wikimedia.org/r/677970 (https://phabricator.wikimedia.org/T274463) (owner: 10Jbond) [10:40:52] (03CR) 10David Caro: [C: 03+2] "Thanks for the review!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/799379 (owner: 10David Caro) [10:45:30] (03Merged) 10jenkins-bot: Add readme, configure script and missing modules [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/799379 (owner: 10David Caro) [10:50:28] (03PS2) 10Jcrespo: backup: Cleanup bacula_check [puppet] - 10https://gerrit.wikimedia.org/r/802467 [10:51:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:51:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:13] (03CR) 10Jcrespo: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/802467 (owner: 10Jcrespo) [11:03:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:03:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:17] (03CR) 10Muehlenhoff: "I love the idea, but haven't found the time to have a closer look so far, will do so 
next week." [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [11:12:59] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10MoritzMuehlenhoff) This was discussed in the Infrastructure Foundations team meeting and was found to be a okay (to grant the permi... [11:13:44] 10SRE, 10Infrastructure-Foundations, 10serviceops: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10MoritzMuehlenhoff) [11:15:40] (03CR) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [11:16:40] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10MoritzMuehlenhoff) p:05Triage→03High Severity is unclear to me from just reading the task, but since we dislike unnecessa... 
[11:20:47] (03CR) 10Jbond: [C: 03+1] backup: Cleanup bacula_check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802467 (owner: 10Jcrespo) [11:21:38] (03Abandoned) 10Jcrespo: mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [11:21:41] (03PS8) 10David Caro: wmcs: Added taskircmail, ircmail and pageircmail routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 [11:21:43] (03CR) 10David Caro: wmcs: Added taskircmail, ircmail and pageircmail routings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [11:21:45] (03PS1) 10David Caro: alertmanager.yml.erb: use facts directly instead of lookupvar [puppet] - 10https://gerrit.wikimedia.org/r/802489 [11:23:05] (03CR) 10Jbond: [C: 03+1] "LGTM couple of nit/Qs" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802178 (owner: 10Volans) [11:23:29] !log installing sysvinit-utils bugfix updates from last bullseye point release [11:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:24] (03CR) 10Jcrespo: [C: 03+2] "Don't feel strongly :-)" [puppet] - 10https://gerrit.wikimedia.org/r/802467 (owner: 10Jcrespo) [11:31:19] !log Restarted Gerrit on replica gerrit2001 [11:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:31] (03CR) 10Jbond: [C: 03+1] "nit tested but lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [11:38:34] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [11:40:48] !log installing tasksel updates from bullseye point release [11:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:15] !log installing python-pip bugfix updates from bullseye point release [11:44:17] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:04] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [11:57:24] (03CR) 10Muehlenhoff: [C: 03+2] Apply idp role to idp1002/idp2002 [puppet] - 10https://gerrit.wikimedia.org/r/802444 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [12:02:56] (03PS1) 10David Caro: tools: refresh prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/802494 (https://phabricator.wikimedia.org/T308402) [12:03:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1137 in x1 to test 10.6.8 T309679 ', diff saved to https://phabricator.wikimedia.org/P29343 and previous config saved to /var/cache/conftool/dbconfig/20220602-120320-marostegui.json [12:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:25] T309679: Migrate a x1 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T309679 [12:14:19] (03CR) 10Majavah: [C: 03+1] "the cert looks good (although I don't think these changes really need review anyways)" [puppet] - 10https://gerrit.wikimedia.org/r/802494 (https://phabricator.wikimedia.org/T308402) (owner: 10David Caro) [12:14:57] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:15:07] !log joal@deploy1002 Started deploy [airflow-dags/analytics@19b943d]: (no justification provided) [12:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:16] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@19b943d]: (no justification provided) (duration: 00m 09s) [12:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:24] 10SRE, 10LDAP-Access-Requests: Add Evelien WMDE to the ldap/wmde and ldap/nda group - 
https://phabricator.wikimedia.org/T309700 (10Tobi_WMDE_SW) >>! In T309700#7975820, @MoritzMuehlenhoff wrote: > This also needs sign off by either one of @conny-kawohl_WMDE @WMDE-leszek @darthmon_wmde @Tobi_WMDE_SW @Lea_WMDE @... [12:16:44] (03CR) 10David Caro: [C: 03+2] tools: refresh prometheus certificate [puppet] - 10https://gerrit.wikimedia.org/r/802494 (https://phabricator.wikimedia.org/T308402) (owner: 10David Caro) [12:17:41] (03PS2) 10Slyngshede: P::aptrepo::wikimedia install Apache for private repo. [puppet] - 10https://gerrit.wikimedia.org/r/802445 [12:19:31] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) (owner: 10Majavah) [12:21:55] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35673/console" [puppet] - 10https://gerrit.wikimedia.org/r/802445 (owner: 10Slyngshede) [12:22:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) (owner: 10Majavah) [12:23:05] (03CR) 10Slyngshede: [C: 03+2] aptrepo: add thirdparty/kubeadm-k8s-1-22 [puppet] - 10https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) (owner: 10Majavah) [12:26:22] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/802489 (owner: 10David Caro) [12:28:47] (03CR) 10Muehlenhoff: P::aptrepo::wikimedia install Apache for private repo. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802445 (owner: 10Slyngshede) [12:32:24] (03PS1) 10Stang: itwikiversity: Correct typo of "markbotedits" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802498 (https://phabricator.wikimedia.org/T309750) [12:33:55] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:37:52] (03PS3) 10Slyngshede: P::aptrepo::wikimedia install Apache for private repo. [puppet] - 10https://gerrit.wikimedia.org/r/802445 [12:38:45] (03CR) 10CI reject: [V: 04-1] P::aptrepo::wikimedia install Apache for private repo. [puppet] - 10https://gerrit.wikimedia.org/r/802445 (owner: 10Slyngshede) [12:39:05] (03CR) 10JMeybohm: [C: 03+1] "This should be a no-op as the lvs stanza is completely missing. So obviously no need to to the LVS restart dance." [puppet] - 10https://gerrit.wikimedia.org/r/799357 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [12:39:34] (03CR) 10JMeybohm: [C: 03+1] service: image-suggestion state to monitoring_setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799358 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [12:39:37] (03PS1) 10Cathal Mooney: Add cloudsw1-e4 and cloudsw1-f4 to mgmt and adjust existing cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989) [12:39:39] (03PS4) 10Slyngshede: P::aptrepo::wikimedia install Apache for private repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/802445 [12:39:55] (03CR) 10JMeybohm: [C: 03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/799998 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [12:42:07] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:46:30] (03PS9) 10David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 [12:47:55] (03PS1) 10Kevin Bazira: ml-services: add svwiki & trwiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/802500 (https://phabricator.wikimedia.org/T307418) [12:48:00] (03PS8) 10David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - 10https://gerrit.wikimedia.org/r/802074 [12:48:14] (03PS2) 10David Caro: alertmanager.yml.erb: use facts directly instead of lookupvar [puppet] - 10https://gerrit.wikimedia.org/r/802489 [12:48:43] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35674/console" [puppet] - 10https://gerrit.wikimedia.org/r/802445 (owner: 10Slyngshede) [12:49:33] (03PS1) 10Jcrespo: mediabackups: Add test units for the Util helper unit [software/mediabackups] - 10https://gerrit.wikimedia.org/r/802501 (https://phabricator.wikimedia.org/T262668) [12:49:41] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:52:22] (03CR) 10David Caro: [C: 03+2] ceph: remove nautilus-buster repos and move to croit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802425 (owner: 10David Caro) [12:53:33] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes 
[12:53:55] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 97, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:55:50] (03CR) 10Cathal Mooney: "Overall LGTM... really nice work! Only nit I would have is that we should probably make a similar addition to templates/asw/bgp_overlay.c" [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [12:57:37] PROBLEM - Memcached on idp2002 is CRITICAL: connect to address 208.80.153.108 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [12:57:41] (03CR) 10Cathal Mooney: Add BGP configuration for the new ML staging codfw cluster (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [12:59:13] 10SRE, 10ops-eqiad, 10DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (10Cmjohnson) {F35200013}. Attached is the final list for recycling. @wiki_willy Disks 318 2.5" ssds/disks 249 3.5" disks [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220602T1300). [13:00:04] MatmaRex and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:01:14] i can deploy today! [13:01:20] great! [13:01:27] (was about to write I can’t until :45 ^^) [13:01:28] hi MatmaRex / koi, are you around? 
[13:01:39] I'm here [13:01:44] hello :) [13:02:19] hi [13:03:00] (03CR) 10Urbanecm: [C: 03+2] itwikiversity: Correct typo of "markbotedits" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802498 (https://phabricator.wikimedia.org/T309750) (owner: 10Stang) [13:03:47] (03Merged) 10jenkins-bot: itwikiversity: Correct typo of "markbotedits" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802498 (https://phabricator.wikimedia.org/T309750) (owner: 10Stang) [13:04:28] koi: pulled to mwdebug1001, please check [13:04:36] (03PS3) 10Urbanecm: Enable DiscussionTools automatic topic subscriptions as beta feature on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802214 (https://phabricator.wikimedia.org/T295425) (owner: 10Bartosz Dziewoński) [13:04:44] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools automatic topic subscriptions as beta feature on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802214 (https://phabricator.wikimedia.org/T295425) (owner: 10Bartosz Dziewoński) [13:04:48] LGTM, thanks! [13:05:13] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:17] syncing [13:05:27] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 120, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:05:31] (03Merged) 10jenkins-bot: Enable DiscussionTools automatic topic subscriptions as beta feature on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802214 (https://phabricator.wikimedia.org/T295425) (owner: 10Bartosz Dziewoński) [13:05:45] well, not syncing [13:06:09] 13:05:37 sync-file failed: Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "jnuche"; reason is "Scap is being updated" [13:06:33] sorry, bad timing, please try again [13:06:39] thanks! 
[13:06:45] now it works [13:06:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10cmooney) One thing to note as it's not been mentioned in the task description is that the '--enable-virtualization' flag should... [13:07:07] (03PS15) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 [13:07:09] (03PS1) 10Jbond: wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 [13:07:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:07:11] (03PS1) 10Jbond: wmflib: add resource reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802505 [13:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:08:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) The work here is largely complete, merging that last patch to add the new switches to monitoring should be t... 
[13:08:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:54] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 192c5356e1fb21ba820615085abcb2185fd1864c: itwikiversity: Correct typo of "markbotedits" (T309750) (duration: 03m 13s) [13:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:57] T309750: Replace markbotedit right with markbotedits in the patrollers group on it.wikiversity - https://phabricator.wikimedia.org/T309750 [13:10:00] koi: should be live now :) [13:10:24] MatmaRex: your first patch is at mwdebug1001, can you check? [13:10:27] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 146, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:10:42] (03PS3) 10Urbanecm: Launch DiscussionTools topic subscriptions a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801818 (https://phabricator.wikimedia.org/T304029) (owner: 10Bartosz Dziewoński) [13:10:58] (03CR) 10Urbanecm: [C: 03+2] Launch DiscussionTools topic subscriptions a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801818 (https://phabricator.wikimedia.org/T304029) (owner: 10Bartosz Dziewoński) [13:11:16] looking [13:11:33] (03PS2) 10Jbond: wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 [13:11:46] (03CR) 10CI reject: [V: 04-1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [13:11:59] (03Merged) 10jenkins-bot: Launch DiscussionTools topic subscriptions a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801818 (https://phabricator.wikimedia.org/T304029) (owner: 10Bartosz Dziewoński) [13:13:21] urbanecm: yep, looks good [13:13:50] (03CR) 10CI reject: [V: 04-1] wmflib: add new resource::capitalise function 
[puppet] - 10https://gerrit.wikimedia.org/r/802504 (owner: 10Jbond) [13:13:58] (03CR) 10CI reject: [V: 04-1] wmflib: add resource reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802505 (owner: 10Jbond) [13:14:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:15:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:16:07] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 346, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:16:07] syncing [13:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:37] (03CR) 10CI reject: [V: 04-1] wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 (owner: 10Jbond) [13:16:43] ACKNOWLEDGEMENT - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T307121 - The acknowledgement expires at: 2022-06-07 13:16:18. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:19:24] !log Restarting Gerrit [13:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:27] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 806b8367e3c91a2b6b0dd76cdc66e041199ae834: Enable DiscussionTools automatic topic subscriptions as beta feature on remaining wikis (T295425) (duration: 03m 21s) [13:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:31] T295425: [Config Change] Deploy Automatic Topic Subscriptions as Beta Feature at Remaining Wikis - https://phabricator.wikimedia.org/T295425 [13:19:52] oops sorry I forgot about the deployment windows :-\ [13:20:05] happens :) [13:20:22] looks like it is already back [13:20:55] great! [13:21:05] MatmaRex: your second patch is at mwdebug1001 [13:21:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:56] (03PS3) 10Jbond: wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 [13:22:01] urbanecm: also looks good [13:22:05] thanks, syncing [13:22:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:22:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:34] 10SRE, 10Infrastructure-Foundations, 10Parsoid: Retire the old Parsoid deb repository? - https://phabricator.wikimedia.org/T309765 (10ssastry) I think so. Parsoid/JS is no longer supported and won't get security releases either. If anyone on the team has any concerns, they will leave their comments here. 
[13:23:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:39] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3c12e779707e3982f973641e2b9c2522a429830f: Launch DiscussionTools topic subscriptions a/b test (T304029) (duration: 03m 16s) [13:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:44] T304029: Make config change to start Topic Subscriptions A/B Test - https://phabricator.wikimedia.org/T304029 [13:26:30] (03CR) 10CI reject: [V: 04-1] wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 (owner: 10Jbond) [13:26:32] MatmaRex: synced! [13:26:37] anything else, anyone? [13:26:53] thanks urbanecm [13:27:16] (03PS4) 10Jbond: wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 [13:27:25] (03CR) 10David Caro: Create REST api service to manage toolforge replica.my.cnf (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:29:16] i guess not [13:29:28] !log UTC afternoon B&C window done [13:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:52] jnuche: hashar: i'm done with deployments now :) [13:29:57] (03PS3) 10JMeybohm: Fix CI not failing on "helm template" errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/802137 [13:30:29] urbanecm: 👍 [13:30:42] (03CR) 10CI reject: [V: 04-1] Fix CI not failing on "helm template" errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/802137 (owner: 10JMeybohm) [13:31:40] (03CR) 10CI reject: [V: 04-1] wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 (owner: 10Jbond) [13:32:40] (03PS5) 10Jbond: wmflib: add new resource::capitalise function [puppet] - 
10https://gerrit.wikimedia.org/r/802504 [13:34:18] !log hashar@deploy1002 Started deploy [integration/docroot@b55f30e]: build: Updating eslint-config-wikimedia to 0.22.1 [13:34:19] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [13:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:22] (03CR) 10JMeybohm: "https://integration.wikimedia.org/ci/job/helm-lint/7458/console" [deployment-charts] - 10https://gerrit.wikimedia.org/r/802137 (owner: 10JMeybohm) [13:34:27] !log hashar@deploy1002 Finished deploy [integration/docroot@b55f30e]: build: Updating eslint-config-wikimedia to 0.22.1 (duration: 00m 08s) [13:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:03] (03CR) 10CI reject: [V: 04-1] wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 (owner: 10Jbond) [13:37:22] (03PS4) 10JMeybohm: Fix CI not failing on "helm template" errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/802137 [13:37:25] (03PS1) 10JMeybohm: Update outdated developer-portal fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/802526 [13:39:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (10Jclark-ctr) a:03Cmjohnson [13:39:13] (03PS6) 10Jbond: wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 [13:41:38] (03PS1) 10Andrew Bogott: wmcs-image-create: fix unzipping of .xz files [puppet] - 10https://gerrit.wikimedia.org/r/802527 [13:43:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [13:43:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 15): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35681/console" [puppet] - 10https://gerrit.wikimedia.org/r/802504 (owner: 10Jbond) [13:44:08] !log ALTER-ing system_auth replication strategy, AQS Cassandra cluster -- T307641 [13:44:09] (03CR) 10CI reject: [V: 04-1] wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 (owner: 10Jbond) [13:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:13] T307641: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 [13:45:19] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:46:00] (03PS9) 10Slyngshede: Rewrite logster::job to use systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) [13:46:57] (03CR) 10Slyngshede: Rewrite logster::job to use systemd timers. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:47:15] (03CR) 10JMeybohm: [C: 03+2] Update outdated developer-portal fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/802526 (owner: 10JMeybohm) [13:50:17] (03CR) 10Hnowlan: service: image-suggestion state to monitoring_setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799358 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [13:50:36] (03Merged) 10jenkins-bot: Update outdated developer-portal fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/802526 (owner: 10JMeybohm) [13:52:54] (03CR) 10David Caro: [C: 03+1] wmcs-image-create: fix unzipping of .xz files [puppet] - 10https://gerrit.wikimedia.org/r/802527 (owner: 10Andrew Bogott) [13:53:32] (03CR) 10Herron: [C: 03+1] mx: enable tainted data checking [puppet] - 10https://gerrit.wikimedia.org/r/801799 (https://phabricator.wikimedia.org/T286911) (owner: 10JHathaway) 
[13:54:06] (03PS7) 10Jbond: wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 [13:56:19] (03PS1) 10David Caro: ceph: filter out also dbgsym packages [puppet] - 10https://gerrit.wikimedia.org/r/802531 [13:57:20] !log joal@deploy1002 Started deploy [airflow-dags/analytics@2ad442e]: (no justification provided) [13:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:28] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@2ad442e]: (no justification provided) (duration: 00m 08s) [13:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:21] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-image-create: fix unzipping of .xz files [puppet] - 10https://gerrit.wikimedia.org/r/802527 (owner: 10Andrew Bogott) [14:00:08] (03PS2) 10Jbond: wmflib: add resource reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802505 [14:01:27] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) >>! In T303049#7898695, @JMeybohm wrote: > I finally managed to verify and document the steps needed to put a service under Ingress. I did also update... 
[14:04:17] (03CR) 10Cathal Mooney: Add BGP configuration for the new ML staging codfw cluster (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/802072 (https://phabricator.wikimedia.org/T302198) (owner: 10Elukey) [14:04:39] (03CR) 10CI reject: [V: 04-1] wmflib: add resource reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802505 (owner: 10Jbond) [14:07:22] (03CR) 10David Caro: "Example of dbgsym package:" [puppet] - 10https://gerrit.wikimedia.org/r/802531 (owner: 10David Caro) [14:11:02] (03PS2) 10David Caro: ceph: filter out also dbgsym packages [puppet] - 10https://gerrit.wikimedia.org/r/802531 (https://phabricator.wikimedia.org/T309786) [14:14:28] !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons. [14:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:19] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) [14:22:37] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:23:33] (03PS2) 10Jcrespo: mediabackups: Add test units for the Util helper unit [software/mediabackups] - 10https://gerrit.wikimedia.org/r/802501 (https://phabricator.wikimedia.org/T262668) [14:24:22] (03CR) 10AOkoth: [C: 03+2] vrts: rename module files and classes [puppet] - 10https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [14:24:40] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Add test units for the Util helper unit [software/mediabackups] - 10https://gerrit.wikimedia.org/r/802501 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [14:26:26] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) 
[14:29:51] (03PS3) 10Hnowlan: restbase-dev: change role of new hosts [puppet] - 10https://gerrit.wikimedia.org/r/766082 (https://phabricator.wikimedia.org/T295375) [14:36:12] (03PS2) 10David Caro: network.tests:Use correct object for site [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802158 [14:38:23] (03PS3) 10Jbond: wmflib: add resource reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802505 [14:40:13] (03PS8) 10Jbond: wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 [14:40:33] (03PS4) 10Jbond: wmflib: add resource reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802505 [14:41:42] (03CR) 10David Caro: [C: 03+2] network.tests:Use correct object for site [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802158 (owner: 10David Caro) [14:42:01] (03PS1) 10AOkoth: vrts: fix apache error when running puppet [puppet] - 10https://gerrit.wikimedia.org/r/802538 (https://phabricator.wikimedia.org/T309788) [14:43:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] Fix CI not failing on "helm template" errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/802137 (owner: 10JMeybohm) [14:45:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] vrts: fix apache error when running puppet [puppet] - 10https://gerrit.wikimedia.org/r/802538 (https://phabricator.wikimedia.org/T309788) (owner: 10AOkoth) [14:45:57] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10MoritzMuehlenhoff) As for "Thumbor currently runs in firejail, do we lose anything by dropping it in k8s", that's fine. firejail was our workaround for the original service abst... 
[14:46:41] (03CR) 10JMeybohm: [C: 03+2] Fix CI not failing on "helm template" errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/802137 (owner: 10JMeybohm) [14:47:03] (03CR) 10AOkoth: [C: 03+2] vrts: fix apache error when running puppet [puppet] - 10https://gerrit.wikimedia.org/r/802538 (https://phabricator.wikimedia.org/T309788) (owner: 10AOkoth) [14:47:29] (03Merged) 10jenkins-bot: network.tests:Use correct object for site [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802158 (owner: 10David Caro) [14:48:03] (03CR) 10Ladsgroup: [C: 03+1] "LGTM, you can deploy it in a backport window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802443 (https://phabricator.wikimedia.org/T309686) (owner: 10Physikerwelt) [14:48:49] (03PS5) 10Jbond: wmflib: add resource reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802505 [14:49:15] RECOVERY - Memcached on idp2002 is OK: TCP OK - 0.033 second response time on 208.80.153.108 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [14:49:48] (03Merged) 10jenkins-bot: Fix CI not failing on "helm template" errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/802137 (owner: 10JMeybohm) [14:51:16] (03CR) 10JMeybohm: "recheck (sorry for using you as guinea pig 😊)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [14:52:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 18): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35683/console" [puppet] - 10https://gerrit.wikimedia.org/r/802504 (owner: 10Jbond) [14:53:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib: add new resource::capitalise function [puppet] - 10https://gerrit.wikimedia.org/r/802504 (owner: 10Jbond) [14:53:18] (03CR) 10Muehlenhoff: P::aptrepo::wikimedia install Apache for private repo. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802445 (owner: 10Slyngshede) [14:53:41] (03CR) 10Jbond: [C: 03+2] wmflib: add resource reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802505 (owner: 10Jbond) [14:56:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1181.eqiad.wmnet with reason: Maintenance [14:56:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1181.eqiad.wmnet with reason: Maintenance [14:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:37] (03CR) 10KartikMistry: Update cxserver to 2022-05-31-123738-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [14:59:23] (03PS1) 10Jbond: P:sretest: Test out new wmflib::resource::reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802540 [14:59:51] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: restart to enable S3 plugin - bking@cumin1001 - T309720 [14:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:55] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720 [15:02:12] (03PS2) 10Jbond: P:sretest: Test out new wmflib::resource::reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802540 [15:05:51] (03PS1) 10Muehlenhoff: Failover idp.w.o to idp1002 (new Bullseye node) [dns] - 10https://gerrit.wikimedia.org/r/802541 (https://phabricator.wikimedia.org/T308214) [15:06:03] (03PS7) 10Vgutierrez: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack) [15:06:07] (03PS1) 10Muehlenhoff: Failover 
active IDP nodes to idp1002/idp2002 [puppet] - 10https://gerrit.wikimedia.org/r/802542 (https://phabricator.wikimedia.org/T308214) [15:06:14] !log start migration to gitlab1004 - T307142 [15:06:14] (03PS3) 10Jbond: P:sretest: Test out new wmflib::resource::reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802540 [15:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:18] T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 [15:07:14] (03PS1) 10Muehlenhoff: Update spec file to use new bullseye nodes [puppet] - 10https://gerrit.wikimedia.org/r/802543 [15:10:15] (03PS4) 10Jbond: P:sretest: Test out new wmflib::resource::reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802540 [15:11:31] PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [15:11:37] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [15:11:44] ^ expected due to T307142 [15:11:45] T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 [15:12:11] (03PS5) 10Jbond: P:sretest: Test out new wmflib::resource::reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802540 [15:12:59] !log gitlab migration to new hardware in progress [15:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:55] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons. 
[15:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:29] (03PS6) 10Jbond: P:sretest: Test out new wmflib::resource::reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802540 [15:15:57] !log installing openssl security updates on stretch [15:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:30] (03PS7) 10Jbond: P:sretest: Test out new wmflib::resource::reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802540 [15:16:47] (03CR) 10Alexandros Kosiaris: "Adding marostegui for their awareness regarding m5 starting to be used." [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [15:16:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] Update cxserver to 2022-05-31-123738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [15:17:03] PROBLEM - Gitlab HTTPS SSL Expiry on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [15:17:12] (03PS8) 10Jbond: P:sretest: Test out new wmflib::resource::reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802540 [15:18:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35690/console" [puppet] - 10https://gerrit.wikimedia.org/r/802540 (owner: 10Jbond) [15:21:40] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:sretest: Test out new wmflib::resource::reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802540 (owner: 10Jbond) [15:22:43] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:23:31] !log installing cups security updates (client-side libs only) [15:23:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:33:16] (03PS1) 10Jbond: wmflib: Add debugging [puppet] - 10https://gerrit.wikimedia.org/r/802548 [15:34:02] (03PS2) 10Jbond: wmflib: Add debugging [puppet] - 10https://gerrit.wikimedia.org/r/802548 [15:34:47] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: make gitlab1004 new production instance [puppet] - 10https://gerrit.wikimedia.org/r/802150 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [15:34:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35692/console" [puppet] - 10https://gerrit.wikimedia.org/r/802548 (owner: 10Jbond) [15:39:04] (03PS3) 10Jbond: wmflib: Add debugging [puppet] - 10https://gerrit.wikimedia.org/r/802548 [15:40:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35693/console" [puppet] - 10https://gerrit.wikimedia.org/r/802548 (owner: 10Jbond) [15:41:52] (03PS4) 10Jbond: wmflib: Add debugging [puppet] - 10https://gerrit.wikimedia.org/r/802548 [15:42:03] 10SRE, 10Release-Engineering-Team, 10Scap, 10serviceops, 10Patch-For-Review: Deploy Scap version 4.8.2 - https://phabricator.wikimedia.org/T309116 (10dancy) 05Stalled→03Open [15:42:30] 10SRE, 10Release-Engineering-Team, 10Scap, 10serviceops, 10Patch-For-Review: Deploy Scap version 4.8.2 - https://phabricator.wikimedia.org/T309116 (10dancy) Fixed at tag 4.8.2. >>! In T309116#7975904, @JMeybohm wrote: > Probably missing dependencies: > ` > mwdebug1002:~$ scap pull > Traceback (most rece... 
[15:45:05] (03CR) 10Jelto: [C: 03+2] wikimedia.org: make gitlab1004 the new gitlab production host [dns] - 10https://gerrit.wikimedia.org/r/802473 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [15:45:08] (03CR) 10Dzahn: [C: 03+1] wikimedia.org: make gitlab1004 the new gitlab production host [dns] - 10https://gerrit.wikimedia.org/r/802473 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [15:46:40] (03CR) 10Herron: "LGTM overall, please see a few comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [15:47:07] (03CR) 10Jbond: [C: 03+2] wmflib: Add debugging [puppet] - 10https://gerrit.wikimedia.org/r/802548 (owner: 10Jbond) [15:49:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:49:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:49:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:50:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:12] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:33] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:50:41] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar), 10Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:50:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P29349 and previous config saved to /var/cache/conftool/dbconfig/20220602-155046-ladsgroup.json [15:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:24] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [15:56:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298560)', diff saved to https://phabricator.wikimedia.org/P29350 and previous config saved to /var/cache/conftool/dbconfig/20220602-155640-ladsgroup.json [15:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:45] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [15:57:35] RECOVERY - Gitlab HTTPS SSL Expiry on gitlab.wikimedia.org is OK: OK - Certificate gitlab.wikimedia.org will expire on Sun 14 Aug 2022 09:25:34 AM GMT +0000. https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [15:57:50] ^ yay. 
[15:57:59] RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [15:58:01] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 121722 bytes in 0.902 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [16:00:05] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220602T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:33] cloudsw* related DNS changes are currently in unmerged state [16:01:21] (03PS1) 10Jbond: wmflib: Test hack to deduplicate resources [puppet] - 10https://gerrit.wikimedia.org/r/802552 [16:01:56] gitlab just switched to dedicated hardware and is back up [16:02:08] (03PS2) 10Jbond: wmflib: Test hack to deduplicate resources [puppet] - 10https://gerrit.wikimedia.org/r/802552 [16:02:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35695/console" [puppet] - 10https://gerrit.wikimedia.org/r/802552 (owner: 10Jbond) [16:04:01] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:04:18] hi [16:04:24] here [16:04:57] here as well [16:05:22] here [16:05:25] yep [16:05:33] has someone ACKed it? 
[16:05:40] I acked it [16:05:41] looks like a blip that has already cleared [16:05:46] thanks jbond [16:05:49] and jhathaway [16:05:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P29351 and previous config saved to /var/cache/conftool/dbconfig/20220602-160550-ladsgroup.json [16:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:56] tcp.timed_out [16:06:33] doubled but seems to have dropped again? [16:07:04] more than doubled [16:07:32] topranks: went from ~15 to 125 but has normalised [16:07:55] yeah [16:08:09] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib: Test hack to deduplicate resources [puppet] - 10https://gerrit.wikimedia.org/r/802552 (owner: 10Jbond) [16:08:14] we are still in gitlab migration but maintenance window ends now [16:08:16] traffic dropped in esams https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&viewPanel=2&var-site=All&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4 [16:09:01] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:10:41] I've noticed some increases in RTTs in my smokeping graphs for bastion hosts too. Now normalizing again too. [16:10:50] a whole bunch of tls.cert_name_invalid's around the time it peaked [16:10:53] hmm of [16:11:06] topranks: wut? /o\ [16:11:08] there was a small peak from Russia as well [16:11:29] https://intake-analytics.wikimedia.org/v1/events?hasty=true 3,332 in the last 30mins [16:11:31] interestingly, even 1.1.1.1 RTTs quadrupled [16:11:32] censorship test???
sukhe [16:11:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P29352 and previous config saved to /var/cache/conftool/dbconfig/20220602-161145-ladsgroup.json [16:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:30] jbond: yeah, possibly. at least the RU spike and esams traffic drop... [16:12:52] hard to say though without more substantive data, and also the recovery [16:13:19] ack [16:14:07] (03PS2) 10Dzahn: backup: switch fileset for gitlab from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/800357 (https://phabricator.wikimedia.org/T274463) [16:14:58] (03PS1) 10Jbond: sretest: stop realising resources whil we fix up names [puppet] - 10https://gerrit.wikimedia.org/r/802557 [16:15:28] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: restart to enable S3 plugin - bking@cumin1001 - T309720 [16:15:32] (03CR) 10Jbond: [V: 03+2 C: 03+2] sretest: stop realising resources whil we fix up names [puppet] - 10https://gerrit.wikimedia.org/r/802557 (owner: 10Jbond) [16:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:33] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720 [16:17:10] (03PS1) 10Jbond: sretest: test reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802558 [16:19:00] (03CR) 10Jelto: [C: 03+2] backup: switch fileset for gitlab from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/800357 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [16:19:14] !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons.
[16:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:53] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:20:23] (03CR) 10Jbond: [C: 03+2] sretest: test reduce function [puppet] - 10https://gerrit.wikimedia.org/r/802558 (owner: 10Jbond) [16:20:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P29353 and previous config saved to /var/cache/conftool/dbconfig/20220602-162053-ladsgroup.json [16:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:05] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:24:06] (03Abandoned) 10Dzahn: gitlab::dump: backup files on gitlab1004 in Bacula [puppet] - 10https://gerrit.wikimedia.org/r/800358 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [16:26:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P29354 and previous config saved to /var/cache/conftool/dbconfig/20220602-162653-ladsgroup.json [16:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:33:33] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: restart to enable S3 plugin - bking@cumin1001 - T309720 
[16:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:38] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720 [16:38:51] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:39:36] (03PS1) 10Zabe: raid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013) [16:39:38] (03PS1) 10Zabe: rabbitmq: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802565 (https://phabricator.wikimedia.org/T308013) [16:39:40] (03PS1) 10Zabe: query_service: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) [16:39:44] (03PS1) 10Zabe: puppet_stastd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802567 (https://phabricator.wikimedia.org/T308013) [16:39:46] (03PS1) 10Zabe: presto: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802568 (https://phabricator.wikimedia.org/T308013) [16:39:48] (03PS1) 10Zabe: poolcounter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802569 (https://phabricator.wikimedia.org/T308013) [16:39:50] (03PS1) 10Zabe: pontoon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802570 (https://phabricator.wikimedia.org/T308013) [16:39:52] (03PS1) 10Zabe: nftables: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802571 (https://phabricator.wikimedia.org/T308013) [16:39:54] (03PS1) 10Zabe: netops: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802572 (https://phabricator.wikimedia.org/T308013) [16:41:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298560)', diff saved to https://phabricator.wikimedia.org/P29355 and previous config saved to /var/cache/conftool/dbconfig/20220602-164158-ladsgroup.json [16:42:02] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:03] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [16:42:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:42:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 12 hosts with reason: Maintenance [16:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 12 hosts with reason: Maintenance [16:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:15] (03CR) 10Dzahn: [C: 03+2] "thanks! https://puppet-compiler.wmflabs.org/pcc-worker1003/35696/" [puppet] - 10https://gerrit.wikimedia.org/r/791673 (owner: 10Dzahn) [16:43:29] deleting expired globalsign certs [16:44:56] (03CR) 10Dzahn: [C: 03+2] "btw, there are also keys in [puppetmaster1001:/srv] $ find . | grep globalsign" [puppet] - 10https://gerrit.wikimedia.org/r/791673 (owner: 10Dzahn) [16:47:14] 10SRE, 10ops-eqiad, 10DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (10wiki_willy) Thanks @Cmjohnson, I've emailed the list to Sipi for a quote. Once we receive that, I'll create a Coupa request, then we can schedule the pickup. After the vendor picks up all the equipment a... [16:47:35] (03CR) 10Dzahn: [C: 03+2] "thanks! 
https://puppet-compiler.wmflabs.org/pcc-worker1003/35697/" [puppet] - 10https://gerrit.wikimedia.org/r/791678 (owner: 10Dzahn) [16:47:41] (03PS2) 10Dzahn: delete expired digicert certs [puppet] - 10https://gerrit.wikimedia.org/r/791678 [16:47:55] !log deleting expired globalsign and digicert TLS certificates [16:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:09] (MXQueueHigh) firing: MX host mx1001:9100 has many queued messages: 4048 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [16:48:30] sigh [16:48:44] oof [16:49:00] well, the new mx queue alert works [16:49:12] hello [16:49:13] all the same 2 addresses it seems [16:49:34] no-reply@phabricator mass action? [16:49:42] here as well [16:49:55] I have ACKed it [16:50:02] sukhe: thanks [16:51:25] taking to private [16:51:48] (03PS10) 10David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 [16:51:50] (03CR) 10David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [16:55:13] (03PS7) 10Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) [16:56:11] (03CR) 10CI reject: [V: 04-1] docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [16:58:09] (MXQueueHigh) resolved: MX host mx1001:9100 has many queued messages: 4052 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [16:58:31] (03PS8) 10Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT 
[puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) [16:59:38] !log mx1001 - deleted certain mails from the mail queue - reacting to mx alert [16:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:21] (03CR) 10Krinkle: [C: 04-1] GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: 10Kosta Harlan) [17:01:26] (03PS2) 10Krinkle: GrowthExperiments: Remove unused GEHomepageSuggestedEditsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791302 (https://phabricator.wikimedia.org/T308208) (owner: 10Kosta Harlan) [17:01:30] (03PS2) 10Krinkle: GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: 10Kosta Harlan) [17:01:32] (03CR) 10Krinkle: [C: 03+1] GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: 10Kosta Harlan) [17:04:33] (03CR) 10Herron: wmcs: Added taskircmail, ircmail and pagetaskircmail routings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [17:09:27] !log restart logstash on apifeatureusage hosts [17:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:24] !log rolling restart of codfw logstash cluster [17:11:26] (03PS1) 10Mabualruz: Remove 6 deprecated ResourceLoader skin modules in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802578 (https://phabricator.wikimedia.org/T304322) [17:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:54] 10SRE, 10ops-eqsin: Decommission cr1-eqsin - https://phabricator.wikimedia.org/T256947 (10RobH) 
05Stalled→03Resolved a:03RobH we have decom equipment in the rack there, but we can remove this from open tasks. It'll stay in netbox until it goes away, but this task can be closed imo. [17:12:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:12:32] (03CR) 10Dzahn: "worth switching it if ldap-corp goes away anyways? Does it still go away?" [puppet] - 10https://gerrit.wikimedia.org/r/791677 (owner: 10Dzahn) [17:14:43] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:14:53] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:15:07] ^ the uncommitted changes are "cloudsw" [17:15:23] not sure if it means netops or wmcs but one of those [17:18:12] (03PS1) 10Dzahn: sre: update renamed otrs role to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802579 (https://phabricator.wikimedia.org/T293942) [17:19:38] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons.
[17:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:24] (03CR) 10Ahmon Dancy: docker_registry_ha: Authorize GitLab trusted runners using JWT (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [17:21:30] (03PS1) 10Dzahn: vrts: adjust tests files to renamed role class [puppet] - 10https://gerrit.wikimedia.org/r/802580 (https://phabricator.wikimedia.org/T293942) [17:23:26] (03CR) 10Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [17:25:36] (03CR) 10Ahmon Dancy: docker_registry_ha: Authorize GitLab trusted runners using JWT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [17:26:33] (03CR) 10Ahmon Dancy: [C: 03+1] docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [17:26:36] (03CR) 10Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [17:31:48] 10SRE, 10ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (10Cmjohnson) Chatted with @Marostegui and we are planning downtime for tomorrow 3 June [17:33:18] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) This is turning into a pain in the ass, HPE is using some 3rd party company that I've never heard of to do these installs, they never contacted me and then closed the ticket... 
[17:39:15] !log rolling restart of eqiad logstash cluster [17:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:46] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: restart to enable S3 plugin - bking@cumin1001 - T309720 [17:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:50] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720 [17:40:03] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:48:15] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:05] jeena and dancy: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220602T1800). [18:04:07] !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. 
[18:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:27] (03PS1) 10Jeena Huneidi: all wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802583 [18:04:29] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802583 (owner: 10Jeena Huneidi) [18:05:18] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.14 refs T308067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802583 (owner: 10Jeena Huneidi) [18:08:50] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.14 refs T308067 [18:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:54] T308067: 1.39.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T308067 [18:10:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:46] 10SRE, 10Beta-Cluster-Infrastructure, 10Scap, 10serviceops, 10Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (10dancy) >>! In T237033#7975492, @Krinkle wrote: > @thcipriani @dancy I believe the equivalent of the `beta-sc... 
[18:14:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [18:14:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [18:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T60674)', diff saved to https://phabricator.wikimedia.org/P29356 and previous config saved to /var/cache/conftool/dbconfig/20220602-181434-ladsgroup.json [18:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:37] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:14:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:15:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:49] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:15:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T60674)', diff saved to https://phabricator.wikimedia.org/P29357 and previous config saved to /var/cache/conftool/dbconfig/20220602-182145-ladsgroup.json [18:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:50] T60674: Drop page.page_restrictions 
column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:22:15] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:24:23] (PS1) Jbond: wmflib: add import/export functions [puppet] - https://gerrit.wikimedia.org/r/802585 [18:28:45] (CR) CI reject: [V: -1] wmflib: add import/export functions [puppet] - https://gerrit.wikimedia.org/r/802585 (owner: Jbond) [18:30:57] (PS2) Jbond: wmflib: add import/export functions [puppet] - https://gerrit.wikimedia.org/r/802585 [18:35:57] (CR) Jbond: [C: +2] wmflib: add import/export functions [puppet] - https://gerrit.wikimedia.org/r/802585 (owner: Jbond) [18:36:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29358 and previous config saved to /var/cache/conftool/dbconfig/20220602-183650-ladsgroup.json [18:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:04] (PS1) Jbond: P:sretest: test new wmflib::import/export functions [puppet] - https://gerrit.wikimedia.org/r/802587 [18:43:12] !log bking@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [18:43:12] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=99) [18:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:29] PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [18:43:39] oh oh [18:45:33] (PS2) Jbond: P:sretest: test new wmflib::import/export functions [puppet] -
https://gerrit.wikimedia.org/r/802587 [18:47:40] (PS3) Jbond: P:sretest: test new wmflib::import/export functions [puppet] - https://gerrit.wikimedia.org/r/802587 [18:48:30] jhathaway: any ongoing work on mx1001? [18:48:31] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35700/console" [puppet] - https://gerrit.wikimedia.org/r/802587 (owner: Jbond) [18:48:39] asking because puppet is disabled [18:48:48] (PS1) Bking: elastic: add write_queue_datacenters option [cookbooks] - https://gerrit.wikimedia.org/r/802588 [18:49:00] sukhe: yes, is something wrong? [18:49:06] testing an exim patch [18:49:10] oh I see [18:49:12] yeah, got an alert [18:49:13] 14:43:30 <+icinga-wm> PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [18:49:25] oooh, thanks, let me take a look [18:49:29] thanks <3 [18:51:06] (PS2) Ryan Kemper: elastic: add write_queue_datacenters option [cookbooks] - https://gerrit.wikimedia.org/r/802588 (owner: Bking) [18:51:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29359 and previous config saved to /var/cache/conftool/dbconfig/20220602-185155-ladsgroup.json [18:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:16] (PS3) Bking: elastic: add write_queue_datacenters option [cookbooks] - https://gerrit.wikimedia.org/r/802588 [18:54:21] (CR) Jdlrobson: [C: -1] "I think they are used by the performance team. If they are still relevant they will need to regenerate these."
[mediawiki-config] - https://gerrit.wikimedia.org/r/802578 (https://phabricator.wikimedia.org/T304322) (owner: Mabualruz) [18:54:23] (CR) Jbond: [V: +1 C: +2] P:sretest: test new wmflib::import/export functions [puppet] - https://gerrit.wikimedia.org/r/802587 (owner: Jbond) [18:56:59] (PS1) Jbond: wmflib::resource:export: export the resource not the title [puppet] - https://gerrit.wikimedia.org/r/802591 [18:57:01] (CR) CI reject: [V: -1] elastic: add write_queue_datacenters option [cookbooks] - https://gerrit.wikimedia.org/r/802588 (owner: Bking) [18:59:16] (PS2) D3r1ck01: Use a service locator to get a job runner [mediawiki-config] - https://gerrit.wikimedia.org/r/793837 [19:01:52] (CR) Jbond: [C: +2] wmflib::resource:export: export the resource not the title [puppet] - https://gerrit.wikimedia.org/r/802591 (owner: Jbond) [19:02:46] (PS4) Bking: elastic: add write_queue_datacenters option [cookbooks] - https://gerrit.wikimedia.org/r/802588 [19:05:57] (CR) Gehel: [C: +2] elastic: add write_queue_datacenters option [cookbooks] - https://gerrit.wikimedia.org/r/802588 (owner: Bking) [19:07:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T60674)', diff saved to https://phabricator.wikimedia.org/P29360 and previous config saved to /var/cache/conftool/dbconfig/20220602-190701-ladsgroup.json [19:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:04] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [19:07:08] !log bking@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [19:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:12] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [19:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:26]
!log T305646 T308647 Unbanned `elastic2033` and `elastic2054` from clusters; also pooled `elastic2033` [19:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:31] T305646: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 [19:08:32] T308647: elastic2054 is having H/W issues - https://phabricator.wikimedia.org/T308647 [19:10:02] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: restart to enable S3 plugin - bking@cumin1001 - T309720 [19:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:07] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720 [19:20:11] (PS1) Andrea Denisse: Add role::netmon for netmon1003 [puppet] - https://gerrit.wikimedia.org/r/802593 [19:22:57] PROBLEM - Check systemd state on elastic2050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:10] (CR) Herron: Add role::netmon for netmon1003 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/802593 (owner: Andrea Denisse) [19:32:50] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (phaultfinder) [19:32:55] (PS2) Andrea Denisse: Add role::netmon for netmon1003 [puppet] - https://gerrit.wikimedia.org/r/802593 [19:36:16] (PS1) Huji: Add tfj as a shortcut for toolforge-jobs command [puppet] - https://gerrit.wikimedia.org/r/802596 (https://phabricator.wikimedia.org/T309308) [19:36:28] (PS3) Andrea Denisse: Add role::netmon for netmon1003 [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) [19:36:31] RECOVERY - Check systemd state on elastic2050 is OK:
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:47] (CR) AOkoth: [C: +1] "Thanks. I missed these." [puppet] - https://gerrit.wikimedia.org/r/802580 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn) [19:37:28] (CR) CI reject: [V: -1] Add tfj as a shortcut for toolforge-jobs command [puppet] - https://gerrit.wikimedia.org/r/802596 (https://phabricator.wikimedia.org/T309308) (owner: Huji) [19:37:39] (CR) CI reject: [V: -1] Add role::netmon for netmon1003 [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse) [19:38:50] (PS2) Huji: Add tfj as a shortcut for toolforge-jobs command [puppet] - https://gerrit.wikimedia.org/r/802596 (https://phabricator.wikimedia.org/T309308) [19:44:09] (PS4) Andrea Denisse: Add role::netmon for netmon1003 [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) [19:45:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:45:04] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. [19:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:08] (CR) CI reject: [V: -1] Add role::netmon for netmon1003 [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse) [19:46:48] (PS5) Andrea Denisse: Add role::netmon to the netmon1003 instance.
[puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) [19:49:25] (PS1) Milimetric: Split up the tables we sqoop [puppet] - https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) [19:50:39] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:50:53] (PS1) Andrew Bogott: hwraid-2dev.cfg: experiment with vg_name setting [puppet] - https://gerrit.wikimedia.org/r/802599 [19:50:55] (PS1) Andrew Bogott: put clouddumps100[12] into service [puppet] - https://gerrit.wikimedia.org/r/802600 (https://phabricator.wikimedia.org/T309346) [19:52:26] (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg: experiment with vg_name setting [puppet] - https://gerrit.wikimedia.org/r/802599 (owner: Andrew Bogott) [19:52:43] (CR) CI reject: [V: -1] Split up the tables we sqoop [puppet] - https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: Milimetric) [19:53:04] !log T294805 Marked `elastic10[68-83]` as Active in netbox (all except `elastic10[77,80]` were erroneously marked as `Staged`) [19:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:10] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [19:53:45] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [19:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:55] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[19:54:07] (CR) Andrea Denisse: Add role::netmon to the netmon1003 instance. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse) [19:56:31] (PS2) Bartosz Dziewoński: Make new topic tool available as opt-out almost everywhere (phase 3) [mediawiki-config] - https://gerrit.wikimedia.org/r/801820 (https://phabricator.wikimedia.org/T309368) [19:56:31] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:00:05] brennen: That opportune time is upon us again. Time for a UTC late backport and config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220602T2000). [20:03:05] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (Cmjohnson) [20:04:35] (PS1) Eevans: WIP: Configure AQS Cassandra hosts [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) [20:04:45] SRE, Traffic: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (BCornwall) a:BCornwall [20:04:47] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:24] (PS1) Andrew Bogott: wmcs-image-create: Use openstack cli for creating new glance image [puppet] - https://gerrit.wikimedia.org/r/802605 [20:05:57] (CR) Eevans: [C: -1] "This is not ready to be merged."
[puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: Eevans) [20:07:14] !log no patches and no new trainees; closing utc late backport & config window [20:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host backup1009.mgmt.eqiad.wmnet with reboot policy FORCED [20:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:28] (PS2) Andrew Bogott: put clouddumps100[12] into service [puppet] - https://gerrit.wikimedia.org/r/802600 (https://phabricator.wikimedia.org/T309346) [20:12:30] (PS1) Andrew Bogott: hwraid-2dev.cfg: one more attempt with vg_name [puppet] - https://gerrit.wikimedia.org/r/802627 [20:13:23] (CR) Herron: "Question for netops -- Do we risk any side effects deploying this in parallel to netmon1002? Is there anything that should be silenced/di" [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse) [20:14:02] (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg: one more attempt with vg_name [puppet] - https://gerrit.wikimedia.org/r/802627 (owner: Andrew Bogott) [20:14:23] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host clouddumps1001.wikimedia.org with OS bullseye [20:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:32] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[20:14:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [20:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:55] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye [20:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:56] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w... [20:15:03] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[20:15:53] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:16:23] !log T306449 Marked `elastic1097` as `Staged` in Netbox (was previously failed, but fixed in https://phabricator.wikimedia.org/T306449#7888260) [20:16:25] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [20:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:27] T306449: hw troubleshooting: memory error for elastic1097 - https://phabricator.wikimedia.org/T306449 [20:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:36] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[20:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:26:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1009.mgmt.eqiad.wmnet with reboot policy FORCED [20:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:49] (PS1) Jbond: wmflib: drop rrsource::reduce and add specs for resource::import [puppet] - https://gerrit.wikimedia.org/r/802629 [20:28:56] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (Jclark-ctr) a:Jclark-ctr→Cmjohnson stat1009 B1 U17 cableid 1181 port 5 [20:29:12] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (Jclark-ctr) [20:31:09] (PS1) Samtar: gitignore: add vscode [puppet] - https://gerrit.wikimedia.org/r/802630 [20:35:26] SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudnet1004 - https://phabricator.wikimedia.org/T309576 (nskaggs) Yes, I agree. Let's focus on bringing the new machines online. [20:36:32] (CR) Majavah: [C: -1] "I don't think this should be added here - different people use different editors, so instead of every single project having the editors of" [puppet] - https://gerrit.wikimedia.org/r/802630 (owner: Samtar) [20:37:19] I didn't know that was a thing!
[20:37:49] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Jclark-ctr) [20:37:57] (CR) Samtar: gitignore: add vscode (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802630 (owner: Samtar) [20:38:13] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: re-label cloudstore101[01] to clouddumps100[12] - https://phabricator.wikimedia.org/T309338 (Jclark-ctr) Open→Resolved Relabeled Servers [20:38:20] (Abandoned) Samtar: gitignore: add vscode [puppet] - https://gerrit.wikimedia.org/r/802630 (owner: Samtar) [20:42:33] (PS2) Jbond: wmflib: drop rrsource::reduce and add specs for resource::import [puppet] - https://gerrit.wikimedia.org/r/802629 [20:43:23] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (nskaggs) @cmooney Let's arrange to move some machines so we can have more optimal routing. @dcaro, do you think it would be easier to move a ceph...
[20:43:27] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:44:13] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (nskaggs) [20:44:25] (PS1) Eevans: Dummy keys and certificates for cassandra (aqs) [labs/private] - https://gerrit.wikimedia.org/r/802631 (https://phabricator.wikimedia.org/T307801) [20:45:34] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye [20:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:43] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[20:47:00] (PS3) Jbond: wmflib: drop rsource::reduce and add specs for resource::import [puppet] - https://gerrit.wikimedia.org/r/802629 [20:47:45] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35708/console" [puppet] - https://gerrit.wikimedia.org/r/802629 (owner: Jbond) [20:50:49] (CR) Jbond: [V: +1 C: +2] wmflib: drop rsource::reduce and add specs for resource::import [puppet] - https://gerrit.wikimedia.org/r/802629 (owner: Jbond) [20:51:00] (CR) Jbond: [V: +2 C: +2] wmflib: drop rsource::reduce and add specs for resource::import [puppet] - https://gerrit.wikimedia.org/r/802629 (owner: Jbond) [20:54:26] (PS1) Jbond: P:sretest: Add merge parameter [puppet] - https://gerrit.wikimedia.org/r/802633 [20:55:26] (CR) Jbond: [C: +2] P:sretest: Add merge parameter [puppet] - https://gerrit.wikimedia.org/r/802633 (owner: Jbond) [20:55:30] (CR) Jbond: [V: +2 C: +2] P:sretest: Add merge parameter [puppet] - https://gerrit.wikimedia.org/r/802633 (owner: Jbond) [20:56:40] SRE, Traffic, SRE Observability (FY2021/2022-Q4), User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (BCornwall) a:BCornwall [20:57:47] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:58:57] (CR) JHathaway: [C: +2] mx: enable tainted data checking [puppet] - https://gerrit.wikimedia.org/r/801799 (https://phabricator.wikimedia.org/T286911) (owner: JHathaway) [20:59:19] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:59:51] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS
bullseye [20:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:56] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (Cmjohnson) [21:00:04] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w... [21:00:16] (CR) Nskaggs: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/801336 (https://phabricator.wikimedia.org/T309342) (owner: David Caro) [21:02:43] (PS1) Cmjohnson: Adding backup1009 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/802636 (https://phabricator.wikimedia.org/T307048) [21:03:05] (CR) CI reject: [V: -1] Adding backup1009 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/802636 (https://phabricator.wikimedia.org/T307048) (owner: Cmjohnson) [21:06:59] (Abandoned) Cmjohnson: Adding backup1009 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/802636 (https://phabricator.wikimedia.org/T307048) (owner: Cmjohnson) [21:09:34] (PS1) Zabe: Stop writing to cuc_actor on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/802637 (https://phabricator.wikimedia.org/T233004) [21:10:15] (PS1) Cmjohnson: Adding backup1009 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/802638 (https://phabricator.wikimedia.org/T307048) [21:11:27] (CR) Cmjohnson: [C: +2] Adding backup1009 to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/802638 (https://phabricator.wikimedia.org/T307048) (owner: Cmjohnson) [21:11:40] Does anyone have time to quickly deploy
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/802637/ before the week ends? [21:11:47] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [21:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:27] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (Cmjohnson) [21:15:05] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [21:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:17] (CR) Nskaggs: "Thank you for including a link / runbook as well!" [alerts] - https://gerrit.wikimedia.org/r/802442 (https://phabricator.wikimedia.org/T302377) (owner: Majavah) [21:19:15] zabe: Fine, let's do it. [21:19:24] (CR) Jforrester: [C: +2] Stop writing to cuc_actor on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/802637 (https://phabricator.wikimedia.org/T233004) (owner: Zabe) [21:20:00] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 55426 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [21:21:04] (Merged) jenkins-bot: Stop writing to cuc_actor on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/802637 (https://phabricator.wikimedia.org/T233004) (owner: Zabe) [21:24:40] zabe: Done. [21:25:08] James_F, thanks :) [21:25:31] Though scap seems to have got stuck on the PHP restart step?
[21:25:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1009.eqiad.wmnet with OS bullseye [21:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:48] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1009.eqiad.wmn... [21:25:57] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Emergency deploy: [[gerrit:802637|Stop writing to cuc_actor on all wikis (T233004 T309737)]] (duration: 03m 15s) [21:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:01] T309737: CannotCreateActorException: Cannot create an actor for a usable name that is not an existing user: user_name="Qwqqwqq" - https://phabricator.wikimedia.org/T309737 [21:26:01] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:26:03] Finally. [21:26:23] "Finished php-fpm-restarts (duration: 02m 36s)" eesh. [21:27:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:56] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye [21:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:07] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with... [21:28:23] Bank holidays are confusing.
My head wants to think deploys are happening on a Saturday. [21:28:42] RhinosF1: I mean, they are; it's the first day of a weekend somewhere. [21:28:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:28:55] But also the clue's in the term "emergency deploy:" ;-) [21:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:30] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (nskaggs) Looping in @Andrew. @Kelson note that yes, we are installing new, more capable machines that... [21:29:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:59] (CR) Dduvall: "This change is ready for review."
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [21:31:32] RECOVERY - Disk space on mx1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [21:32:31] (03PS4) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) [21:33:30] (03CR) 10CI reject: [V: 04-1] Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [21:34:59] (03PS5) 10Dduvall: Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) [21:39:57] (03PS1) 10RLazarus: slo: Correct queries for error budget remaining [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842) [21:40:30] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Vladis13) >>! In T275319#6884320, @cscott wrote: > database storage size, database column limits, etc, all scale with bytes not characters. We s... 
[21:44:00] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:44:30] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:45:11] (PS2) RLazarus: slo: Correct queries for error budget remaining [grafana-grizzly] - https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842)
[21:52:45] (PS3) Andrew Bogott: put clouddumps100[12] into service [puppet] - https://gerrit.wikimedia.org/r/802600 (https://phabricator.wikimedia.org/T309346)
[21:52:47] (PS1) Andrew Bogott: hwraid-2dev.cfg: further attempts [puppet] - https://gerrit.wikimedia.org/r/802649
[21:54:06] jouncebot nowandnext
[21:54:07] No deployments scheduled for the next 9 hour(s) and 5 minute(s)
[21:54:07] In 9 hour(s) and 5 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220603T0700)
[21:54:10] (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg: further attempts [puppet] - https://gerrit.wikimedia.org/r/802649 (owner: Andrew Bogott)
[21:54:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[21:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:02] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[21:55:35] (PS3) RLazarus: slo: Correct queries for error budget remaining [grafana-grizzly] - https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842)
[22:03:38] (PS4) Andrew Bogott: put clouddumps100[12] into service [puppet] - https://gerrit.wikimedia.org/r/802600 (https://phabricator.wikimedia.org/T309346)
[22:03:40] (PS1) Andrew Bogott: hwraid-2dev.cfg: etc [puppet] - https://gerrit.wikimedia.org/r/802652
[22:05:12] (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg: etc [puppet] - https://gerrit.wikimedia.org/r/802652 (owner: Andrew Bogott)
[22:08:10] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host clouddumps1001.wikimedia.org with OS bullseye
[22:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:20] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[22:08:21] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[22:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:28] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[22:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:34] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[22:08:40] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[22:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:44] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[22:08:51] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[22:22:23] (Abandoned) Thcipriani: Beta: Clean puppet cherry-picks [puppet] - https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: Thcipriani)
[22:23:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[22:23:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[22:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T298560)', diff saved to https://phabricator.wikimedia.org/P29363 and previous config saved to /var/cache/conftool/dbconfig/20220602-222306-ladsgroup.json
[22:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:10] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[22:28:35] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:28:45] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:29:20] SRE, ops-ulsfo, Traffic, decommission-hardware: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (RobH)
[22:29:48] (PS1) Andrew Bogott: hwraid-2dev.cfg: further tiny tweaks [puppet] - https://gerrit.wikimedia.org/r/802656
[22:30:10] SRE, ops-ulsfo, Traffic: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (RobH)
[22:30:25] SRE, ops-ulsfo, Traffic, decommission-hardware: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (RobH) Open→Resolved As this host is in a caching site, we have no out of rack storage. It will simply sit powered down in the rack until ulsfo is refreshed and it is replaced...
[22:31:05] (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg: further tiny tweaks [puppet] - https://gerrit.wikimedia.org/r/802656 (owner: Andrew Bogott)
[22:31:24] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[22:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:34] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[22:31:40] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[22:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:50] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[22:33:08] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye
[22:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:20] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[22:43:03] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[22:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:12] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[22:50:58] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye
[22:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:09] (PS1) Andrew Bogott: hwraid-2dev.cfg: one last stab before I give up for the day [puppet] - https://gerrit.wikimedia.org/r/802661
[22:51:12] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[22:52:40] (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg: one last stab before I give up for the day [puppet] - https://gerrit.wikimedia.org/r/802661 (owner: Andrew Bogott)
[22:53:14] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[22:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:24] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[23:07:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[23:14:23] (PS12) Tim Starling: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[23:14:26] (PS1) Brion VIBBER: Disable older WebM VP8 transcodes except 360p [mediawiki-config] - https://gerrit.wikimedia.org/r/802665 (https://phabricator.wikimedia.org/T309823)
[23:14:38] (PS4) Tim Starling: Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129)
[23:22:12] (CR) Tim Starling: [C: +2] Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[23:22:22] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[23:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:37] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[23:22:56] (Merged) jenkins-bot: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz)
[23:25:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:26:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:16] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: Add db-mainstash g 752807 (duration: 03m 24s)
[23:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:48] (PS1) Andrew Bogott: clouddumps: try a different partman recipe [puppet] - https://gerrit.wikimedia.org/r/802666
[23:29:49] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:29:52] (CR) Andrew Bogott: [C: +2] clouddumps: try a different partman recipe [puppet] - https://gerrit.wikimedia.org/r/802666 (owner: Andrew Bogott)
[23:30:24] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[23:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:34] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[23:37:52] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (phaultfinder)
[23:42:24] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage
[23:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage
[23:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:52:29] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (Jclark-ctr)
[23:52:50] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:56:15] (PS1) Andrew Bogott: clouddumps100x: yet another partman recipe [puppet] - https://gerrit.wikimedia.org/r/802667
[23:56:45] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host clouddumps1001.wikimedia.org with OS bullseye
[23:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:55] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[23:58:14] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[23:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:24] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...