[00:00:05] RoanKattouw and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220113T0000).
[00:00:05] Jdlrobson and samwilson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:21] present :)
[00:00:27] Hullo, I'm here.
[00:00:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:00:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:04] SRE, SRE-OnFire, Sustainability (Incident Followup): Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (Platonides) I would have expected the wikitech timeline to contain a final entry for "mx2001" back into (i.e. T297128) Plus, it doesn't list what the l...
[00:03:22] Is there a backporter available for @samwilson and I?
[00:03:34] I can do it
[00:03:54] thanks!
[00:04:27] (PS2) Catrope: Enable Disambiguator notifications on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/753584 (https://phabricator.wikimedia.org/T293319) (owner: Samwilson)
[00:04:31] (CR) Catrope: [C: +2] Enable Disambiguator notifications on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/753584 (https://phabricator.wikimedia.org/T293319) (owner: Samwilson)
[00:05:02] Thanks @RoanKattouw
[00:05:40] (Merged) jenkins-bot: Enable Disambiguator notifications on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/753584 (https://phabricator.wikimedia.org/T293319) (owner: Samwilson)
[00:06:29] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:50] samwilson: Your patch is on mwdebug1002, please test
[00:08:07] thanks. testing now.
[00:09:46] RoanKattouw: yep, all is good, go for it.
[00:10:30] (CR) Catrope: [C: +2] Skip vector-2022 skin in config, not Vector skin [mediawiki-config] - https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) (owner: Jdlrobson)
[00:11:31] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:753584|Enable Disambiguator notifications on all wikis (T293319)]] (duration: 01m 28s)
[00:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:35] T293319: Rollout plan for disambiguation notifications (wgDisambiguatorNotifications) - https://phabricator.wikimedia.org/T293319
[00:11:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:05] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:14:55] (PS6) Catrope: Skip vector-2022 skin in config, not Vector skin [mediawiki-config] - https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) (owner: Jdlrobson)
[00:15:02] (CR) Catrope: [C: +2] Skip vector-2022 skin in config, not Vector skin [mediawiki-config] - https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) (owner: Jdlrobson)
[00:16:15] (Merged) jenkins-bot: Skip vector-2022 skin in config, not Vector skin [mediawiki-config] - https://gerrit.wikimedia.org/r/752760 (https://phabricator.wikimedia.org/T298923) (owner: Jdlrobson)
[00:17:45] yipee
[00:18:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:18:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:37] Jdlrobson: Your first patch (vector-2022) is on mwdebug1002, please test
[00:19:52] (it should also automatically be deployed on beta some time in the next 15ish minutes)
[00:20:11] (CR) Krinkle: "recheck" [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753498 (https://phabricator.wikimedia.org/T299095) (owner: Tim Starling)
[00:20:19] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[00:20:55] RoanKattouw: looking
[00:22:17] RoanKattouw: LGTM in production sync away
[00:23:07] (PS3) Catrope: Enable CirrusSearch on it/en Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/752751 (owner: Jdlrobson)
[00:23:53] (CR) Catrope: [C: +2] Enable CirrusSearch on it/en Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/752751 (owner: Jdlrobson)
[00:24:17] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:752760|Skip vector-2022 skin in config, not Vector skin (T298923)]] (duration: 01m 29s)
[00:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:23] T298923: Hide new Vector via configuration rather than hardcoded in Vector skin - https://phabricator.wikimedia.org/T298923
[00:24:59] (Merged) jenkins-bot: Enable CirrusSearch on it/en Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/752751 (owner: Jdlrobson)
[00:25:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:16] Jdlrobson: Your other patch is on mwdebug1002, please test
[00:29:47] SRE, ops-eqiad, DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (Jclark-ctr) @ayounsi corrected et-0/1/0 Rolled fiber. has link. i still have a few ends i need to do on cables to finish up msw in racks. i had only put those connections as temp. but should have finis...
[00:29:58] RoanKattouw: testing
[00:30:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:16] RoanKattouw: works!
[00:30:21] RoanKattouw: please sync :)
[00:30:35] Syncing
[00:31:59] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:752751|Enable CirrusSearch on it/en Wikivoyage]] (duration: 01m 28s)
[00:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:34:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[00:36:24] Thanks RoanKattouw
[00:36:44] beta cluster working too :)
[00:41:05] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:45:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[00:51:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:15] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:27] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:00:05] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220113T0100).
[01:29:45] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:30:18] Hi, I can't seem to revert a caption edit in Commons, it is returning this internal error: [fbdf20a3-78d0-45ef-90e7-3e2b3dbcac14] 2022-01-13 01:29:02: Fatal exception of type "LogicException"
[01:30:42] This is what I'm trying to revert: https://commons.wikimedia.org/w/index.php?title=File:Editwar.png&diff=621282184&oldid=616178398
[01:33:42] btw when I said "caption edit", I mean structured data
[01:34:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[01:40:54] job: if you're comfortable doing so, would you mind opening a task in Phabricator with that information?
[01:41:07] legoktm: Did it now - https://phabricator.wikimedia.org/T299111
[01:41:31] thanks! I'm looking up the full stacktrace now
[01:42:05] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:42:19] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:44:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[01:51:59] (PS2) Andrew Bogott: nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800)
[01:55:23] (CR) jerkins-bot: [V: -1] nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[02:00:11] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:58] (PS3) Andrew Bogott: nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800)
[02:03:51] (CR) jerkins-bot: [V: -1] nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[02:06:03] (PS4) Andrew Bogott: nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800)
[02:07:01] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:41] (PS1) Tim Starling: maintenance: Add --batch-size to sql.php [core] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/753499 (https://phabricator.wikimedia.org/T299095)
[02:08:51] (CR) Tim Starling: [C: +2] maintenance: Add --batch-size to sql.php [core] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/753499 (https://phabricator.wikimedia.org/T299095) (owner: Tim Starling)
[02:09:14] (CR) jerkins-bot: [V: -1] nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[02:10:46] (PS5) Andrew Bogott: nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800)
[02:19:49] SRE, Infrastructure-Foundations, Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (Legoktm)
[02:26:28] (Merged) jenkins-bot: maintenance: Add --batch-size to sql.php [core] (wmf/1.38.0-wmf.16) - https://gerrit.wikimedia.org/r/753499 (https://phabricator.wikimedia.org/T299095) (owner: Tim Starling)
[02:27:19] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - enwiki_content_1617305154[9](2022-01-09T22:34:46.086Z), enwiki_content_1617305154[10](2022-01-09T22:34:46.216Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:29:47] !log tstarling@deploy1002 Synchronized php-1.38.0-wmf.16/maintenance/sql.php: batch size (duration: 01m 28s)
[02:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:02] !log on mwmaint1002: inserting 4221344 rows into commonswiki.pagelinks to clean up from T299095
[02:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:05] T299095: Links tables corrupted due to incorrectly parenthesized delete queries - https://phabricator.wikimedia.org/T299095
[02:31:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:32:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[02:33:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[02:43:27] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:48:13] (PS6) Andrew Bogott: nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800)
[03:33:10] !log on mwmaint1002: inserting 1714288 into wikidatawiki.pagelinks for T299095
[03:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:33:14] T299095: Links tables corrupted due to incorrectly parenthesized delete queries - https://phabricator.wikimedia.org/T299095
[03:33:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[03:38:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[03:40:39] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:30:50] !log on mwmaint1002: inserting 11565 rows into itwiki.pagelinks for T299095
[04:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:30:53] T299095: Links tables corrupted due to incorrectly parenthesized delete queries - https://phabricator.wikimedia.org/T299095
[04:34:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[04:36:49] (PS1) Andrew Bogott: nfs::standalone: Don't attach network interface unless volume is attached [puppet] - https://gerrit.wikimedia.org/r/753611 (https://phabricator.wikimedia.org/T293800)
[04:38:44] (CR) Andrew Bogott: [C: +2] nfs::standalone: Don't attach network interface unless volume is attached [puppet] - https://gerrit.wikimedia.org/r/753611 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[04:39:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[04:41:43] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:45:59] (PS7) Andrew Bogott: nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800)
[04:46:01] (PS1) Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[04:48:54] (CR) jerkins-bot: [V: -1] Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[04:59:52] (CR) Andrew Bogott: [C: +2] nfs/add_server: include the option to create and attach a service ip/fqdn [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753206 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[05:00:44] !log doing T299095 restorations on s3 wikis
[05:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:48] T299095: Links tables corrupted due to incorrectly parenthesized delete queries - https://phabricator.wikimedia.org/T299095
[05:31:47] (CR) Tim Starling: [C: +2] Database::factorConds(): fix insufficient parenthesization [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753498 (https://phabricator.wikimedia.org/T299095) (owner: Tim Starling)
[05:33:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[05:38:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[05:50:12] (Merged) jenkins-bot: Database::factorConds(): fix insufficient parenthesization [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753498 (https://phabricator.wikimedia.org/T299095) (owner: Tim Starling)
[05:53:57] !log tstarling@deploy1002 Synchronized php-1.38.0-wmf.17/tests/phpunit/unit/includes/libs/rdbms/database/DatabaseSQLTest.php: (no justification provided) (duration: 01m 32s)
[05:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[05:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: repooling after maintenance and reimage', diff saved to https://phabricator.wikimedia.org/P18711 and previous config saved to /var/cache/conftool/dbconfig/20220113-055602-root.json
[05:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[05:56:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[05:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[05:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:31] there were local changes in deploy1002 php-1.38.0-wmf.17 , trying to sort it out [06:05:10] !log tstarling@deploy1002 Synchronized php-1.38.0-wmf.17/includes/libs/rdbms/database/Database.php: (no justification provided) (duration: 01m 27s) [06:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: repooling after maintenance and reimage', diff saved to https://phabricator.wikimedia.org/P18712 and previous config saved to /var/cache/conftool/dbconfig/20220113-061105-root.json [06:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:09] (03PS1) 10Ladsgroup: export: Remove ignoring rev_page_id index [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753501 (https://phabricator.wikimedia.org/T163532) [06:13:15] (03CR) 10Ladsgroup: [C: 03+2] export: Remove ignoring rev_page_id index [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753501 (https://phabricator.wikimedia.org/T163532) (owner: 10Ladsgroup) [06:20:09] (03PS1) 10Legoktm: Update legoktm's email address [puppet] - 10https://gerrit.wikimedia.org/r/753614 [06:26:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: repooling after maintenance and reimage', diff saved to https://phabricator.wikimedia.org/P18713 and previous config saved to /var/cache/conftool/dbconfig/20220113-062609-root.json [06:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:27] !log Remove rev_page_id from frwiki,jawiki,ruwiki and labswiki from db1096 (s6) T285149 [06:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:30] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [06:32:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - 
https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [06:33:33] (03PS1) 10Marostegui: db1169: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753615 [06:34:13] (03CR) 10Marostegui: [C: 03+2] db1169: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753615 (owner: 10Marostegui) [06:34:15] (03Merged) 10jenkins-bot: export: Remove ignoring rev_page_id index [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753501 (https://phabricator.wikimedia.org/T163532) (owner: 10Ladsgroup) [06:36:46] (03PS1) 10Marostegui: wmnet: Failover m3-master to dbproxy1020 [dns] - 10https://gerrit.wikimedia.org/r/753616 (https://phabricator.wikimedia.org/T298586) [06:37:50] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master to dbproxy1020 [dns] - 10https://gerrit.wikimedia.org/r/753616 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [06:37:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [06:38:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [06:38:13] !log Failover m3 proxy from dbproxy1016 to dbproxy1020 T298586 [06:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:17] T298586: Upgrade all dbproxy hosts to Bullseye - https://phabricator.wikimedia.org/T298586 [06:39:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [06:39:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [06:39:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: repooling after maintenance and reimage', diff saved to https://phabricator.wikimedia.org/P18714 and previous config saved to /var/cache/conftool/dbconfig/20220113-064113-root.json [06:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:24] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.16/includes/export/WikiExporter.php: Backport: [[gerrit:753501|export: Remove ignoring rev_page_id index (T163532)]] (duration: 01m 28s) [06:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:27] T163532: Drop index rev_page_id (rev_page, rev_id) - https://phabricator.wikimedia.org/T163532 [06:42:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [06:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:39] (03PS2) 10KartikMistry: Deploy Flores MT [deployment-charts] - 10https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584) [06:59:17] (03PS1) 10Marostegui: dbproxy1015: Reimage to Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/753617 (https://phabricator.wikimedia.org/T298586) [07:01:23] (03CR) 10Marostegui: [C: 03+2] dbproxy1015: Reimage to Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/753617 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [07:03:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1015.eqiad.wmnet with OS bullseye [07:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:51] (03PS1) 10Ladsgroup: Take LogicException into consideration [extensions/SpamBlacklist] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753504 
(https://phabricator.wikimedia.org/T299111) [07:17:06] (03PS1) 10Ladsgroup: Take LogicException into consideration [extensions/SpamBlacklist] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753505 (https://phabricator.wikimedia.org/T299111) [07:24:28] (03PS1) 10Marostegui: Revert "db2078: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/753626 [07:25:32] (03CR) 10Marostegui: [C: 03+2] Revert "db2078: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/753626 (owner: 10Marostegui) [07:29:00] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:30:22] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:31:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:32:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1015.eqiad.wmnet with OS bullseye [07:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:06] (03PS6) 10ArielGlenn: Dumps: Clarify licensing for Wikidata and update various links [puppet] - 10https://gerrit.wikimedia.org/r/730243 (https://phabricator.wikimedia.org/T279436) (owner: 10Dylsss) [07:34:18] (03CR) 10ArielGlenn: [C: 03+2] Dumps: Clarify licensing for Wikidata and update various links [puppet] - 10https://gerrit.wikimedia.org/r/730243 (https://phabricator.wikimedia.org/T279436) (owner: 10Dylsss) [07:36:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:39:12] (03CR) 10Ladsgroup: [C: 03+2] Take LogicException 
into consideration [extensions/SpamBlacklist] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753505 (https://phabricator.wikimedia.org/T299111) (owner: 10Ladsgroup) [07:39:19] (03CR) 10Ladsgroup: [C: 03+2] Take LogicException into consideration [extensions/SpamBlacklist] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753504 (https://phabricator.wikimedia.org/T299111) (owner: 10Ladsgroup) [07:50:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchangeslinked group from s7 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18715 and previous config saved to /var/cache/conftool/dbconfig/20220113-075012-marostegui.json [07:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:16] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [07:51:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10elukey) a:05elukey→03Cmjohnson Hi folks, I am not an expert in partman recipes too, but I can add some notes: * Puppet's `netboot.cfg`... [07:57:35] !log stop kafka* on kafka-main1003 as prep-step for reimage to buster [07:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:23] (03CR) 10jerkins-bot: [V: 04-1] Take LogicException into consideration [extensions/SpamBlacklist] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753505 (https://phabricator.wikimedia.org/T299111) (owner: 10Ladsgroup) [07:59:17] (03CR) 10Ladsgroup: [C: 03+2] "..." 
[extensions/SpamBlacklist] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753505 (https://phabricator.wikimedia.org/T299111) (owner: 10Ladsgroup) [08:02:36] !log ipmi mc reset cold for kafka-main1003, mgmt interface not reachable via ssh [08:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:48] always a joy [08:03:35] (03Merged) 10jenkins-bot: Take LogicException into consideration [extensions/SpamBlacklist] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753504 (https://phabricator.wikimedia.org/T299111) (owner: 10Ladsgroup) [08:06:20] !log Change innodb_checksum_algorithm=full_crc32 on eqiad sanitarium hosts (db1154, db1155) T287244 [08:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:24] T287244: Considering switching innodb_checksum_algorithm=full_crc32 - https://phabricator.wikimedia.org/T287244 [08:08:06] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main1003.eqiad.wmnet with OS buster [08:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:09:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:19] (03PS1) 10Marostegui: db_inventory.my.cnf: innodb_checksum_algorithm=full_crc32 [puppet] - 10https://gerrit.wikimedia.org/r/753678 (https://phabricator.wikimedia.org/T287244) [08:13:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on 
pinkunicorn [08:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:04] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main1001 is CRITICAL: 479 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka-main1001 [08:14:25] (03CR) 10Marostegui: [C: 03+2] db_inventory.my.cnf: innodb_checksum_algorithm=full_crc32 [puppet] - 10https://gerrit.wikimedia.org/r/753678 (https://phabricator.wikimedia.org/T287244) (owner: 10Marostegui) [08:14:50] downtimed the alert above, it is expected with one node down [08:15:02] the reimage is in progress :) [08:16:11] (03Merged) 10jenkins-bot: Take LogicException into consideration [extensions/SpamBlacklist] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/753505 (https://phabricator.wikimedia.org/T299111) (owner: 10Ladsgroup) [08:21:36] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.17/extensions/SpamBlacklist/includes/SpamBlacklistHooks.php: Backport: [[gerrit:753504|Take LogicException into consideration (T299111)]] (duration: 01m 28s) [08:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:40] T299111: Cannot revert a structured data edit in Commons - https://phabricator.wikimedia.org/T299111 [08:23:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:27:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:37] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.16/extensions/SpamBlacklist/includes/SpamBlacklistHooks.php: Backport: [[gerrit:753505|Take LogicException into consideration (T299111)]] (duration: 01m 28s) [08:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:40] T299111: Cannot revert a structured data edit in Commons - https://phabricator.wikimedia.org/T299111 [08:28:55] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (10MoritzMuehlenhoff) Interesting, thanks for reverting quickly! So the mail issues on 2021-12-03 weren't just a Heisenbug after all, but we'll probably need a less product... [08:31:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [08:34:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:37:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:39:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchanges group from s7 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18717 and previous config saved to /var/cache/conftool/dbconfig/20220113-083923-marostegui.json [08:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:28] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [08:39:35] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka-main1001 [08:39:59] !log ipmi mc reset cold for kafka-main1002, mgmt interface not reachable via ssh [08:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [08:42:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1003.eqiad.wmnet with OS buster [08:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:05] \o/ [08:42:47] will proceed with 1002 later on [08:43:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:50:17] PROBLEM - puppet last run on gitlab2001 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:54:29] (03PS1) 10Jelto: gitlab::restore: restore after rsync of backup [puppet] - 10https://gerrit.wikimedia.org/r/753680 (https://phabricator.wikimedia.org/T274463) [08:57:39] (03PS3) 10DCausse: sre.wdqs.data-reload: few fixes and cleanups [cookbooks] - 10https://gerrit.wikimedia.org/r/753426 [08:57:47] (03CR) 10DCausse: sre.wdqs.data-reload: few fixes and cleanups (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/753426 (owner: 10DCausse) [08:58:08] (03PS1) 10Marostegui: es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753681 (https://phabricator.wikimedia.org/T295965) [08:58:39] 10SRE-tools, 10Infrastructure-Foundations, 10netbox: Netbox Reports Ideas and Requests - https://phabricator.wikimedia.org/T222931 (10ayounsi) >>! In T222931#7617721, @Dzahn wrote: > [...] Cf. 
T283483 [08:59:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022, give weight to es1021 T295965 ', diff saved to https://phabricator.wikimedia.org/P18718 and previous config saved to /var/cache/conftool/dbconfig/20220113-085906-marostegui.json [08:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:11] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965 [08:59:18] (03CR) 10Marostegui: [C: 03+2] es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753681 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [09:00:05] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:00:41] RECOVERY - puppet last run on gitlab2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:00:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM lists1001.wikimedia.org [09:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:49] !log rebooting lists1001 (running lists.wikimedia.org) to pick up new KVM setting [09:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM lists1001.wikimedia.org [09:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:53] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:07:25] 10SRE, 10ops-eqiad, 10DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (10ayounsi) > @ayounsi corrected et-0/1/0 Rolled fiber. has link. Nice! and LLDP shows msw2 as neighbor. However I'm still unable to SSH to msw2, is the mgmt cable connected? ("From the picture, ge-0/0/0... [09:08:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1022.eqiad.wmnet with OS bullseye [09:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:14:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM xhgui1001.eqiad.wmnet [09:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM xhgui1001.eqiad.wmnet [09:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:40] (03PS1) 10Ayounsi: Various reports improvements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753684 [09:19:25] (03CR) 10jerkins-bot: [V: 04-1] Various reports improvements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753684 (owner: 10Ayounsi) [09:19:57] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:34] (03PS2) 10Ayounsi: Various reports improvements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753684 [09:23:27] (03CR) 10Ayounsi: [C: 03+2] Various reports improvements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753684 (owner: 10Ayounsi) [09:24:40] !log 
marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1022.eqiad.wmnet with OS bullseye [09:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1022.eqiad.wmnet with OS bullseye [09:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:40] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main1002.eqiad.wmnet with OS buster [09:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:31] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1022.eqiad.wmnet with OS bullseye [09:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1022.eqiad.wmnet with OS bullseye [09:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [09:32:40] !log joal@deploy1002 Started deploy [analytics/refinery@94ec386] (hadoop-test): Hotfix analytics deploy TEST [analytics/refinery@94ec386] [09:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:42] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1022.eqiad.wmnet with OS bullseye [09:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:39] !log joal@deploy1002 Finished deploy [analytics/refinery@94ec386] (hadoop-test): Hotfix analytics deploy TEST [analytics/refinery@94ec386] (duration: 06m 59s) [09:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:40:22] !log joal@deploy1002 Started deploy [analytics/refinery@94ec386] (thin): Hotfix analytics deploy THIN [analytics/refinery@94ec386] [09:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:30] !log joal@deploy1002 Finished deploy [analytics/refinery@94ec386] (thin): Hotfix analytics deploy THIN [analytics/refinery@94ec386] (duration: 00m 07s) [09:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:44] !log joal@deploy1002 Started deploy [analytics/refinery@94ec386]: Hotfix analytics deploy [analytics/refinery@94ec386] [09:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [09:42:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1022.eqiad.wmnet with OS bullseye [09:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:13] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:43:59] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [09:45:40] (03PS1) 10David Caro: wmcs.backups: ignore cloudinfra-nfs project [puppet] - 10https://gerrit.wikimedia.org/r/753688 (https://phabricator.wikimedia.org/T299120) [09:45:42] (03PS1) 10David Caro: wmcs.backups: sort alphabetically the entries [puppet] - 10https://gerrit.wikimedia.org/r/753689 (https://phabricator.wikimedia.org/T299120) [09:46:57] !log marostegui@cumin1001 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host es1022.eqiad.wmnet with OS bullseye [09:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:59] (03PS1) 10Btullis: Failover the hive services to the replicator coordinator [dns] - 10https://gerrit.wikimedia.org/r/753692 (https://phabricator.wikimedia.org/T297468) [09:49:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1022.eqiad.wmnet with OS bullseye [09:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:39] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:51:25] 10SRE, 10SRE-swift-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122 (10MatthewVernon) I think that failure is unrelated to this issue - looking at `/var/log/swiftrepl/2022-01-10-repl-commons.log`: ` 2022-01-10T08:00:01.713061 Traceback (most recent call la... 
[09:51:49] (03CR) 10Btullis: [C: 03+2] Failover the hive services to the replicator coordinator [dns] - 10https://gerrit.wikimedia.org/r/753692 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [09:53:47] PROBLEM - Check no envoy runtime configuration is left persistent on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 396 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [09:55:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backups: ignore cloudinfra-nfs project [puppet] - 10https://gerrit.wikimedia.org/r/753688 (https://phabricator.wikimedia.org/T299120) (owner: 10David Caro) [09:55:47] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:55:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backups: sort alphabetically the entries [puppet] - 10https://gerrit.wikimedia.org/r/753689 (https://phabricator.wikimedia.org/T299120) (owner: 10David Caro) [09:56:12] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1006.eqiad.wmnet [09:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:34] (03CR) 10David Caro: [C: 03+2] wmcs.backups: ignore cloudinfra-nfs project [puppet] - 10https://gerrit.wikimedia.org/r/753688 (https://phabricator.wikimedia.org/T299120) (owner: 10David Caro) [09:56:37] (03CR) 10David Caro: [C: 03+2] wmcs.backups: sort alphabetically the entries [puppet] - 10https://gerrit.wikimedia.org/r/753689 (https://phabricator.wikimedia.org/T299120) (owner: 10David Caro) [09:57:33] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:59:37] 10SRE-swift-storage: swift-repl failing due to auth failures - https://phabricator.wikimedia.org/T299122 (10MatthewVernon) [09:59:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1002.eqiad.wmnet with OS buster [09:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:01] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1006.eqiad.wmnet [10:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:29] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10BTullis) [10:01:15] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:01:42] 10SRE, 10SRE-swift-storage: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122 (10MatthewVernon) Split off into separate task, since this is something else going awry. 
[10:02:20] !log run kafka preferred-replica-election on kafka-main1001 to force a rebalance of partition leaders (after kafka-main1002's reimage) [10:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:31] !log joal@deploy1002 Finished deploy [analytics/refinery@94ec386]: Hotfix analytics deploy [analytics/refinery@94ec386] (duration: 21m 47s) [10:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:47] !log cp3052: upgrade varnish to 6.0.9-1wm1 T298758 [10:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:50] T298758: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 [10:03:13] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [10:09:40] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1022.eqiad.wmnet with OS bullseye [10:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:09] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (10mfossati) Thanks for your action @cmooney . I've tried to access all the blue-linked services in https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#wmf_group . The following ones are failin... 
[10:10:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM grafana1002.eqiad.wmnet [10:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:28] !log rebooting grafana1002 (running grafana.wikimedia.org) [10:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:49] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM grafana1002.eqiad.wmnet [10:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:50] 10SRE-swift-storage: swift-repl failing due to auth failures - https://phabricator.wikimedia.org/T299122 (10MatthewVernon) This is a consequence of two things: 1. T296767 2. `/srv/software/swiftrepl/swiftrepl.conf` being hand-edited I've confirmed that `swiftrepl.conf` has the previous `mw:media` eqiad key in. 
[10:22:09] 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Marostegui) [10:22:21] 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Marostegui) p:05Triage→03Medium [10:26:02] RECOVERY - Check systemd state on ms-fe1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:25] !log systemctl reset-failed ifup@ens5.service on lists1001 T273026 [10:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:28] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [10:27:33] I have messed up the CI Jenkins upgrade yesterday, it involved a manual migration step which I have missed [10:27:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18719 and previous config saved to /var/cache/conftool/dbconfig/20220113-102734-root.json [10:27:36] so I am going to do it now [10:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:51] 10SRE-swift-storage: swift-repl failing due to auth failures - https://phabricator.wikimedia.org/T299122 (10MatthewVernon) I've updated manually the relevant key on ms-fe1005 and swiftrepl is now going. I've also done the same edit on ms-fe2005. We should probably puppetize `swiftrepl.conf`. [10:27:54] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:28:00] hashar: out of interest, what was missing? 
[10:28:15] so security-wise it is all fine [10:28:51] but the new release brings a new feature: they are renaming the agent running jobs inside jenkins from `master` to `built-in` [10:29:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM irc1001.wikimedia.org [10:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:13] (03PS1) 10Marostegui: Revert "es1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/753629 [10:29:18] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:18] so if one has a job tied to run on `master` it would no longer work if the migration was applied automatically [10:29:34] it thus has to be done manually after all occurrences of `master` in the config have been changed to `built-in` [10:29:46] luckily, we don't use that built-in runner on our instances :] [10:29:55] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main1001.eqiad.wmnet with OS buster [10:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:57] (03CR) 10Marostegui: [C: 03+2] Revert "es1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/753629 (owner: 10Marostegui) [10:31:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM irc1001.wikimedia.org [10:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [10:32:47] 10SRE, 10serviceops: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 (10jbond) 05Open→03Resolved a:03jbond I have cleaned this up, seems like an old build environment hadn't
torn itself down properly. I have manually cleaned up [10:32:58] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [10:33:31] (03PS3) 10Cparle: Revert "Undo update to the way the search interface is set" [extensions/MediaSearch] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753487 [10:35:40] 10SRE-swift-storage: `swiftrepl.conf` should be puppet-managed - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) [10:36:00] 10SRE-swift-storage: `swiftrepl.conf` should be puppet-managed - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) p:05Triage→03Medium [10:36:30] 10SRE-swift-storage: swift-repl failing due to auth failures - https://phabricator.wikimedia.org/T299122 (10MatthewVernon) [10:36:32] 10SRE-swift-storage: `swiftrepl.conf` should be puppet-managed - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) [10:36:50] (03CR) 10Jbond: [C: 03+2] "will merge" [puppet] - 10https://gerrit.wikimedia.org/r/753614 (owner: 10Legoktm) [10:36:56] 10SRE-swift-storage: `swiftrepl.conf` should be puppet-managed - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) [10:37:00] 10SRE-swift-storage: swift-repl failing due to auth failures - https://phabricator.wikimedia.org/T299122 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [10:37:25] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [10:38:22] (03CR) 10Jbond: [C: 03+1] osm::usergrants: remove unused define [puppet] - 10https://gerrit.wikimedia.org/r/751162 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:38:44] (03CR) 10Jbond: [C: 03+1] mcrouter::monitoring: remove module [puppet] - 10https://gerrit.wikimedia.org/r/751136 (https://phabricator.wikimedia.org/T272559) (owner:
10David Caro) [10:39:20] (03CR) 10Jbond: [C: 03+1] parsoid: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751163 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:40:55] (03PS1) 10Marostegui: es2022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753698 (https://phabricator.wikimedia.org/T295965) [10:41:43] (03CR) 10Marostegui: [C: 03+2] es2022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753698 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [10:41:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [10:42:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18720 and previous config saved to /var/cache/conftool/dbconfig/20220113-104238-root.json [10:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2022.codfw.wmnet with OS bullseye [10:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM search-loader1001.eqiad.wmnet [10:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:19] (03PS1) 10Jbond: Revert "bgpalerter: update hiera" [puppet] - 10https://gerrit.wikimedia.org/r/753630 [10:45:27] (03PS2) 10Jbond: Revert "bgpalerter: update hiera" [puppet] - 10https://gerrit.wikimedia.org/r/753630 [10:45:40] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1007.eqiad.wmnet [10:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:07] (03CR) 10jerkins-bot: [V: 04-1] Revert 
"bgpalerter: update hiera" [puppet] - 10https://gerrit.wikimedia.org/r/753630 (owner: 10Jbond) [10:46:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM search-loader1001.eqiad.wmnet [10:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:36] (03PS3) 10Jbond: Revert "bgpalerter: update hiera" [puppet] - 10https://gerrit.wikimedia.org/r/753630 [10:46:46] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main1005 is CRITICAL: 54 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka-main1005 [10:47:08] ah this is me, downtime expired [10:47:12] last node being reimaged [10:47:15] (expected) [10:47:28] (03PS7) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) [10:47:41] (03PS1) 10Ayounsi: LibreNMS report improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753699 [10:47:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1007.eqiad.wmnet [10:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:19] (03CR) 10jerkins-bot: [V: 04-1] LibreNMS report improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753699 (owner: 10Ayounsi) [10:49:08] (03PS2) 10Ayounsi: LibreNMS report improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753699 [10:50:08] (03CR) 10Ayounsi: [C: 03+2] LibreNMS report improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753699 (owner: 10Ayounsi) [10:50:47] (03Merged) 10jenkins-bot: LibreNMS report improvments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753699 (owner: 10Ayounsi) [10:52:16] !log 
Restarting Jenkins CI for plugins update T298691 [10:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:19] T298691: 2022-01-12 Jenkins security advisory pre-announcement - https://phabricator.wikimedia.org/T298691 [10:53:11] moritzm: looks like a success :] [10:53:39] (03PS7) 10Thiemo Kreuz (WMDE): Make use of the ?? operator in some more situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740305 [10:55:58] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main1003 is CRITICAL: 390 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka-main1003 [10:56:39] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1008.eqiad.wmnet [10:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:08] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka-main1003 [10:57:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18721 and previous config saved to /var/cache/conftool/dbconfig/20220113-105741-root.json [10:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:52] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1008.eqiad.wmnet [10:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netboxdb1001.eqiad.wmnet [10:59:49] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220113T1100). [11:02:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netboxdb1001.eqiad.wmnet [11:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1001.eqiad.wmnet with OS buster [11:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netbox1001.wikimedia.org [11:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:53] !log rebooting netbox1001 (running netbox.wikimedia.org) [11:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:12] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1009.eqiad.wmnet [11:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:11] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Marostegui) [11:11:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netbox1001.wikimedia.org [11:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1009.eqiad.wmnet [11:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18722 and previous config saved to 
/var/cache/conftool/dbconfig/20220113-111245-root.json [11:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:07] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka-main1005 [11:16:26] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [11:16:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testreduce1001.eqiad.wmnet [11:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:49] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, 10Patch-For-Review: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) I did the following test: # - Try to upload the image via Special:Upload to testwiki using "upload via url", which currently has `wmgUsePage... 
[11:18:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2022.codfw.wmnet with OS bullseye [11:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:28] (03PS9) 10Arturo Borrero Gonzalez: wmcs: toolforge: add cookbooks to scale the grid with each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [11:20:29] (03PS7) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) [11:20:32] (03PS7) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [11:20:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testreduce1001.eqiad.wmnet [11:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:28] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:23:04] (03PS8) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) [11:23:06] (03PS10) 10Arturo Borrero Gonzalez: wmcs: toolforge: add cookbooks to scale the grid with each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [11:23:08] (03PS8) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) [11:23:10] (03PS8) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 
(https://phabricator.wikimedia.org/T298948) [11:23:12] (03PS1) 10Arturo Borrero Gonzalez: wmcs: README: update example [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753704 [11:23:30] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:23:55] !log oblivian@deploy1002 Started deploy [restbase/deploy@0848b15]: (no justification provided) [11:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:04] !log oblivian@deploy1002 Finished deploy [restbase/deploy@0848b15]: (no justification provided) (duration: 00m 09s) [11:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:49] !log oblivian@deploy1002 Started deploy [restbase/deploy@0848b15]: scap testing [11:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:58] !log oblivian@deploy1002 Finished deploy [restbase/deploy@0848b15]: scap testing (duration: 00m 09s) [11:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:23] <_joe_> !log update scap everywhere T298986 [11:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:26] T298986: Deploy Scap version 4.1.1 - https://phabricator.wikimedia.org/T298986 [11:26:30] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM kafka-test1010.eqiad.wmnet [11:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18723 and previous config saved to /var/cache/conftool/dbconfig/20220113-112749-root.json [11:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:56] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - 
https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [11:28:47] (03CR) 10David Caro: [C: 03+1] wmcs: README: update example (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753704 (owner: 10Arturo Borrero Gonzalez) [11:29:00] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: hive-server2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:52] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:31:14] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [11:31:36] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [11:33:42] (03PS1) 10Marostegui: x2 hosts: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753707 (https://phabricator.wikimedia.org/T298769) [11:34:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kafka-test1010.eqiad.wmnet [11:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:22] PROBLEM - Check systemd state on kafka-test1010 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:41] (03CR) 10Marostegui: [C: 03+2] x2 hosts: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/753707 (https://phabricator.wikimedia.org/T298769) (owner: 10Marostegui) [11:34:44] (03CR) 10David Caro: [C: 03+1] wmcs: toolforge: grid: factorized node creation cookbook (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [11:35:02] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10BTullis) [11:38:38] (03CR) 10David Caro: "Got a couple questions, otherwise ok" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [11:40:18] (03CR) 10David Caro: "Only the path/name thing" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [11:40:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: README: update example [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753704 (owner: 10Arturo Borrero Gonzalez) [11:41:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [11:42:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18724 and previous config saved to /var/cache/conftool/dbconfig/20220113-114252-root.json [11:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:49] 10SRE, 10SRE-Access-Requests: Add bking as icinga user - https://phabricator.wikimedia.org/T298738 (10cmooney) @bking Any update? Let me know if it's working and I can close the task, or otherwise review the setup. thanks. 
[11:46:24] (03PS1) 10Btullis: Fail back the hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/753709 (https://phabricator.wikimedia.org/T297468) [11:46:58] (03CR) 10David Caro: "Mostly one question about affinity and such" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [11:51:54] (03PS9) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) [11:51:56] (03PS11) 10Arturo Borrero Gonzalez: wmcs: toolforge: add cookbooks to scale the grid with each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [11:51:58] (03PS9) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) [11:52:00] (03PS9) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [11:52:41] (03PS10) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) [11:52:43] (03PS12) 10Arturo Borrero Gonzalez: wmcs: toolforge: add cookbooks to scale the grid with each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [11:52:45] (03PS10) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) [11:52:47] (03PS10) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [11:54:33] (03CR) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [11:57:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18725 and previous config saved to /var/cache/conftool/dbconfig/20220113-115756-root.json [11:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:35] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM eventlog1003.eqiad.wmnet [11:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:54] (03PS11) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) [11:59:56] (03PS13) 10Arturo Borrero Gonzalez: wmcs: toolforge: add cookbooks to scale the grid with each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) [11:59:58] (03PS11) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) [12:00:00] (03PS11) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [12:00:04] Amir1, Lucas_WMDE, and apergos: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220113T1200). 
[12:00:16] nothing to deploy [12:00:18] (03CR) 10Arturo Borrero Gonzalez: wmcs: toolforge: add cookbooks to scale the grid with each node type (034 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [12:00:28] no open training tasks either [12:00:31] (03CR) 10Arturo Borrero Gonzalez: wmcs: toolforge: relocate some node-specific cookbooks (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [12:00:39] 10SRE, 10LDAP-Access-Requests: Grant Access to cn=wmf and cn=ops for Nmaphophe - https://phabricator.wikimedia.org/T298868 (10cmooney) [12:00:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, analytics-admins for Ntsako Maphophe - https://phabricator.wikimedia.org/T299066 (10cmooney) [12:01:30] I'm here, there was one patch which had i18n message additions, I see it's been removed [12:01:40] there are no trainees for the window [12:02:43] 10SRE, 10LDAP-Access-Requests: Grant Access to cn=wmf and cn=ops for Nmaphophe - https://phabricator.wikimedia.org/T298868 (10cmooney) 05Open→03Resolved a:03cmooney Hi @ntsako Thanks for the feedback. I've made some changes now, and processed your other request, so hopefully you can now get in. Be ad... 
[12:02:44] cormacparle removed the one patch from the window so that's it for today [12:02:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, analytics-admins for Ntsako Maphophe - https://phabricator.wikimedia.org/T299066 (10cmooney) [12:03:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM eventlog1003.eqiad.wmnet [12:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:14] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10BTullis) [12:05:20] (03PS1) 10Cathal Mooney: Add Ntsako Maphophe (ntsako) SSH pubkey and analytic group membership [puppet] - 10https://gerrit.wikimedia.org/r/753711 (https://phabricator.wikimedia.org/T299066) [12:13:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18726 and previous config saved to /var/cache/conftool/dbconfig/20220113-121300-root.json [12:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:51] (03CR) 10Cathal Mooney: [C: 03+2] Add Ntsako Maphophe (ntsako) SSH pubkey and analytic group membership [puppet] - 10https://gerrit.wikimedia.org/r/753711 (https://phabricator.wikimedia.org/T299066) (owner: 10Cathal Mooney) [12:21:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-corp1001.wikimedia.org [12:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-corp1001.wikimedia.org [12:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM cloudbackup1002-dev.eqiad.wmnet [12:27:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18727 and previous config saved to /var/cache/conftool/dbconfig/20220113-122803-root.json [12:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cloudbackup1002-dev.eqiad.wmnet [12:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [12:32:43] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [12:37:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from es1021', diff saved to https://phabricator.wikimedia.org/P18728 and previous config saved to /var/cache/conftool/dbconfig/20220113-123744-marostegui.json [12:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [12:41:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove all special groups from s3 codfw T263127', diff saved to https://phabricator.wikimedia.org/P18729 and previous config saved to /var/cache/conftool/dbconfig/20220113-124140-marostegui.json [12:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:44] T263127: Remove groups from db configs - 
https://phabricator.wikimedia.org/T263127 [12:43:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove contributions group from s3 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18730 and previous config saved to /var/cache/conftool/dbconfig/20220113-124300-marostegui.json [12:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18731 and previous config saved to /var/cache/conftool/dbconfig/20220113-124307-root.json [12:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:54] (03CR) 10MMandere: [C: 03+2] drmrs: lvs/cp puppetization [puppet] - 10https://gerrit.wikimedia.org/r/748752 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [12:44:41] (03PS8) 10MMandere: drmrs: lvs/cp puppetization [puppet] - 10https://gerrit.wikimedia.org/r/748752 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [12:50:19] (03PS1) 10Cparle: Remove fulltext normalisation in synonyms profile for performance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753722 (https://phabricator.wikimedia.org/T293106) [12:54:57] (03PS12) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [12:57:33] (03PS13) 10Arturo Borrero Gonzalez: wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) [12:58:00] (03PS1) 10Jbond: bgpalerter: convert rest to a structurd type [puppet] - 10https://gerrit.wikimedia.org/r/753723 [12:58:02] (03PS1) 10Jbond: bgpalerter: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753724 [12:58:37] (03CR) 10Arturo Borrero Gonzalez: wmcs: relocate 
start_instance_with_prefix cookbook (035 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [12:59:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users, analytics-admins for Ntsako Maphophe - https://phabricator.wikimedia.org/T299066 (10cmooney) OK @ntsako hopefully we have everything in place now. Can you try to log on to required systems again and advise ho... [12:59:13] (03CR) 10jerkins-bot: [V: 04-1] bgpalerter: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753724 (owner: 10Jbond) [13:02:42] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:21] (03PS9) 10Hnowlan: partman: add reuse partman profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) [13:08:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM cloudbackup1001-dev.eqiad.wmnet [13:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cloudbackup1001-dev.eqiad.wmnet [13:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:12:42] PROBLEM - Confd template for /srv/config-master/pybal/esams/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:00] PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload-https on 
puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:00] PROBLEM - Confd template for /srv/config-master/pybal/esams/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:00] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/upload-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:08] PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:08] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:10] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:12] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:14] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:30] mmandere: related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/748752/? 
^ [13:13:42] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:44] PROBLEM - Confd template for /srv/config-master/pybal/esams/upload-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:46] PROBLEM - Confd template for /srv/config-master/pybal/codfw/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:48] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:54] PROBLEM - Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:54] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/upload-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:13:58] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:14:22] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/upload-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:14:22] PROBLEM - Confd template for 
/srv/config-master/pybal/eqsin/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:14:30] PROBLEM - Confd template for /srv/config-master/pybal/codfw/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:15:28] yes, probably related [13:16:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=varnish-upload site=drmrs https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:16:40] yeah so the actual error is: [13:16:43] Jan 13 13:16:06 puppetmaster2001 confd[26165]: 2022-01-13T13:16:06Z puppetmaster2001 /usr/bin/confd[26165]: ERROR "updating error mtime on /var/run/confd-template/.upload268226911.err\nfailed linting '/usr/local/bin/pybal-eval-check /srv/config-master/pybal/drmrs/.upload268226911' with 1 (0.021153926849365234s) [invalid]: server pool cannot be empty!\n\n" [13:17:02] basically we have new pools defined for drmrs, but I guess they're all empty because we're configuring the first hosts [13:17:23] let's see if it resolves on its own once we finish the first couple [13:20:08] (03PS3) 10Eigyan: [wmf-config] Deploy the cawiki test safety survey to production. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/753543 (https://phabricator.wikimedia.org/T296657) [13:20:33] (03PS1) 10Jbond: C:puppetmaster: pass the http_proxy parameter [puppet] - 10https://gerrit.wikimedia.org/r/753727 [13:20:55] (03CR) 10David Caro: [C: 03+1] wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [13:21:08] (03CR) 10jerkins-bot: [V: 04-1] C:puppetmaster: pass the http_proxy parameter [puppet] - 10https://gerrit.wikimedia.org/r/753727 (owner: 10Jbond) [13:21:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd1001.eqiad.wmnet with reason: switch to drbd storage [13:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd1001.eqiad.wmnet with reason: switch to drbd storage [13:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:25] (03CR) 10David Caro: [C: 03+1] wmcs: toolforge: add cookbooks to scale the grid with each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [13:23:11] !log switch ml-etcd1001 to DRBD (needed to be able to shuffle instances around for the Ganeti buster update) [13:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:42] (03CR) 10David Caro: [C: 03+1] wmcs: toolforge: relocate some node-specific cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [13:24:21] PROBLEM - Host text-lb.drmrs.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:22] PROBLEM - Host text-lb.drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% 
[13:24:23] PROBLEM - Host upload-lb.drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [13:24:24] PROBLEM - Host upload-lb.drmrs.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:59] those are not in use, right? [13:25:05] why is drmrs paging [13:25:06] yeah [13:25:08] Safe to assume not production? [13:25:10] please ignore [13:25:10] (03CR) 10David Caro: [C: 03+1] wmcs: relocate start_instance_with_prefix cookbook (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [13:25:10] * Emperor was just coming to see about those pages [13:25:24] XioNoX: you OK to Ack the alerts? [13:25:24] /ignore xionox [13:25:32] 🐟 [13:26:06] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp6001 is CRITICAL: connect to address 10.136.0.6 and port 3120: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:26:26] PROBLEM - Confd template for /srv/config-master/pybal/codfw/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:26] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/upload on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:26] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:26] PROBLEM - Confd template for /srv/config-master/pybal/codfw/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:26] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/upload on puppetmaster1001 is CRITICAL: 
Compilation of file /srv/config-master/pybal/eqsin/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:42] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:48] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/upload-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:26:49] o.O [13:26:51] <_joe_> ok what's going on? [13:26:56] here [13:27:20] Emperor: leaving it to traffic, I don't think it's network related? [13:27:22] <_joe_> ok what changed for etcd? [13:27:47] we're bringing up the first cp node in drmrs [13:28:01] apparently a lot of things aren't happy about that, but nothing should be actually broken to worry about [13:28:18] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp6001 is CRITICAL: connect to address 10.136.0.6 and port 3121: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:28:49] <_joe_> bblack: I see that the lvs pool for cache upload is currently empty, that's the origin of those errors about confd [13:28:52] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:28:52] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/upload-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:29:24] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp6001 is OK: HTTP OK: HTTP/1.1 200 OK - 469 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:29:32] 
bblack: should I ack them? [13:29:40] so it doesn't ping again in 24 hours [13:29:49] downtiming [13:29:50] <_joe_> Amir1: let traffic handle it [13:29:51] (03PS2) 10Jbond: C:puppetmaster: pass the http_proxy parameter [puppet] - 10https://gerrit.wikimedia.org/r/753727 [13:29:52] (for a month) [13:30:12] sure, thanks [13:30:12] _joe_: yes, the first node apparently has to puppet before there's something in the pool, so there's an unavoidable time gap there [13:30:45] <_joe_> bblack: you can add it before, but that creates other problems ofc [13:30:54] <_joe_> to etcd I mean [13:31:20] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/upload on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:31:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:31:36] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp6001 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:33:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM urldownloader1001.wikimedia.org [13:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:37] (03CR) 10Jbond: [C: 03+2] C:puppetmaster: pass the http_proxy parameter [puppet] - 10https://gerrit.wikimedia.org/r/753727 (owner: 10Jbond) [13:33:42] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/text on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/drmrs/text https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:33:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - 
https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [13:33:55] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [13:35:51] working on the first text node now, hopefully once we have one of each, the confd errors will go away [13:35:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM urldownloader1001.wikimedia.org [13:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:10] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/text-https on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/drmrs/text-https https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:37:20] PROBLEM - Host cp6001 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={cadvisor,varnish-upload,varnishkafka} site=drmrs https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:37:44] RECOVERY - Host cp6001 is UP: PING WARNING - Packet loss = 60%, RTA = 86.13 ms [13:38:26] (03CR) 10Elukey: [C: 03+1] Fail back the hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/753709 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [13:38:30] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/upload on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/drmrs/upload https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:38:33] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [13:39:45] (03PS1) 10Muehlenhoff: Failover urldownloader in eqiad to 1001 [dns] - 10https://gerrit.wikimedia.org/r/753730 [13:40:34] I 
just acked the drmrs alerts, can we resolve them? [13:40:37] (03PS1) 10Ayounsi: LibreNMS report, skip devices with no IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753731 [13:40:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:40:53] <_joe_> moritzm: let me check one thing before you failover urldownloader [13:40:56] they will auto resolve once reachable I think [13:41:02] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:41:02] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/upload-https on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/drmrs/upload-https https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:41:28] XioNoX: they won't be reachable anytime soon (routers, for pybal bgp) [13:41:33] for the site-level public addrs that paged [13:42:13] so yeah, those should probably just be resolved at the paging level. they're downtimed for a month at the icinga level now. [13:42:17] _joe_: ack! [13:42:20] (they didn't exist before they paged, sorry!) [13:42:22] <_joe_> moritzm: sorry, just did a more thorough check, you can go on [13:42:28] ah, right! [13:42:31] ok [13:42:31] thx! 
[13:43:28] PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:43:28] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/upload-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:44:14] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [13:44:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={trafficserver,trafficserver-text,varnish-text} site=drmrs https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:45:52] PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:45:57] !log mmandere@cumin1001 conftool action : set/pooled=yes; selector: name=cp6001.drmrs.wmnet [13:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:57] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd1002.eqiad.wmnet with reason: switch to drbd storage [13:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd1002.eqiad.wmnet with reason: switch to drbd storage [13:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:14] !log switch ml-etcd1002 to DRBD (needed to be able to shuffle 
instances around for the Ganeti buster update) [13:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:39] (03CR) 10Muehlenhoff: [C: 03+2] Failover urldownloader in eqiad to 1001 [dns] - 10https://gerrit.wikimedia.org/r/753730 (owner: 10Muehlenhoff) [13:50:46] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:50:46] PROBLEM - Confd template for /srv/config-master/pybal/esams/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:52:24] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:52:26] PROBLEM - Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:53:32] PROBLEM - Confd template for /srv/config-master/pybal/esams/upload on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:53:45] !log mmandere@cumin1001 conftool action : set/pooled=yes; selector: name=cp6009.drmrs.wmnet [13:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:36] (03PS1) 10Jbond: hieradata - add cloud-puppetmaster-03 cert for upload [puppet] - 10https://gerrit.wikimedia.org/r/753732 [13:54:40] PROBLEM - Confd template for /srv/config-master/pybal/esams/upload-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/upload-https is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:55:03] (03CR) 10Jbond: [C: 03+2] hieradata - add cloud-puppetmaster-03 cert for upload [puppet] - 10https://gerrit.wikimedia.org/r/753732 (owner: 10Jbond) [13:55:30] PROBLEM - Host cp6009 is DOWN: PING CRITICAL - Packet loss = 100% [13:55:55] the confd template issues seem to be slowly disappearing now in icinga, but I never saw the recoveries here, I think [13:56:05] let me re-run them all [13:56:18] RECOVERY - Host cp6009 is UP: PING WARNING - Packet loss = 77%, RTA = 86.11 ms [13:59:08] yeah there's a little more to it. I think it resolved the drmrs ones when I pooled one drmrs node of each type [13:59:25] but didn't trigger clearing the other-site errors (e.g. ulsfo template failing because of missing drmrs nodes) [14:00:44] <_joe_> bblack: not sure why they're firing, lemme take a look [14:01:29] well for whatever reason, e.g. the drmrs-upload problem also breaks template compilation for e.g. esams-upload (kind of understandable) [14:01:32] <_joe_> ok, first thing: confd is ok with all the templates [14:01:49] and then yeah, I don't see any ongoing errors in the logs, but icinga state is stuck [14:01:56] probably no mtime updates on the ones that fixed [14:02:00] <_joe_> it's the check [14:02:08] <_joe_> not an actual error [14:05:10] RECOVERY - Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:10] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:11] RECOVERY - Confd template for /srv/config-master/pybal/codfw/upload on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:11] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/upload-https on puppetmaster1001 is OK: No errors detected 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:11] RECOVERY - Confd template for /srv/config-master/pybal/codfw/upload-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:11] RECOVERY - Confd template for /srv/config-master/pybal/esams/text on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:11] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:12] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/upload on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:12] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/upload-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:13] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/text on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:13] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/upload on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:14] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/text-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:14] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:15] RECOVERY - Confd template for /srv/config-master/pybal/codfw/text on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:15] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/upload-https on puppetmaster1001 is OK: No errors detected 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:16] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:16] RECOVERY - Confd template for /srv/config-master/pybal/codfw/text-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:17] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/upload on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:05:22] <_joe_> yes so, the problem is that the error files have the same name for all sites [14:05:28] <_joe_> "upload" and "text" [14:05:47] <_joe_> so there is no way to distinguish them properly, I guess [14:05:54] RECOVERY - Confd template for /srv/config-master/pybal/esams/upload on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:06:00] RECOVERY - Confd template for /srv/config-master/pybal/esams/upload-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:06:04] thanks for fixing it up! I was still reading through the nagios check code heh [14:06:24] <_joe_> drmrs was fixed because the actual file was recreated after the errors [14:06:53] <_joe_> there were a lot of files like .upload705952794.err [14:07:14] <_joe_> ok at least we know what breaks [14:09:13] _joe_: where do those err files end up at? 
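(Editor's note: a minimal sketch of the name collision _joe_ describes above, not the actual nagios check. It assumes the check attributes an error file to a template by basename only; the regex and the function name `matching_error_files` are hypothetical, modelled solely on the `.upload705952794.err` filename quoted in the log.)

```python
import os
import re

# Hypothetical pattern, modelled on the ".upload705952794.err" name quoted in the log:
# a leading dot, the template's basename, a run of digits, then ".err".
ERR_NAME = re.compile(r"^\.(?P<base>.+?)(?P<rand>\d+)\.err$")


def matching_error_files(target_path, err_names):
    """Return the error-file names a basename-only check would blame on target_path."""
    base = os.path.basename(target_path)
    hits = []
    for name in err_names:
        m = ERR_NAME.match(name)
        # Only the basename is compared, so the site directory plays no part.
        if m and m.group("base") == base:
            hits.append(name)
    return hits


if __name__ == "__main__":
    errs = [".upload705952794.err"]
    for site in ("drmrs", "esams", "ulsfo"):
        path = f"/srv/config-master/pybal/{site}/upload"
        print(path, matching_error_files(path, errs))
```

Under that assumption, one leftover error file for `upload` is attributed to the `upload` template of every site, which would be consistent with the other-site alerts staying CRITICAL until the stale files were dealt with.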
[14:09:33] ah I found them [14:09:43] /var/run/confd-template [14:10:02] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:02] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:02] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:02] RECOVERY - Confd template for /srv/config-master/pybal/codfw/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:02] RECOVERY - Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:02] RECOVERY - Confd template for /srv/config-master/pybal/esams/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:02] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:03] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:03] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:04] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:04] RECOVERY - Confd template for /srv/config-master/pybal/codfw/upload on puppetmaster2001 is OK: 
No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:05] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:05] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:06] RECOVERY - Confd template for /srv/config-master/pybal/codfw/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:06] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:07] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:30] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:36] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:36] <_joe_> bblack: /var/run/confd [14:10:41] (03PS3) 10Minato826: Enable ArticlePlaceholder on dagwiki Bug: T298349 Change-Id: Ibcc542a16127bda4cbc007e127095f2070a4d9b8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753634 (https://phabricator.wikimedia.org/T298349) [14:10:46] RECOVERY - Confd template for /srv/config-master/pybal/codfw/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:48] <_joe_> heh you found it [14:10:52] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/upload-https on puppetmaster2001 is OK: No errors detected 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:10:56] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:11:00] RECOVERY - Confd template for /srv/config-master/pybal/esams/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:11:08] RECOVERY - Confd template for /srv/config-master/pybal/esams/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:14:17] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd1003.eqiad.wmnet with reason: switch to drbd storage [14:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd1003.eqiad.wmnet with reason: switch to drbd storage [14:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:11] (03PS4) 10Minato826: Enable ArticlePlaceholder on dagwiki Bug: T298349 Change-Id: Ibcc542a16127bda4cbc007e127095f2070a4d9b8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753634 (https://phabricator.wikimedia.org/T298349) [14:15:14] !log switch ml-etcd1003 to DRBD (needed to be able to shuffle instances around for the Ganeti buster update) [14:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:19:22] (03CR) 10Hnowlan: [C: 03+2] partman: add reuse partman profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [14:23:26] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:24:09] (03PS2) 10Btullis: Fail back the hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/753709 (https://phabricator.wikimedia.org/T297468) [14:26:41] (03PS5) 10Minato826: Enable ArticlePlaceholder on dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753634 (https://phabricator.wikimedia.org/T298349) [14:32:58] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [14:36:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [14:39:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM idp1001.wikimedia.org [14:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, analytics-admins for Ntsako Maphophe - https://phabricator.wikimedia.org/T299066 (10cmooney) 05Open→03Resolved a:03cmooney [14:43:59] (03PS6) 10Minato826: Enable ArticlePlaceholder on dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753634 (https://phabricator.wikimedia.org/T298349) [14:46:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - 
https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [14:47:05] !log systemctl reset-failed ifup@ens5.service on idp1001 T273026 [14:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:09] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [14:47:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp1001.wikimedia.org [14:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:38] PROBLEM - Check no envoy runtime configuration is left persistent on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 396 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:49:38] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10bking) a:05Papaul→03bking [14:49:47] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10bking) 05Resolved→03In progress [14:53:14] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10cmooney) 05Open→03Resolved Confirmed working. Closing task. 
[14:55:24] (03PS2) 10Joal: Add network_internal_flows to refine and druid-load [puppet] - 10https://gerrit.wikimedia.org/r/748097 (https://phabricator.wikimedia.org/T263277) [14:55:38] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10bking) Per yesterday's conversation with @Gehel (and Moritz's suggestion above) , we have elected to reimage this server to Stretch and deal with the Bullseye issues sepa... [14:55:50] ottomata: I just sent a puppet patch for refine and hive-to-druid - https://gerrit.wikimedia.org/r/c/operations/puppet/+/748097/ [14:56:25] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster [14:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:38] ottomata: everything else is in place, so we should be ready to merge that when you wish [14:56:41] !log cp3053: upgrade varnish to 6.0.9-1wm1 T298758 [14:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:44] T298758: Package and deploy Varnish 6.0.9 - https://phabricator.wikimedia.org/T298758 [14:57:29] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Michael Große (WMDE) - https://phabricator.wikimedia.org/T269610 (10Michael) Thank you here as well 🙂 [15:01:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: grid: factorized node creation cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753017 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [15:01:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: add cookbooks to scale the grid with each node type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753006 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [15:01:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: relocate some node-specific cookbooks 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753027 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [15:01:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: relocate start_instance_with_prefix cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753018 (https://phabricator.wikimedia.org/T298948) (owner: 10Arturo Borrero Gonzalez) [15:03:11] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [15:03:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM urldownloader1002.wikimedia.org [15:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM urldownloader1002.wikimedia.org [15:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:26] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS stretch [15:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:33] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2051.codfw.wmnet with OS stretch [15:11:07] (03PS1) 10Elukey: kafka: add check to test the Broker's TLS port [puppet] - 10https://gerrit.wikimedia.org/r/753738 [15:11:40] (03CR) 10jerkins-bot: [V: 04-1] kafka: add check to test the Broker's TLS port [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [15:12:11] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33225/console" [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [15:14:16] (03CR) 10Ayounsi: 
bgpalerter: Add email alerting and tweek default config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753445 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [15:15:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM seaborgium.wikimedia.org [15:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM seaborgium.wikimedia.org [15:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica1003.wikimedia.org [15:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:02] (03CR) 10Ottomata: kafka: add check to test the Broker's TLS port (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [15:21:17] !log hnowlan@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase2009.codfw.wmnet with OS buster [15:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica1003.wikimedia.org [15:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:48] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster [15:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:35] (03CR) 10Ayounsi: "I think we're going in the good direction but we need to check with the Analytics team on the overall approach, maybe on the list of hosts" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [15:25:48] (03CR) 10Elukey: [V: 03+1] kafka: add check to test the Broker's TLS port (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753738 
(owner: 10Elukey) [15:26:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ldap-replica1004.wikimedia.org [15:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:57] (03PS2) 10Elukey: kafka: add check to test the Broker's TLS port [puppet] - 10https://gerrit.wikimedia.org/r/753738 [15:27:00] (03CR) 10Ayounsi: bgpalerter: add new class to configure bgpalerter (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753102 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [15:28:19] (03CR) 10Ayounsi: [C: 03+1] profile::installserver::proxy: update squid template [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [15:28:31] (03PS3) 10Elukey: kafka: add check to test the Broker's TLS port [puppet] - 10https://gerrit.wikimedia.org/r/753738 [15:28:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ldap-replica1004.wikimedia.org [15:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:51] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8234221576 and 58443 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:29:09] is gerrit super slow only for me? [15:30:11] (03CR) 10Elukey: kafka: add check to test the Broker's TLS port (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [15:30:27] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (10jhathaway) @MoritzMuehlenhoff my initial thought is that since we know reverting the kernel solves the issue, we could do some short reboots into the new kernel to gather mo... 
[15:31:49] (03CR) 10Herron: [C: 03+2] kafka-logging: move to fixed UID/GID for kafka user [puppet] - 10https://gerrit.wikimedia.org/r/752677 (https://phabricator.wikimedia.org/T298883) (owner: 10Herron) [15:33:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [15:34:08] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, 10Patch-For-Review: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) On mwdebug1002 I have set the excimer time limit (in mediawiki's code), the envoy timeout, the apache timeout, the php-fpm request_terminate_... [15:34:19] PROBLEM - Check systemd state on kafka-logging2003 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:00] <_joe_> cwhite, herron ^^ [15:35:20] _joe_: that's me I'll downtime [15:35:27] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 136 and 103 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:35:28] (03CR) 10Ottomata: [C: 03+1] "Couple more nits but +1. Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [15:35:46] RECOVERY - Check systemd state on kafka-logging2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:51] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2051.codfw.wmnet with OS stretch [15:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:58] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2051.codfw.wmnet with OS stretch executed with errors: - elastic2051 (*... [15:37:32] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (10MoritzMuehlenhoff) >>! In T299107#7619785, @jhathaway wrote: > @MoritzMuehlenhoff my initial thought is that since we know reverting the kernel solves the issue, we could do... [15:38:46] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [15:40:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM flowspec1001.eqiad.wmnet [15:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:23] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10bking) More details on failure: `Exception raised while executing cookbook sre.hosts.reimage: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/spi... 
[15:42:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM flowspec1001.eqiad.wmnet [15:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [15:44:08] 10ops-eqiad, 10decommission-hardware, 10Kubernetes: decommission kubestage100[12]-eqiad - https://phabricator.wikimedia.org/T299142 (10Arnoldokoth) a:05wiki_willy→03None [15:44:15] (03CR) 10Ayounsi: O:rpkivalidator: add bgpalerter to rpki servers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [15:44:33] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (10jhathaway) great, I'll report back what I find. 
[15:47:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM aphlict1001.eqiad.wmnet [15:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aphlict1001.eqiad.wmnet [15:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:21] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2009.codfw.wmnet with OS buster [15:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:36] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS stretch [15:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:44] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch [15:51:10] (03PS1) 10Cwhite: logstash: rename user field first [puppet] - 10https://gerrit.wikimedia.org/r/753767 [15:53:20] (03PS1) 10Arturo Borrero Gonzalez: wmcs: move grid-dedicated code to its own package [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753769 [15:54:02] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [15:56:08] (03CR) 10jerkins-bot: [V: 04-1] wmcs: move grid-dedicated code to its own package [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753769 (owner: 10Arturo Borrero Gonzalez) [15:57:29] (03CR) 10Cwhite: [C: 03+2] logstash: rename user field first [puppet] - 10https://gerrit.wikimedia.org/r/753767 (owner: 10Cwhite) [15:57:36] (03PS2) 10Cwhite: logstash: rename user field first [puppet] - 10https://gerrit.wikimedia.org/r/753767 
[15:57:45] (03PS2) 10Arturo Borrero Gonzalez: wmcs: move grid-dedicated code to its own package [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753769 [16:00:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM cuminunpriv1001.eqiad.wmnet [16:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:43] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [16:01:07] (03CR) 10Btullis: [C: 03+2] Fail back the hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/753709 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [16:01:10] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:01:15] (03CR) 10Herron: [C: 03+1] logstash: rename user field first [puppet] - 10https://gerrit.wikimedia.org/r/753767 (owner: 10Cwhite) [16:02:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cuminunpriv1001.eqiad.wmnet [16:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:53] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:05:10] (03PS4) 10Elukey: kafka: add check to test the Broker's TLS port [puppet] - 10https://gerrit.wikimedia.org/r/753738 [16:05:57] (03CR) 10Elukey: "Thanks for the review Andrew!" 
[puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [16:06:26] (03CR) 10Jbond: [C: 03+1] "Seems reasonable to me" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/753731 (owner: 10Ayounsi) [16:07:38] (03CR) 10Jbond: [C: 03+2] bgpalerter: convert rest to a structurd type [puppet] - 10https://gerrit.wikimedia.org/r/753723 (owner: 10Jbond) [16:12:47] (03CR) 10Elukey: kafka: add check to test the Broker's TLS port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [16:15:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33227/console" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [16:27:09] (03PS6) 10Jbond: P:installserver::proxy: Add domain whitelist to proxy [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) [16:27:22] !log impor maps-deduped-tilelist 0.0.5 to buster-wikimedia/main T297408 [16:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:26] T297408: Install latest package for maps-deduped-tilelist (v0.0.4) - https://phabricator.wikimedia.org/T297408 [16:27:49] (03CR) 10David Caro: wmcs: move grid-dedicated code to its own package (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753769 (owner: 10Arturo Borrero Gonzalez) [16:28:29] (03CR) 10Jbond: [V: 03+1] "Added analytics to validate the white list" [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [16:29:17] (03CR) 10Jbond: "ill wait until next week to deploy this" [puppet] - 10https://gerrit.wikimedia.org/r/753016 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [16:30:03] (03PS7) 10Jbond: P:installserver::proxy: Add domain whitelist to proxy [puppet] - 10https://gerrit.wikimedia.org/r/753029 
(https://phabricator.wikimedia.org/T298087) [16:30:06] (03CR) 10David Caro: [C: 03+2] osm::usergrants: remove unused define [puppet] - 10https://gerrit.wikimedia.org/r/751162 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:31:09] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:31:45] (03CR) 10Ayounsi: P:installserver::proxy: Add domain whitelist to proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [16:34:04] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster [16:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:50] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [16:35:07] (03PS9) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [16:36:20] (03PS1) 10Ahmon Dancy: beta::autoupdater: Don't manipulate ${stage_dir}/php-master/LocalSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/753780 [16:36:44] (03PS2) 10Ahmon Dancy: beta::autoupdater: Don't manipulate ${stage_dir}/php-master/LocalSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/753780 [16:38:07] thanks herron for the move of kafka logging to fixed uid! [16:38:19] all clusters done now \o/ [16:38:39] woo woo! 
[16:38:44] np elukey [16:39:34] \o/ [16:40:46] (03PS10) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [16:41:51] (03PS2) 10Jbond: bgpalerter: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753724 [16:43:20] (03PS3) 10Jbond: bgpalerter: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753724 [16:47:08] (03PS1) 10Hnowlan: partman: remove reuse-test from restbase2009, use linux-swap [puppet] - 10https://gerrit.wikimedia.org/r/753781 (https://phabricator.wikimedia.org/T295375) [16:47:13] (03PS4) 10Jbond: bgpalerter: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753724 [16:50:59] (03PS5) 10Jbond: bgpalerter: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753724 [16:56:27] (03PS6) 10Jbond: bgpalerter: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753724 [16:57:36] 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) [16:58:16] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10wiki_willy) a:03Cmjohnson [16:58:52] (03PS7) 10Jbond: bgpalerter: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753724 [16:58:54] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Kubernetes: decommission kubestage100[12]-eqiad - https://phabricator.wikimedia.org/T299142 (10wiki_willy) a:03Cmjohnson [17:00:04] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220113T1700). Please do the needful. [17:00:04] No Gerrit patches in the queue for this window AFAICS. 
[17:01:36] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2009.codfw.wmnet with OS buster [17:01:36] (03CR) 10Jbond: [C: 03+2] bgpalerter: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753724 (owner: 10Jbond) [17:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:20] PROBLEM - Check systemd state on ms-fe1005 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2051.codfw.wmnet with OS stretch [17:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:22] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors: - elastic2051 (*... 
[17:09:48] RECOVERY - Check systemd state on ms-fe1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:52] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 312498560 and 642 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:10:56] (03PS1) 10Jbond: P:admin: pass through always group [puppet] - 10https://gerrit.wikimedia.org/r/753783 [17:11:08] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS stretch [17:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:16] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch [17:11:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33228/console" [puppet] - 10https://gerrit.wikimedia.org/r/753783 (owner: 10Jbond) [17:12:42] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:admin: pass through always group [puppet] - 10https://gerrit.wikimedia.org/r/753783 (owner: 10Jbond) [17:14:52] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1888262712 and 941 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:16:34] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2115316816 and 1044 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:17:18] (03PS1) 10Jbond: P:admin: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753786 [17:18:36] PROBLEM - Postgres Replication Lag on maps1010 is 
CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3276802312 and 1165 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:18:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33230/console" [puppet] - 10https://gerrit.wikimedia.org/r/753786 (owner: 10Jbond) [17:20:01] (03PS3) 10Ahmon Dancy: beta::autoupdater: Don't manipulate ${stage_dir}/php-master/LocalSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/753780 [17:20:03] (03PS1) 10Ahmon Dancy: beta::autoupdater: Remove more obsolete stuff after scap prep auto [puppet] - 10https://gerrit.wikimedia.org/r/753787 [17:22:05] (03PS1) 10DCausse: rdf-streaming-updater: add the reconciliation stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753788 (https://phabricator.wikimedia.org/T279541) [17:22:10] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2051.codfw.wmnet with OS stretch [17:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:17] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors: - elastic2051 (*... 
[17:22:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS stretch [17:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:42] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch [17:22:55] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7599815016 and 1370 seconds Hnowlan Replication broken by planet sync, resyncing. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:22:55] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4045681488 and 1222 seconds Hnowlan Replication broken by planet sync, resyncing. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:22:55] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5423366184 and 1279 seconds Hnowlan Replication broken by planet sync, resyncing. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:22:55] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7166381024 and 1385 seconds Hnowlan Replication broken by planet sync, resyncing. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:22:55] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6583343456 and 1370 seconds Hnowlan Replication broken by planet sync, resyncing. 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:27:18] (03PS2) 10Jbond: P:admin: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753786 [17:28:14] (03CR) 10Jbond: [C: 03+2] beta::autoupdater: Don't manipulate ${stage_dir}/php-master/LocalSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/753780 (owner: 10Ahmon Dancy) [17:28:31] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [17:28:31] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [17:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:18] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [17:29:18] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [17:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:41] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [17:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:59] (03PS3) 10Jbond: P:admin: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753786 [17:33:11] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [17:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:41] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [17:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:49] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [17:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, and 2 others: decommission 
kubestage100[12]-eqiad - https://phabricator.wikimedia.org/T299142 (10RobH) [17:42:47] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10bking) Looks like the server is trying to PXE boot from its 1 GB NICs, but it should be using its 10GB NICs. Guessing this can be fixed through the BIOS based on papaul's... [17:45:03] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1005.eqiad.wmnet with reason: requires resync after planet sync [17:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:05] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1005.eqiad.wmnet with reason: requires resync after planet sync [17:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:06] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (10Dzahn) >>! In T299107#7618862, @MoritzMuehlenhoff wrote: > Did you by chance see whether the "Check size of conntrack table" Icinga check alerted? I checked Icinga, nothin... 
[17:48:15] (03CR) 10Ottomata: [C: 03+2] Add network_internal_flows to refine and druid-load [puppet] - 10https://gerrit.wikimedia.org/r/748097 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [17:48:35] (03CR) 10Dzahn: [C: 03+2] beta::autoupdater: Don't manipulate ${stage_dir}/php-master/LocalSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/753780 (owner: 10Ahmon Dancy) [17:48:54] (03CR) 10Dzahn: "when you hit +2 and it's already merged :)" [puppet] - 10https://gerrit.wikimedia.org/r/753780 (owner: 10Ahmon Dancy) [17:49:34] Thanks mutante: I just added another to the mix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/753787/1 [17:49:59] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2051.codfw.wmnet with OS stretch [17:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:07] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors: - elastic2051 (*... 
[17:50:25] 10SRE, 10SRE-OnFire, 10Sustainability (Incident Followup): Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10Dzahn) [17:50:29] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (10Dzahn) [17:52:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS stretch [17:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:43] 10SRE-swift-storage: swift-repl failing due to auth failures - https://phabricator.wikimedia.org/T299122 (10Dzahn) Cool, thanks for this and already creating T299125 [17:52:45] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch [17:58:54] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220113T1800). Please do the needful. [18:02:22] (03PS1) 10Jdlrobson: Enable skin migration mode on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753793 [18:02:39] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (10Dzahn) Hey @mfossati, re: integration.wikmedia.org - I would say it's expected that you get "denied" for that specific URL (the configureSecurity part), I do too. But you should not see an... 
[18:04:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:04:40] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1598024 and 3929 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:05:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:07:40] (03PS1) 10Joal: Fix error in network_internal_flows druid job [puppet] - 10https://gerrit.wikimedia.org/r/753794 (https://phabricator.wikimedia.org/T263277) [18:30:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:32:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:32:30] (03PS4) 10Jbond: P:admin: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753786 [18:32:58] 10SRE, 10ops-eqiad, 10DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (10Jclark-ctr) @ayounsi i have finished cabling everything up in new cage let me know if you have any issues [18:34:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10Cmjohnson) [18:34:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_network_internal_flows_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:38:44] 10SRE, 10ops-eqiad, 10DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (10Jclark-ctr) {F34917288} [18:40:02] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:41:06] (03PS3) 10Cwhite: logstash: ensure dlq directory exists [puppet] - 10https://gerrit.wikimedia.org/r/753571 [18:42:26] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:48] (03PS5) 10Jbond: P:admin: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753786 [18:47:41] (03CR) 10jerkins-bot: [V: 04-1] P:admin: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753786 (owner: 10Jbond) [18:49:33] (03PS6) 10Jbond: P:admin: switch to using wmflib::dump_params [puppet] - 10https://gerrit.wikimedia.org/r/753786 [18:50:16] (03CR) 10Ottomata: [C: 03+2] Fix error in network_internal_flows 
druid job [puppet] - 10https://gerrit.wikimedia.org/r/753794 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [18:54:40] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:54:48] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:54:56] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:55:22] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:55:30] PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:55:52] PROBLEM - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:56:09] 10SRE, 10Discovery: Ban elastic2035 from prod elastic clusters - https://phabricator.wikimedia.org/T299151 (10bking) [18:56:12] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The following units failed: smartd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:13] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:56:18] PROBLEM - cassandra-a service on 
restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:56:22] PROBLEM - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:59:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:00:04] RoanKattouw and Urbanecm: That opportune time is upon us again. Time for a UTC evening backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220113T1900). [19:00:04] sharvani_, sharvani_, anoop, and Jdlrobson: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:34] hello [19:01:02] I can deploy today if any of our patch authors are around [19:01:05] hi taavi [19:01:14] (no patch, just saying hi) [19:01:24] I'm around [19:01:45] Mine is a beta cluster only patch [19:01:54] (03PS2) 10Majavah: Enable skin migration mode on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753793 (owner: 10Jdlrobson) [19:02:05] hauskatze: lol, I was very confused for a second :P [19:02:34] (03CR) 10Majavah: [C: 03+2] Enable skin migration mode on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753793 (owner: 10Jdlrobson) [19:03:29] (03Merged) 10jenkins-bot: Enable skin migration mode on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753793 (owner: 10Jdlrobson) [19:04:49] Jdlrobson: since the patch touches the non-beta IS.php I need to sync it out to prod (just in case so it doesn't cause any surprises on the next sync) [19:04:55] Yep sounds good [19:05:09] can it be tested in any other way than "the site loads using Vector"? [19:05:11] I can also verify that it didn't break anything in production on the debug server [19:05:21] I can run a test [19:05:27] it's basically looking at Special:Preferences [19:05:33] and checking nothing changed in the appearance tab [19:05:36] yes please, pulled to mwdebug1001 [19:05:49] taavi: looking! [19:06:19] taavi: LGTM [19:06:21] feel free to sync [19:06:39] thanks, syncing [19:07:57] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:753793|Enable skin migration mode on the beta cluster]] (duration: 01m 14s) [19:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:35] the beta cluster specific change should deploy automatically to beta within the next 30 mins or so, please ping me if it doesn't [19:08:38] anything else for you? [19:09:26] @taavi Will you be deploying the 2 patches I have for this window as well? 
[19:09:33] sharvani_: yeah, was just getting to it [19:09:46] can your patches be tested on a mwdebug server? [19:09:50] Thank you ... I am here to test when you do it.. [19:09:52] yes [19:09:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:19] ack, I'll ping you when the first one is available for testing [19:10:25] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [19:10:41] (03PS4) 10Majavah: Add event stream config for android.customize_toolbar_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747991 (https://phabricator.wikimedia.org/T297818) (owner: 10Sharvaniharan) [19:10:46] (03CR) 10Majavah: [C: 03+2] Add event stream config for android.customize_toolbar_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747991 (https://phabricator.wikimedia.org/T297818) (owner: 10Sharvaniharan) [19:10:58] Thanks taavi [19:10:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:11:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:24] (03Merged) 10jenkins-bot: Add event stream config for android.customize_toolbar_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747991 (https://phabricator.wikimedia.org/T297818) (owner: 10Sharvaniharan) [19:13:00] sharvani_: ok, the first patch is now available for testing on mwdebug1001 [19:13:28] 
Tested android.customize_toolbar_interaction [mediawiki-config]. Looking all good! thank you.. [19:14:03] thanks, syncing [19:15:17] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:747991|Add event stream config for android.customize_toolbar_interaction (T297818)]] (duration: 01m 12s) [19:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:21] T297818: Create MEP schema for Cutomization of toolbar and wire-up the app for it. - https://phabricator.wikimedia.org/T297818 [19:15:34] (03PS4) 10Majavah: Add event stream config for ios.notification_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747993 (https://phabricator.wikimedia.org/T290920) (owner: 10Sharvaniharan) [19:15:54] (03CR) 10Majavah: [C: 03+2] Add event stream config for ios.notification_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747993 (https://phabricator.wikimedia.org/T290920) (owner: 10Sharvaniharan) [19:17:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:23] (03Merged) 10jenkins-bot: Add event stream config for ios.notification_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747993 (https://phabricator.wikimedia.org/T290920) (owner: 10Sharvaniharan) [19:18:01] sharvani_: pulled the ios patch to mwdebug1001 too, can you test please? [19:18:23] Tested ios.notification_interaction [mediawiki-config] and this is also looking good. Thank you! [19:18:41] syncing that one too [19:18:53] anoop: hi, around? your patches are up next [19:19:17] yeah I am around [19:19:27] (03PS7) 10Majavah: Enable ArticlePlaceholder on dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753634 (https://phabricator.wikimedia.org/T298349) (owner: 10Minato826) [19:19:30] Thank you for deploying @taavi! 
[19:19:44] you're welcome [19:19:50] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:747993|Add event stream config for ios.notification_interaction (T290920)]] (duration: 01m 13s) [19:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:54] T290920: Create schema to track metrics for user notifications on iOS - ios_notification_interaction - https://phabricator.wikimedia.org/T290920 [19:20:23] (03CR) 10Majavah: [C: 03+2] Enable ArticlePlaceholder on dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753634 (https://phabricator.wikimedia.org/T298349) (owner: 10Minato826) [19:21:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:21:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:46] (03Merged) 10jenkins-bot: Enable ArticlePlaceholder on dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753634 (https://phabricator.wikimedia.org/T298349) (owner: 10Minato826) [19:22:34] anoop: your patch is now available for testing on mwdebug1001 [19:22:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:02] !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply on main [19:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:24] ok, thank you for deploying @taavi [19:25:39] !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: sync on main [19:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:08] anoop: to clarify, your patch is now 
only on a single debug server where you need to test it with the x-wikimedia-debug browser extension that it works properly before it's deployed to all the servers. are you familiar with the process or can I help somehow? [19:27:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:29:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:13] ArticlePlaceholder appears to be enabled on dag.wikipedia.org, though the on-wiki setup still needs to be performed [19:30:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:40] AntiComposite: yeah I see the extension is loaded, but I don't know what it's supposed to do (so can't really tell if it's horribly broken or not) [19:33:21] ain't it a Wikidata thing? ArticlePlaceholder? [19:33:37] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (10Platonides) @MoritzMuehlenhoff, did you see https://www.spinics.net/lists/stable/msg509296.html ? Apparently upstream identified the issue as 09e856d54bda5f288ef8437a90ab2b9... 
[19:34:11] yeah, it shows Wikidata data when pages don't exist [19:34:35] there's some on-wiki setup that has to be performed before it will work properly https://www.mediawiki.org/wiki/Extension:ArticlePlaceholder#Set-up [19:35:34] /away [19:35:54] I think it adds https://dag.wikipedia.org/wiki/Di%C5%8B'gahim:AboutTopic?entityid=Q42 (which seems to be as expected as the on-wiki setup hasn't been done yet) [19:36:45] yes, it adds that special page and modifies the Special:Search results [19:37:12] !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply on main [19:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:18] (03PS8) 10Southparkfan: Add WMCS specific cloud role for syslog server [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) [19:38:23] the Special:Search page looks as expected compared to https://eo.wikipedia.org/w/index.php?search=magnus+manske&title=Speciala%C4%B5o%3ASer%C4%89i&go=Go&ns0=1&safemode=1 [19:39:09] would of course be best to hear from anoop that everything is as they expect [19:39:10] thanks, that makes me reasonably confident it's safe enough to sync [19:39:28] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (10jhathaway) @Platonides that revert made it into 5.10.78, https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.78, so I don't believe that is the issue, since we were... 
[19:39:52] should probably note somewhere that the on-wiki setup should be completed before the extension is enabled [19:40:41] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:753634|Enable ArticlePlaceholder on dagwiki (T298349)]] (duration: 01m 13s) [19:40:43] !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: sync on main [19:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:44] T298349: Enable ArticlePlaceholder on dagwiki - https://phabricator.wikimedia.org/T298349 [19:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:06] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:41:24] (or even better, a special page should not rely on the existence of a template and lua module to produce useful output) [19:42:00] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2051.codfw.wmnet with OS stretch [19:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:08] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors: - elastic2051 (*... 
[19:42:16] I'll leave a note about the on-wiki setup on the phab task [19:43:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [19:43:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [19:43:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS stretch [19:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:37] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch [19:43:40] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:52] AntiComposite: thanks for your help! [19:43:58] anyone has anything else to deploy? [19:46:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:06] 10SRE, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) Since the latest deploy now content can be fetched from staging, codfw and eqiad, it is all gzipped inside the image, which should be much smaller now and content is no... [19:50:54] taavi: are you still around? 
[19:50:59] (03PS1) 10Jdlrobson: Restore use of useskin=vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753811 [19:51:43] My change rolled out to beta cluster and I realised there is a problem with it, it's broken https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page?useskin=vector no longer works [19:51:53] Jdlrobson: yes (although I was just about to leave) [19:52:30] (03PS2) 10Majavah: Restore use of useskin=vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753811 (owner: 10Jdlrobson) [19:52:44] (03CR) 10Majavah: [C: 03+2] Restore use of useskin=vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753811 (owner: 10Jdlrobson) [19:52:45] @taavi it's beta cluster only [19:52:49] so I think we can just sync it? [19:53:31] (03Merged) 10jenkins-bot: Restore use of useskin=vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753811 (owner: 10Jdlrobson) [19:53:37] I can just pull it to the production deployment host (so that the next person deploying does not get confused) and it'll get auto-deployed to deployment-prep [19:54:01] no need to sync it, since it doesn't touch any file that production servers read [19:54:26] taavi: as the next deployer up, i appreciate that [19:54:53] we're rolling group0/group1 and maybe group2 today :/ [19:54:59] Jdlrobson: done, the patch should get deployed to beta within the next 30 mins or so [19:55:07] thanks taavi and sorry to pull you back [19:55:11] dduvall: I'm done with deployments, best of luck with the train [19:55:23] thank you very much, taavi [19:56:08] (I have a long-requested new feature riding this train and can't wait to enable it when the train is stable :D) [19:59:27] oh nice :) [20:00:04] dduvall and twentyafterfour: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220113T2000). Please do the needful. 
[20:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:14] okey dokey. let's roll this thing [20:00:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:21] twentyafterfour: o/ [20:00:30] :) [20:01:03] I can help watch logstash [20:01:12] let's keep a close eye on the slow queries today. look out for INSERTs into pagelinks and templatelinks [20:01:28] shouldn't be a problem with that fix merged, but if we see it it's immediate rollback [20:01:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:01:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:43] (03PS1) 10Dduvall: group0 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753812 [20:03:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_network_internal_flows_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:45] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753812 (owner: 10Dduvall) [20:04:37] the new-errors dashboard is nice and clean. 
that's super helpful [20:05:19] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753812 (owner: 10Dduvall) [20:06:47] dduvall: we did log triage just a bit ago [20:07:11] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.17 refs T293958 [20:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:14] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [20:07:25] nice. that's the Lord's work right there [20:07:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:07:48] log triage that is [20:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:57] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=PUT https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:08:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:08:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:15] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:10:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:09] (03PS1) 10Dzahn: trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb [puppet] - 10https://gerrit.wikimedia.org/r/753813 (https://phabricator.wikimedia.org/T281538) [20:13:28] group0 looks alright so far [20:14:16] (03PS1) 10Dduvall: group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753814 [20:14:18] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753814 (owner: 10Dduvall) [20:14:24] twentyafterfour: rolling to group1 now [20:14:44] cool [20:14:59] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753814 (owner: 10Dduvall) [20:16:47] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.17 refs T293958 [20:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:51] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [20:17:54] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.17 refs T293958 (duration: 01m 06s) [20:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:04] everything still looks good [20:20:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:56] twentyafterfour: i'm seeing an increase of https://phabricator.wikimedia.org/T299149 [20:21:40] hmm why am I not seeing that at all [20:23:22] are you on 
mediawiki-new-errors? those are filtered there [20:24:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:24:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:34] yeah, they're filtered on mediawiki-new-errors, but the rate increase for commons/he wp/it wp is quite high [20:27:01] i don't feel comfortable promoting all wikis until we get more feedback on the task [20:28:59] !log rolling back wmf.17 from group1 due to a large increase in "Parser state cleared while parsing" across commons and group1 wikipedias (T293958, T299149) [20:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:03] T299149: MWException: Parser state cleared while parsing. Did you call Parser::parse recursively? 
- https://phabricator.wikimedia.org/T299149 [20:29:04] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958 [20:31:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:29] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.38.0-wmf.17" [20:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:13] (03PS1) 10Dduvall: Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753816 [20:32:16] (03CR) 10Dduvall: [C: 03+2] Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753816 (owner: 10Dduvall) [20:32:51] (03CR) 10Dzahn: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/694630 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [20:32:58] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.38.0-wmf.17 refs T293958" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753816 (owner: 10Dduvall) [20:37:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:37:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:49:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:39] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) [20:52:56] 10SRE, 10Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (10Dzahn) Hi @KartikMistry re: the question how to get the key to us: you can make a new file in your home dir on the deployment server, deploy1002, ch... [20:53:12] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10serviceops-radar, 10Release-Engineering-Team (Radar): contint2001.mgmt disappeared from Icinga - https://phabricator.wikimedia.org/T298861 (10hashar) 05Open→03Stalled I am marking this one stalled since that was to investigate why the hos... [20:53:15] (03PS1) 10Joal: Absent network_flows_internal druid jobs [puppet] - 10https://gerrit.wikimedia.org/r/753818 (https://phabricator.wikimedia.org/T263277) [20:53:51] (03CR) 10jerkins-bot: [V: 04-1] Absent network_flows_internal druid jobs [puppet] - 10https://gerrit.wikimedia.org/r/753818 (https://phabricator.wikimedia.org/T263277) (owner: 10Joal) [20:55:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:55:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:50] (03CR) 10Ottomata: kafka: add check to test the Broker's TLS port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753738 (owner: 10Elukey) [20:58:08] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: 
Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) @Papaul wrote: > The IDRAC on this server needs reset. Please c... [20:59:55] 10SRE, 10Gerrit, 10serviceops: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) [21:00:46] 10SRE, 10Gerrit, 10serviceops: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) There is now the new procurement ticket T299081 [21:01:17] (03CR) 10Dzahn: [C: 03+2] service/miscweb: switch state from monitoring_setup to production [puppet] - 10https://gerrit.wikimedia.org/r/694630 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:02:05] (03CR) 10Dzahn: [C: 03+2] "checked what this will be adding, authdns1001 exports new check to alert1001" [puppet] - 10https://gerrit.wikimedia.org/r/694630 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:02:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:58] (03CR) 10Herron: [C: 03+1] logstash: ensure dlq directory exists [puppet] - 10https://gerrit.wikimedia.org/r/753571 (owner: 10Cwhite) [21:03:51] (03CR) 10Dzahn: "[authdns1001:/etc/nagios/nrpe.d] $ /usr/local/lib/nagios/plugins/check_confd_template '/var/lib/gdnsd/discovery-miscweb.state'" [puppet] - 10https://gerrit.wikimedia.org/r/694630 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:10:36] 10SRE, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) after switching service state to production and running puppet on authdns1001 and alert1001 the new monitoring appeared in Icinga could confirm manually too and schedul... 
[21:11:44] 10SRE, 10Discovery: Ban elastic2035 from prod elastic clusters - https://phabricator.wikimedia.org/T299151 (10bking) Upon further review, the node is banned from both clusters already. Closing... [21:14:44] (03CR) 10Ahmon Dancy: [C: 04-1] "holding for now." [puppet] - 10https://gerrit.wikimedia.org/r/753787 (owner: 10Ahmon Dancy) [21:15:30] 10SRE, 10Discovery: Ban elastic2035 from prod elastic clusters - https://phabricator.wikimedia.org/T299151 (10bking) 05Open→03Resolved [21:20:57] (03CR) 10Dzahn: "exactly like I005cc0bbf25558a and service state is production now. verified icinga checks created and green on DNS servers" [dns] - 10https://gerrit.wikimedia.org/r/693968 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:24:18] (03PS3) 10Dzahn: Add discovery DNS for miscweb [dns] - 10https://gerrit.wikimedia.org/r/693968 (https://phabricator.wikimedia.org/T281538) [21:26:16] (03CR) 10Dzahn: [C: 03+2] Add discovery DNS for miscweb [dns] - 10https://gerrit.wikimedia.org/r/693968 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:28:33] (03CR) 10Dzahn: "[cumin1001:~] $ confctl --object-type discovery select 'dnsdisc=miscweb' get" [dns] - 10https://gerrit.wikimedia.org/r/693968 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [21:30:55] 10SRE, 10Patch-For-Review, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) Discovery DNS added! 
:) Now this works, too: ` [deploy1002:~] $ curl --compressed https://miscweb.discovery.wmnet:4111/bug10023.html -I ` [21:33:24] jouncebot: nowandnext [21:33:24] For the next 0 hour(s) and 26 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220113T2000) [21:33:24] In 2 hour(s) and 26 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220114T0000) [21:33:38] train is rolled back fyi [21:33:55] twentyafterfour: does that mean i can roll a beta-only config out now? [21:34:08] (happy to wait if not, it's not urgent at all) [21:34:23] I think so, dduvall, you aren't working on train currently are you? [21:35:33] urbanecm: I think it's clear [21:35:39] thanks [21:35:43] (03PS2) 10Urbanecm: wgGEMentorDashboardDeploymentMode should be alpha in all of beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753119 (https://phabricator.wikimedia.org/T298993) [21:35:48] (03CR) 10Urbanecm: [C: 03+2] wgGEMentorDashboardDeploymentMode should be alpha in all of beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753119 (https://phabricator.wikimedia.org/T298993) (owner: 10Urbanecm) [21:36:05] will be quick, just CI to merge, more or less :) [21:36:36] (03Merged) 10jenkins-bot: wgGEMentorDashboardDeploymentMode should be alpha in all of beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753119 (https://phabricator.wikimedia.org/T298993) (owner: 10Urbanecm) [21:37:01] * urbanecm done [21:37:03] thanks again twentyafterfour [21:37:46] (03PS2) 10Joal: Absent network_flows_internal druid jobs [puppet] - 10https://gerrit.wikimedia.org/r/753818 (https://phabricator.wikimedia.org/T263277) [21:39:22] twentyafterfour: holding the train for now [21:40:30] twentyafterfour: so wmf.17 is still blocked right? 
[21:40:39] yep
[21:42:06] ok thanks :)
[21:42:29] (03PS2) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[21:42:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[21:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:04] (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ curl --compressed https://miscweb.discovery.wmnet:4111/bug10023.html -I" [puppet] - 10https://gerrit.wikimedia.org/r/753813 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)
[21:44:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[21:44:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[21:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:51] 10SRE, 10Analytics-Radar, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762 (10nshahquinn-wmf) 05Resolved→03Declined This was declined rather than resolved.
[21:46:32] (03CR) 10jerkins-bot: [V: 04-1] Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott)
[21:47:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[21:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:24] !log running puppet on cp-ulsfo
[21:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:59] (03PS3) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[21:57:06] (03CR) 10jerkins-bot: [V: 04-1] Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott)
[22:00:15] !log dzahn@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=miscweb
[22:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:04] (03PS4) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[22:03:06] PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:05] (03CR) 10jerkins-bot: [V: 04-1] Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott)
[22:06:02] (03PS1) 10Dzahn: add miscweb to disc_desired_state.py [puppet] - 10https://gerrit.wikimedia.org/r/753846 (https://phabricator.wikimedia.org/T281538)
[22:06:59] (03PS1) 10Dzahn: Revert "trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb" [puppet] - 10https://gerrit.wikimedia.org/r/753827
[22:08:21] (03CR) 10jerkins-bot: [V: 04-1] Revert "trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb" [puppet] - 10https://gerrit.wikimedia.org/r/753827 (owner: 10Dzahn)
[22:13:32] (03PS5) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[22:13:34] (03PS2) 10Dzahn: Revert "trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb" [puppet] - 10https://gerrit.wikimedia.org/r/753827
[22:14:56] (03CR) 10Dzahn: [C: 03+2] Revert "trafficserver: switch static-bugzilla from ganeti-miscweb to k8s-miscweb" [puppet] - 10https://gerrit.wikimedia.org/r/753827 (owner: 10Dzahn)
[22:16:23] (03CR) 10jerkins-bot: [V: 04-1] Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott)
[22:18:54] !log dzahn@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=miscweb
[22:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:47] (03PS6) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[22:22:31] (03CR) 10jerkins-bot: [V: 04-1] Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott)
[22:23:18] (03PS7) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[22:26:28] (03CR) 10jerkins-bot: [V: 04-1] Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott)
[22:27:40] PROBLEM - DPKG on elastic2051 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[22:49:58] (03PS1) 10Bking: elasticsearch: fix package dependency error [puppet] - 10https://gerrit.wikimedia.org/r/753851 (https://phabricator.wikimedia.org/T299177)
[22:51:05] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency error [puppet] - 10https://gerrit.wikimedia.org/r/753851 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[22:53:32] (03PS8) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[23:08:24] (03PS1) 10Bking: elasticsearch: fix package dependency error [puppet] - 10https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177)
[23:08:58] (03PS1) 10Dduvall: In WikitextContentHandler always use getFreshParser() [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753828 (https://phabricator.wikimedia.org/T299149)
[23:11:48] TimStarling, dduvall I +2ed the patch, but it is stuck behind many other patches in zuul gate-and-submit .. so might be a while before it merges unless there is a way to reprioritize those merges.
[23:13:59] subbu: thanks for reviewing that, and thanks for the patch TimStarling
[23:15:46] it looks like gate-and-submit cleared out a bit just now
[23:17:05] * James_F Not for a good reason (spurious CI failures), but yes.
[23:17:47] * James_F Eurgh, also zuul-cloner is a bit brain-dead and re-scheduled something that it's already merged.
[23:18:15] But it's faster to leave it be to finish the run with the merged VE patch rather than cancel it to 'save' time, as it'd trigger a full reschedule of the whole pipeline.
[23:42:50] (03CR) 10Dduvall: [C: 03+2] In WikitextContentHandler always use getFreshParser() [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/753828 (https://phabricator.wikimedia.org/T299149) (owner: 10Dduvall)
[23:43:51] i'm going ahead with cr+2 on the backport just to speed things up
[23:43:58] jouncebot: now
[23:43:58] No deployments scheduled for the next 0 hour(s) and 16 minute(s)
[23:52:36] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:55:33] jouncebot: next
[23:55:33] In 0 hour(s) and 4 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220114T0000)
[23:55:54] Good thing that's empty.
[23:59:11] yep