[00:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210715T0000).
[00:00:40] !log phabricator update deployed.
[00:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:17] abwiktionary
[00:45:18] -----------------------------------------------------------------
[00:45:18] [2ff55b4cfe3904e0c8995da2] [no req] MWException: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php.
[00:45:18] Backtrace:
[00:45:18] from /srv/mediawiki/php-1.37.0-wmf.14/includes/cache/localisation/LocalisationCache.php(512)
[00:45:19] * Reedy squints
[00:46:04] Why does that only fail in foreachwiki
[00:49:40] o_o
[00:50:29] I guess I shouldn't run scripts on deploy hosts
[00:51:19] O-O
[00:59:45] (Abandoned) DannyS712: Avoid passing invalid offset to mb_strpos [extensions/AbuseFilter] (wmf/1.37.0-wmf.13) - https://gerrit.wikimedia.org/r/703902 (https://phabricator.wikimedia.org/T285978) (owner: Daimona Eaytoy)
[01:15:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:21:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:57] (CR) Krinkle: [C: +1] Uninstall Score on private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/704149 (https://phabricator.wikimedia.org/T257066) (owner: Legoktm)
[03:07:05] (PS1) Tim Starling: Don't try to delete non-existent rows when saving options.
[core] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/704618 (https://phabricator.wikimedia.org/T286521)
[03:07:59] (PS1) Tim Starling: Don't try to delete non-existent rows when saving options. [core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704619 (https://phabricator.wikimedia.org/T286521)
[03:22:09] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:32:25] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:32:40] (CR) Ppchelko: [C: +1] Don't try to delete non-existent rows when saving options. [core] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/704618 (https://phabricator.wikimedia.org/T286521) (owner: Tim Starling)
[03:32:50] (CR) Ppchelko: [C: +1] Don't try to delete non-existent rows when saving options. [core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704619 (https://phabricator.wikimedia.org/T286521) (owner: Tim Starling)
[03:53:42] (CR) jerkins-bot: [V: -1] Don't try to delete non-existent rows when saving options.
[core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704619 (https://phabricator.wikimedia.org/T286521) (owner: Tim Starling)
[03:54:23] (CR) Ppchelko: [C: +1] "recheck" [core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704619 (https://phabricator.wikimedia.org/T286521) (owner: Tim Starling)
[03:55:07] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[04:00:45] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[04:08:43] SRE, Traffic, SRE Observability (FY2021/2022-Q2), Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - https://phabricator.wikimedia.org/T286554 (lmata)
[04:11:36] SRE, SRE Observability, Traffic, Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - https://phabricator.wikimedia.org/T286554 (lmata)
[04:23:01] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:28:27] PROBLEM - SSH on mw1297.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:30:58] Request from - via cp1081.eqiad.wmnet, ATS/8.0.8
[04:30:58] Error: 502, Next Hop Connection Failed at 2021-07-15 03:59:20 GMT
[04:31:09] Hmm
[04:37:15] (CR) Tim Starling: [C: +2] Don't try to delete
non-existent rows when saving options. [core] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/704618 (https://phabricator.wikimedia.org/T286521) (owner: Tim Starling)
[04:37:20] (CR) Tim Starling: [C: +2] Don't try to delete non-existent rows when saving options. [core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704619 (https://phabricator.wikimedia.org/T286521) (owner: Tim Starling)
[04:42:19] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[04:43:25] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:53:24] (PS1) KartikMistry: Update cxserver to 2021-07-14-124232-production [deployment-charts] - https://gerrit.wikimedia.org/r/704659 (https://phabricator.wikimedia.org/T282369)
[04:55:26] (CR) Andrew Bogott: "I think this is the right idea to deal with T286675 but I'm having a hard time thinking it through. My maybe-not-totally-coherent thought" [puppet] - https://gerrit.wikimedia.org/r/704638 (https://phabricator.wikimedia.org/T286675) (owner: Bstorm)
[04:56:21] (Merged) jenkins-bot: Don't try to delete non-existent rows when saving options. [core] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/704618 (https://phabricator.wikimedia.org/T286521) (owner: Tim Starling)
[04:56:27] (Merged) jenkins-bot: Don't try to delete non-existent rows when saving options. [core] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704619 (https://phabricator.wikimedia.org/T286521) (owner: Tim Starling)
[05:34:58] * kart_ updating cxserver..
[05:36:04] (CR) KartikMistry: [C: +2] Update cxserver to 2021-07-14-124232-production [deployment-charts] - https://gerrit.wikimedia.org/r/704659 (https://phabricator.wikimedia.org/T282369) (owner: KartikMistry)
[05:38:29] (Merged) jenkins-bot: Update cxserver to 2021-07-14-124232-production [deployment-charts] - https://gerrit.wikimedia.org/r/704659 (https://phabricator.wikimedia.org/T282369) (owner: KartikMistry)
[05:41:50] !log kartik@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[05:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:41] !log kartik@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[05:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:05] !log kartik@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[05:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:09] !log Updated cxserver to 2021-07-14-124232-production (T282369, T284450)
[05:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:17] T282369: In ContentTranslation, some references are still going missing...
- https://phabricator.wikimedia.org/T282369
[05:50:18] T284450: Create Wikipedia Dagbani - https://phabricator.wikimedia.org/T284450
[06:30:01] RECOVERY - SSH on mw1297.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:33:12] (CR) Ryan Kemper: "Looks good, will leave to you to +2 when you're ready to merge" [cookbooks] - https://gerrit.wikimedia.org/r/702884 (owner: Volans)
[06:47:20] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.12/includes/user/UserOptionsManager.php: don't delete non-existent rows (T286521) (duration: 01m 07s)
[06:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:28] T286521: Deadlock found when trying to get lock (UserOptionsManager::saveOptionsQuery) - https://phabricator.wikimedia.org/T286521
[06:47:54] (PS1) Elukey: elasticsearch: improve check_elasticsearch_shard_size.py [puppet] - https://gerrit.wikimedia.org/r/704746
[06:48:47] dcausse: o/ morning! If you have a min later on can you tell me if --^ makes sense?
[06:49:36] (PS4) Martaannaj: Add config for updated PropertySuggester beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098)
[06:49:51] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.14/includes/user/UserOptionsManager.php: don't delete non-existent rows (T286521) (duration: 01m 06s)
[06:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:57] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:23] !log restart phabricator_clean_tmp_files.service on phab1001 - transient error (tmp files already cleaned up)
[06:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:26] (PS5) Martaannaj: Add config for updated PropertySuggester beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098)
[06:53:09] (CR) Muehlenhoff: [C: +2] Don't show Kerberos ticket info in general [puppet] - https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[07:16:13] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:21] !log remove /etc/rawdog/en/{state,state.lock} on planet1002 (following what rawdog suggested) due to corrupted files (backups available in /home/elukey/en)
[07:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:17] (PS1) Filippo Giunchedi: install_server: use Bullseye on thanos-fe2001 [puppet] - https://gerrit.wikimedia.org/r/704747 (https://phabricator.wikimedia.org/T285835)
[07:23:11] !log restart planet-update-en.service on planet1002
[07:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:07] (CR)
Filippo Giunchedi: [C: +2] install_server: use Bullseye on thanos-fe2001 [puppet] - https://gerrit.wikimedia.org/r/704747 (https://phabricator.wikimedia.org/T285835) (owner: Filippo Giunchedi)
[07:26:51] (CR) Muehlenhoff: systemdlogind-logout.py: Check login state prior to logout attempt (2 comments) [puppet] - https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[07:27:00] (PS3) Muehlenhoff: systemdlogind-logout.py: Check login state prior to logout attempt [puppet] - https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242)
[07:31:27] !log reimage thanos-fe2001 with bullseye - T285835
[07:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:34] T285835: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835
[07:35:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:40:05] PROBLEM - Thanos compact has disappeared from Prometheus discovery on alert1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview
[07:40:43] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:53] thanos compact is me
[07:43:52] (PS4) Volans: Use IcingaHosts instead of Icinga (search) [cookbooks] - https://gerrit.wikimedia.org/r/702884
[07:44:19] (CR) Volans: [C: +2] tox: remove flake8-import-order [software/spicerack] - https://gerrit.wikimedia.org/r/704344 (owner: Volans)
[07:48:09] !log updated bullseye d-i image for latest daily build T275873
[07:48:15] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:16] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873
[07:48:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:49:03] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/704556 (https://phabricator.wikimedia.org/T279309) (owner: Dzahn)
[07:49:37] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: backup1007, thanos-be1003, backup2005, dragonfly-supernode1001, labstore1006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[07:49:39] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: dragonfly-supernode1001, backup1007, backup2005, thanos-be1003, labstore1006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[07:49:39] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: thanos-be1003, backup1007, labstore1006, backup2005, dragonfly-supernode1001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[07:49:50] (Merged) jenkins-bot: tox: remove flake8-import-order [software/spicerack] - https://gerrit.wikimedia.org/r/704344 (owner: Volans)
[08:01:15] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:21] PROBLEM - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-08-14 08:01:46
+0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[08:02:21] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/
[08:02:47] PROBLEM - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:02:50] ACKNOWLEDGEMENT - HP RAID on ms-be2038 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T286698 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Ha
[08:02:50] aid_Information_Gathering
[08:02:53] SRE, ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T286698 (ops-monitoring-bot)
[08:05:10] (PS2) Jcrespo: mediabackup: Remove deleted directories and files [puppet] - https://gerrit.wikimedia.org/r/704599 (https://phabricator.wikimedia.org/T276442)
[08:07:53] (CR) Jcrespo: [C: +2] mediabackup: Remove deleted directories and files [puppet] - https://gerrit.wikimedia.org/r/704599 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[08:11:12] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mw[1414-1418].eqiad.wmnet with reason: change new eqiad appservers to canary https://phabricator.wikimedia.org/T279309
[08:11:14] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 1:00:00 on mw[1414-1418].eqiad.wmnet with reason: change new eqiad appservers to canary https://phabricator.wikimedia.org/T279309
[08:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[08:12:25] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw141[4-8].eqiad.wmnet
[08:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:00] (CR) Volans: "As discussed on IRC, see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[08:14:51] (PS2) JMeybohm: dragonfly: Trim newlines in config files [puppet] - https://gerrit.wikimedia.org/r/704360 (https://phabricator.wikimedia.org/T286054)
[08:14:53] (PS8) JMeybohm: kubernetes::*::worker: include dragonfly dfdaemon [puppet] - https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054)
[08:15:02] (CR) Jelto: [V: +1 C: +2] role::common::mediawiki::canary_appserver add new canary app server in eqiad [puppet] - https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: Jelto)
[08:15:12] (CR) Volans: [C: +2] Use IcingaHosts instead of Icinga (search) [cookbooks] - https://gerrit.wikimedia.org/r/702884 (owner: Volans)
[08:15:17] (PS5) Jelto: role::common::mediawiki::canary_appserver add new canary app server in eqiad [puppet] - https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309)
[08:15:19] (CR) Addshore: "> Patch Set 2:" [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: Martaannaj)
[08:15:45] (CR) Jelto: [V: +2 C: +2] role::common::mediawiki::canary_appserver add
new canary app server in eqiad [puppet] - https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: Jelto)
[08:16:47] (CR) DCausse: [C: +1] "makes sense, thanks!" [puppet] - https://gerrit.wikimedia.org/r/704746 (owner: Elukey)
[08:16:58] elukey: thanks ^ :)
[08:18:31] (Merged) jenkins-bot: Use IcingaHosts instead of Icinga (search) [cookbooks] - https://gerrit.wikimedia.org/r/702884 (owner: Volans)
[08:18:35] dcausse: thank you! merging :)
[08:18:40] (CR) Elukey: [C: +2] elasticsearch: improve check_elasticsearch_shard_size.py [puppet] - https://gerrit.wikimedia.org/r/704746 (owner: Elukey)
[08:18:42] hashar: ping re: https://gerrit.wikimedia.org/r/c/integration/config/+/701370
[08:19:51] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[08:22:44] kormat: in a call will follow up after it is done
[08:22:58] hashar: np, it's not urgent. just been sitting around for a few weeks :)
[08:26:37] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:57] ufff
[08:27:28] An error occurred while reading state from /etc/rawdog/en/feeds/847a7185.state.
[08:29:50] !log sudo rm /etc/rawdog/en/feeds/847a7185.state* on planet1002 (corrupted file) - backup in /home/elukey + restart planet-update-en.service
[08:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:23] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:07] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: REIMAGE
[08:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:17] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: REIMAGE
[08:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:48] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:35:43] (CR) Zabe: Adding e use square wordmark for trwikiquote (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: Juan90264)
[08:37:36] (CR) Zabe: "The wikimedia site logo should be a svg, not a png" [mediawiki-config] - https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: Juan90264)
[08:40:00] PROBLEM - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is CRITICAL: CRITICAL - Content not updated recently (172848 172800) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[08:41:50] SRE, Patch-For-Review, SRE Observability (FY2021/2022-Q1), User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (fgiunchedi) Ok so Bullseye d-i can't detect the link up on the broadcom 10G nic, which in the past meant we have to upgrade the NIC's
f...
[08:50:14] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.518e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[08:52:01] (CR) JMeybohm: [C: -1] "Apart from the if-guard comment. This LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[08:56:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 14 hosts with reason: Deploying schema change to s6 T278619
[08:56:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: Deploying schema change to s6 T278619
[08:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:50] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619
[08:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:44] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 13 hosts with reason: Deploying schema change to s5 T278619
[08:58:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: Deploying schema change to s5 T278619
[08:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:42] kormat: jbond: I am updating the puppet compiler job for https://gerrit.wikimedia.org/r/c/integration/config/+/701370 :)
[09:03:23] hashar: ack thanks
[09:03:56] RECOVERY - Thanos compact has disappeared from Prometheus discovery on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview
[09:04:20] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Deploying schema change to s2 T278619
[09:04:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Deploying schema change to s2 T278619
[09:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:26] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619
[09:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:47] (CR) JMeybohm: dragonfly: Trim newlines in config files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/704360 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm)
[09:09:54] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Deploying schema change to s7 T278619
[09:09:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Deploying schema change to s7 T278619
[09:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:00] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619
[09:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:00] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw141[4-8].eqiad.wmnet
[09:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:26] (PS3) Hashar: beta: add warning motd and link to term of uses [puppet] - https://gerrit.wikimedia.org/r/699207 (https://phabricator.wikimedia.org/T100837)
[09:12:02] RECOVERY - Thanos compact
has not run on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:15:04] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 18 hosts with reason: Deploying schema change to s7 T278619
[09:15:10] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 18 hosts with reason: Deploying schema change to s7 T278619
[09:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:14] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619
[09:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:24] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:38] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 18 hosts with reason: Deploying schema change to s1 T278619
[09:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:45] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 18 hosts with reason: Deploying schema change to s1 T278619
[09:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:02] (CR) Muehlenhoff: systemdlogind-logout.py: Check login state prior to logout attempt (1 comment) [puppet] - https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[09:19:05] (PS4) Muehlenhoff: systemdlogind-logout.py: Check login state prior to logout attempt [puppet] - https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242)
[09:20:31] (PS9) JMeybohm: kubernetes::*::worker:
include dragonfly dfdaemon [puppet] - https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054)
[09:22:16] (CR) JMeybohm: [C: +2] dragonfly::dfdaemon: Make profile and module ensureable [puppet] - https://gerrit.wikimedia.org/r/704318 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm)
[09:22:19] (CR) JMeybohm: [C: +2] dragonfly: Trim newlines in config files [puppet] - https://gerrit.wikimedia.org/r/704360 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm)
[09:22:24] (CR) JMeybohm: [C: +2] kubernetes::*::worker: include dragonfly dfdaemon [puppet] - https://gerrit.wikimedia.org/r/704322 (https://phabricator.wikimedia.org/T286054) (owner: JMeybohm)
[09:25:07] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 12 hosts with reason: Deploying schema change to s3 T278619
[09:25:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 12 hosts with reason: Deploying schema change to s3 T278619
[09:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:14] T278619: ipb_timestamp is varbinary(14) in old wikis while being binary(14) in the code since 2007 - https://phabricator.wikimedia.org/T278619
[09:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:56] PROBLEM - Thanos compact has disappeared from Prometheus discovery on alert1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview
[09:29:28] PROBLEM - SSH on mw1273.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:32:36] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: REIMAGE
[09:32:40] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:47] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: REIMAGE
[09:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:14] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:38:32] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[09:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:39:24] RECOVERY - Thanos compact has disappeared from Prometheus discovery on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview
[09:40:08] (PS3) Jelto: prometheus::ops add jobs and ferm rule to scrape gitlab metrics [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170)
[09:41:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:41:12] (03PS12) 10Elukey: Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) [09:41:15] (03PS5) 10Elukey: WIP - Add kubeflow's kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [09:42:29] (03CR) 10jerkins-bot: [V: 04-1] WIP - Add kubeflow's kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [09:42:33] (03CR) 10Volans: systemdlogind-logout.py: Check login state prior to logout attempt (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [09:44:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:46:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:47:53] (03PS13) 10Elukey: Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) [09:47:57] (03PS6) 10Elukey: WIP - Add kubeflow's kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [09:48:01] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Write certificate to /etc/dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/704755 (https://phabricator.wikimedia.org/T286054) [09:49:03] (03CR) 10jerkins-bot: [V: 04-1] WIP - Add kubeflow's kfserving chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [09:49:24] (03CR) 10JMeybohm: [C: 03+2] dragonfly::dfdaemon: Write certificate to /etc/dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/704755 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [09:50:12] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.518e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:50:21] (03PS5) 10Muehlenhoff: systemdlogind-logout.py: Check login state prior to logout attempt [puppet] - 10https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) [09:50:30] (03CR) 10Elukey: "All right I think that the chart is ready for a review. 
The only "custom" parts are defined in the values.yaml file, mostly related to min" [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [09:50:51] (03CR) 10Muehlenhoff: systemdlogind-logout.py: Check login state prior to logout attempt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [09:50:54] (03CR) 10Hnowlan: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30226/console" [puppet] - 10https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [09:54:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=thanos-compact site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:55:36] RECOVERY - Thanos compact has not run on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:56:35] (03PS1) 10JMeybohm: admin_ng: Fix name of rbac apiGroup in tiller-flink clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/704757 (https://phabricator.wikimedia.org/T264006) [09:59:19] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Fix name of rbac apiGroup in tiller-flink clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/704757 (https://phabricator.wikimedia.org/T264006) (owner: 10JMeybohm) [10:00:04] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210715T1000). 
[10:01:20] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:48] (03Merged) 10jenkins-bot: admin_ng: Fix name of rbac apiGroup in tiller-flink clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/704757 (https://phabricator.wikimedia.org/T264006) (owner: 10JMeybohm) [10:02:03] !log disableing puppet on maps* for 704394 [10:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:21] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [10:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:53] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [10:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:24] (03CR) 10Volans: [C: 03+1] "LGTM, ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [10:05:07] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30227/console" [puppet] - 10https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [10:06:23] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[10:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:05] (03PS8) 10Effie Mouzeli: profile::osm_master: add tilerator users for kubepod subnets [puppet] - 10https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) [10:12:04] (03CR) 10Effie Mouzeli: [C: 03+2] profile::osm_master: add tilerator users for kubepod subnets [puppet] - 10https://gerrit.wikimedia.org/r/704394 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [10:14:06] (03CR) 10Martaannaj: "> Patch Set 5:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [10:14:47] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Fix HTTPS_PROXY URI to actually use HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/704758 (https://phabricator.wikimedia.org/T286054) [10:16:07] (03CR) 10JMeybohm: [C: 03+2] dragonfly::dfdaemon: Fix HTTPS_PROXY URI to actually use HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/704758 (https://phabricator.wikimedia.org/T286054) (owner: 10JMeybohm) [10:16:13] (03CR) 10Muehlenhoff: "Thanks for the review, appreciated!" [puppet] - 10https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [10:16:16] (03CR) 10Muehlenhoff: [C: 03+2] systemdlogind-logout.py: Check login state prior to logout attempt [puppet] - 10https://gerrit.wikimedia.org/r/704584 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [10:16:49] moritzm: okay to merge systemdlogind-logout.py: Check login state prior to logout attempt (a33c12d8dd) ? [10:17:46] yes, please [10:18:06] done [10:20:41] (03CR) 10Klausman: [C: 03+1] Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [10:21:39] hashar: hurray, thanks! 
:) [10:26:08] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:08] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 10 hosts [10:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:36] kormat: hopefully it works as intended :] [10:26:42] I am out for lunch, be back later [10:26:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.idm.logout (exit_code=99) Logging Muehlenhoff out of all services on: 10 hosts [10:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:50] PROBLEM - Thanos compact has disappeared from Prometheus discovery on alert1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [10:30:12] RECOVERY - SSH on mw1273.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:31:35] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [10:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:44] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [10:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:52] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [10:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:56] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [10:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:17] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. 
[10:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:48] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [10:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:26] (03PS1) 10Muehlenhoff: logout: Catch RemoteExecutionError exception [cookbooks] - 10https://gerrit.wikimedia.org/r/704761 (https://phabricator.wikimedia.org/T283242) [10:36:35] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/704761 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [10:38:49] (03PS2) 10Muehlenhoff: logout: Catch RemoteExecutionError exception [cookbooks] - 10https://gerrit.wikimedia.org/r/704761 (https://phabricator.wikimedia.org/T283242) [10:39:59] (03CR) 10Ladsgroup: "> Patch Set 5: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/703912 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [10:40:06] (03CR) 10Volans: [C: 03+1] logout: Catch RemoteExecutionError exception (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/704761 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [10:41:54] !log move wikibase.queryService.ui.app to wikibase.queryService.ui.index.app - T272128 [10:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:01] T272128: Fix tracking for query service UI - https://phabricator.wikimedia.org/T272128 [10:54:48] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, feel free to add it to a backport window (https://wikitech.wikimedia.org/wiki/Backport_windows) :)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098) (owner: 10Martaannaj) [10:54:55] (03CR) 10Arturo Borrero Gonzalez: jessie deprecation: don't build jessie containers when rebuilding (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/703005 (owner: 10Bstorm) [10:55:22] (03CR) 
10Muehlenhoff: logout: Catch RemoteExecutionError exception (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/704761 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [10:56:23] !log commented out cron-spam entries on thanos-fe2001, puppet is disabled, thanos-store.service fails to start - T285835 [10:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:30] T285835: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 [10:59:33] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10Volans) @fgiunchedi I've commented the cron-spam entries in `/var/spool/cron/crontabs/root` because it was sending spam every minute. A... [10:59:55] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10Volans) [edited previous message as I hit submit too soon by mistake] [11:00:04] Amir1, Lucas_WMDE, apergos, and duesen: #bothumor My software never has bugs. It just develops random features. Rise for EU Backport and Config trainingYour patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210715T1100). [11:00:09] o/ [11:00:26] I'm here [11:00:32] no one is scheduled for training today [11:00:39] although we will have someone on the 22nd!! 
[11:00:42] nobody scheduled any patches either [11:00:43] no patches in the window [11:00:48] jinx :-P [11:00:59] nice and quiet window :) [11:01:02] heh [11:01:11] good, I'll get some other work done then ;-) [11:01:13] (03CR) 10Volans: [C: 03+1] "Reply inline, still +1 :)" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/704761 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [11:02:20] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:15] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on thanos-fe2001.codfw.wmnet with reason: Extending downtime post-reimage [11:05:16] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on thanos-fe2001.codfw.wmnet with reason: Extending downtime post-reimage [11:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:23] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10ops-monitoring-bot) Icinga downtime set by volans@cumin2002 for 3:00:00 1 host(s) and their services with reason: Extending downtime po... [11:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:59] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10Volans) I've added 3 more downtime hours to the host as the original one from the reimage is about to expire. 
[11:08:24] (03CR) 10Muehlenhoff: [C: 03+2] logout: Catch RemoteExecutionError exception (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/704761 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [11:09:15] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 10 hosts [11:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 10 hosts [11:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:45] I'm gonna quickly deploy this https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/704527/ [11:23:29] and then quickly leave the country [11:25:52] kormat: sheesh :D [11:26:20] (03CR) 10Ladsgroup: [C: 03+2] Make idwiki use protect mode of flaggedrevs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704527 (https://phabricator.wikimedia.org/T268317) (owner: 10Ladsgroup) [11:26:34] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:38] (03Merged) 10jenkins-bot: Make idwiki use protect mode of flaggedrevs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704527 (https://phabricator.wikimedia.org/T268317) (owner: 10Ladsgroup) [11:32:02] (03CR) 10Dzahn: [C: 03+2] site/conftool: turn mw1423,mw1424,mw1425 into API appservers [puppet] - 10https://gerrit.wikimedia.org/r/704556 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [11:32:09] (03PS2) 10Dzahn: site/conftool: turn mw1423,mw1424,mw1425 into API appservers [puppet] - 10https://gerrit.wikimedia.org/r/704556 (https://phabricator.wikimedia.org/T279309) [11:33:01] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, most recent changes just adjust minor thing based on jbond's previous comments so will merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/704559 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [11:33:07] (03CR) 10Cathal Mooney: [C: 03+2] librenms: Drop absented crons [puppet] - 10https://gerrit.wikimedia.org/r/704559 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [11:34:30] !log installing libuv1 security updates [11:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:08] !log restarting Turnilo to pick up libuv security update [11:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:33] (03PS1) 10Ladsgroup: flaggedrevs: Allow admins of idwiki to change stablesettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704773 (https://phabricator.wikimedia.org/T268317) [11:37:54] (03CR) 10Ladsgroup: [C: 03+2] flaggedrevs: Allow admins of idwiki to change stablesettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704773 (https://phabricator.wikimedia.org/T268317) (owner: 10Ladsgroup) [11:38:34] (03Merged) 10jenkins-bot: flaggedrevs: Allow admins of idwiki to change stablesettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704773 (https://phabricator.wikimedia.org/T268317) (owner: 10Ladsgroup) [11:40:17] !log restarting Etherpad to pick up libuv security update [11:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:53] topranks: can I merge your puppet chage on the master? 
i get multiple on merge [11:42:07] dropping absented crons looks harmless [11:43:28] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:704527|Make idwiki use protect mode of flaggedrevs (T268317)]] (duration: 01m 07s) [11:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:34] T268317: Reconfigure FlaggedRevs at the Indonesian Wikipedia - https://phabricator.wikimedia.org/T268317 [11:45:49] * mutante boldly merges it [11:47:17] !log mw1423, mw1424, mw1425 - initial puppet run, new API appservers going into production [11:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw[1423-1425].eqiad.wmnet with reason: new host [11:47:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw[1423-1425].eqiad.wmnet with reason: new host [11:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:06] !log restarting restbase1028-1030 to pick up libuv security update [11:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:30] mutante: yes, apologies, I was working through docs trying to make sure I followed the process 100%. [11:48:52] and scratching my head as to why "puppet-merge" wasn't showing any change. [11:49:20] (you got there ahead of me) [11:49:25] topranks: :) I just typed "multiple" a few seconds before that, it's merged [11:49:31] As you said the change is fairly simple, I will verify everything looks good. Thanks! [11:49:34] it seemed harmless enough to do that [11:49:46] removing code of already absented crons [11:49:50] are you converting to timers? [11:49:59] yep all good. 
[11:50:03] that's cool [11:50:14] Amir1 did the work but yes it's part of an ongoing project to do that across the board as I understand. [11:50:29] what terrible thing I have done [11:50:39] Which is cool, timers are nice IMO, much more visible than cronjobs. [11:50:47] yes, it is, I am involved in that, we want to get rid of all crons and had some progress bar for that :) thank you [11:50:56] Amir1: :) [11:51:05] Only good things Amir, I'm just giving credit where credit is due :) [11:51:37] all those crons you have already done, Amir :) thanks for that [11:51:52] mutante: haha, I just migrated librenms ones (8) and bother topranks for merging them [12:01:06] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:02] !log mw1423,mw1424,mw1425 - rebooting [12:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:35] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) >>! In T285835#7214436, @fgiunchedi wrote: > Ok so Bullseye d-i can't detect the link up on the broadcom 10G nic, which in... [12:18:28] ACKNOWLEDGEMENT - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is CRITICAL: CRITICAL - Content not updated recently (185910 172800) daniel_zahn https://phabricator.wikimedia.org/T285251 https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [12:19:08] ACKNOWLEDGEMENT - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 +0000 (expires in 29 days) daniel_zahn this is the global cert https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [12:22:00] mutante: o/ do you know anything about planet-update-en.service ? 
[12:22:13] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on mw2384 is CRITICAL: CRITICAL: 653 mismatched wikiversions daniel_zahn https://phabricator.wikimedia.org/T286463#7212231 https://wikitech.wikimedia.org/wiki/Application_servers [12:22:13] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2384 is CRITICAL: Host mw2384 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T286463#7212231 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:22:33] on planet1002, it has been failing.. I tried to re-run it removign corrupted files but I didn't really resolved anything [12:23:05] elukey: I know that it is broken and started debugging the other day but I dont know the answer yet why it broke [12:23:14] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10Dzahn) >>! In T286463#7212231, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/YM18pXoB1jz_IcWuQmuH} [2021-07-14T14:47:01Z] set mw2384 as inactiv... [12:23:16] it's not about restarting the service itself [12:23:27] it's about the service not being able to fetch external content [12:23:46] what corrupted files though? [12:24:14] mutante: the service is able to fetch from outside afaics, but rawdog complains about corrupted files (see journal log) [12:24:37] I tried to remove them and restart, the unit works for a bit and then fails again [12:25:06] elukey: ok, thanks, i'll look. I manually started the update process for debugging, could be related. dont' worry about it [12:25:28] ack thanks :) [12:25:36] (I logged all the actions taken in the SAL if you need it) [12:25:43] for some reason when I did that I saw it "timed out" trying to fetch the feeds [12:26:06] almost like I have been blocked by UA.. but it seemed all of them.. maybe about setting the proxies.. 
hrmm [12:26:09] ACK, ty [12:26:11] (03PS1) 10Filippo Giunchedi: swift: use addresses for memcached [puppet] - 10https://gerrit.wikimedia.org/r/704777 (https://phabricator.wikimedia.org/T285835) [12:26:14] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:18] heh [12:26:39] also that doesnt have to be eqiad, maybe i'll just try codfw to compare [12:27:38] ACKNOWLEDGEMENT - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service daniel_zahn https://phabricator.wikimedia.org/T285251 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:24] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30228/console" [puppet] - 10https://gerrit.wikimedia.org/r/704777 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [12:30:37] (03PS2) 10Filippo Giunchedi: swift: use addresses for memcached [puppet] - 10https://gerrit.wikimedia.org/r/704777 (https://phabricator.wikimedia.org/T285835) [12:31:09] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[12:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:59] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30229/console" [puppet] - 10https://gerrit.wikimedia.org/r/704777 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [12:32:08] ACKNOWLEDGEMENT - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 +0000 (expires in 29 days) daniel_zahn https://phabricator.wikimedia.org/T286713 https://phabricator.wikimedia.org/tag/phabricator/ [12:33:18] ACKNOWLEDGEMENT - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:34:27] !log mw1423, mw1424, mw1425 - scap pull [12:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:02] (03CR) 10Muehlenhoff: "The Debian-specific plumbing happens in init-systemd-helpers. 
There's no changelog entry which would explain why it's broken in Buster, bu" [puppet] - 10https://gerrit.wikimedia.org/r/701538 (owner: 10Jbond) [12:35:31] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1103.eqiad.wmnet with reason: Rebooting db1103 for kernel upgrade T273281 [12:35:31] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1103.eqiad.wmnet with reason: Rebooting db1103 for kernel upgrade T273281 [12:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:07] (03CR) 10Dzahn: "oops, this change actually removed mw1422 from site.pp (just noticed because Icinga told me there was no puppet run on it since earlier to" [puppet] - 10https://gerrit.wikimedia.org/r/704103 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [12:38:02] 10SRE: Update SSH key for samtar in ldap users - https://phabricator.wikimedia.org/T286714 (10Samtar) [12:40:24] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db[1102-1103,1120,1137].eqiad.wmnet,dbstore1005.eqiad.wmnet with reason: Rebooting db1103 (x1 primary) for kernel upgrade T273281 [12:40:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db[1102-1103,1120,1137].eqiad.wmnet,dbstore1005.eqiad.wmnet with reason: Rebooting db1103 (x1 primary) for kernel upgrade T273281 [12:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1103.eqiad.wmnet with reason: Rebooting for T273281 [12:40:43] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1103.eqiad.wmnet with reason: Rebooting for T273281 [12:40:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:12] 10SRE: Update SSH key for samtar in ldap users - https://phabricator.wikimedia.org/T286714 (10Majavah) SSH keys in LDAP (used for Cloud VPS/Toolforge) can be managed via [[ https://toolsadmin.wikimedia.org/profile/settings/ssh-keys/ | toolsadmin ]] or [[ https://wikitech.wikimedia.org/wiki/Special:Preferences#mw... [12:41:28] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2142.codfw.wmnet with reason: Rebooting db1103 (x1 primary) for kernel upgrade T273281 [12:41:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2142.codfw.wmnet with reason: Rebooting db1103 (x1 primary) for kernel upgrade T273281 [12:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:14] ACKNOWLEDGEMENT - MariaDB Replica IO: x1 #page on db2096 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1103.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1103.eqiad.wmnet (111 Connection refused) Kormat Rebooting db1103 (x1 primary in eqiad) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_re [12:44:51] (03PS1) 10Dzahn: site: re-add mw1422 to regex for appservers [puppet] - 10https://gerrit.wikimedia.org/r/704778 (https://phabricator.wikimedia.org/T279309) [12:46:05] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/704778 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [12:46:17] 10SRE, 10Traffic: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Vgutierrez) indeed, it's auto-renewed by acme-chief, we should tune those checks. 
The new cert has been issued already and it's being staged to avoid client-side clock skew issues: ` Ju... [12:46:29] (03CR) 10Dzahn: [C: 03+2] site: re-add mw1422 to regex for appservers [puppet] - 10https://gerrit.wikimedia.org/r/704778 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [12:46:54] (03PS1) 10Samtar: Update SSH key for samtar [puppet] - 10https://gerrit.wikimedia.org/r/704779 (https://phabricator.wikimedia.org/T286714) [12:47:12] 10SRE, 10Traffic, 10good first task: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Vgutierrez) p:05Triage→03Low [12:50:02] (03PS1) 10Filippo Giunchedi: hieradata: move thanos swift stats reports [puppet] - 10https://gerrit.wikimedia.org/r/704780 (https://phabricator.wikimedia.org/T285835) [12:50:04] 10SRE, 10Patch-For-Review: Update SSH key for samtar in ldap users - https://phabricator.wikimedia.org/T286714 (10Samtar) >>! In T286714#7214967, @Majavah wrote: > SSH keys in LDAP (used for Cloud VPS/Toolforge) can be managed via [[ https://toolsadmin.wikimedia.org/profile/settings/ssh-keys/ | toolsadmin ]] o... 
[12:51:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw142[3-5].eqiad.wmnet [12:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:23] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: move thanos swift stats reports [puppet] - 10https://gerrit.wikimedia.org/r/704780 (https://phabricator.wikimedia.org/T285835) (owner: 10Filippo Giunchedi) [12:51:29] (03PS2) 10Filippo Giunchedi: hieradata: move thanos swift stats reports [puppet] - 10https://gerrit.wikimedia.org/r/704780 (https://phabricator.wikimedia.org/T285835) [12:53:22] (03PS1) 10DCausse: [flink-session-cluster] Set .Values.main_app.port to 8081 [deployment-charts] - 10https://gerrit.wikimedia.org/r/704781 [12:53:41] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw142[3-5].eqiad.wmnet [12:53:42] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:17] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw142[3-5].eqiad.wmnet [12:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:34] !log mw1423, mw1424, mw1425 - pooled as new API servers [12:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw142[3-5].eqiad.wmnet [12:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:21] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T273281 [12:55:21] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T273281 [12:55:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:45] 10SRE, 10Traffic, 10good first task: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Dzahn) Should I just remove those checks or adjust them to stop caring about cert expiry? Or should they be kept but with lower threshold? If traffic doesn't need... [12:57:08] 10SRE, 10Patch-For-Review: Update SSH key for samtar in ldap users - https://phabricator.wikimedia.org/T286714 (10Vgutierrez) I'd recommend just removing any existing SSH keys for samtar, and adding them iff shell access to production is granted. [12:57:28] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [12:57:46] (03PS2) 10Effie Mouzeli: flink-session-cluster: Set .Values.main_app.port to 8081 [deployment-charts] - 10https://gerrit.wikimedia.org/r/704781 (owner: 10DCausse) [12:58:08] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10fgiunchedi) Leaving this here for tracking, I'm seeing a permission error from node-exporter on a Bullseye host (thanos-fe2001). It looks like the values in the file are bogus anyways... [12:58:50] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [12:58:54] (03CR) 10Effie Mouzeli: [C: 03+1] flink-session-cluster: Set .Values.main_app.port to 8081 [deployment-charts] - 10https://gerrit.wikimedia.org/r/704781 (owner: 10DCausse) [12:58:56] 10SRE, 10Patch-For-Review: Update SSH key for samtar in ldap users - https://phabricator.wikimedia.org/T286714 (10Samtar) >>! 
In T286714#7215003, @Vgutierrez wrote: > I'd recommend just removing any existing SSH keys for samtar, and adding them iff shell access to production is granted. Sounds reasonable! Wil... [13:01:51] (03PS3) 10DCausse: flink-session-cluster: Set .Values.main_app.port to 8081 [deployment-charts] - 10https://gerrit.wikimedia.org/r/704781 [13:01:59] (03PS2) 10Dzahn: conftool: remove mw1300, mw1301 [puppet] - 10https://gerrit.wikimedia.org/r/704289 (https://phabricator.wikimedia.org/T280203) [13:02:26] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:46] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Rebooting db1165 (s6 sanitarium master) for kernel upgrade T273281 [13:02:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Rebooting db1165 (s6 sanitarium master) for kernel upgrade T273281 [13:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1165.eqiad.wmnet with reason: Rebooting for T273281 [13:03:01] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1165.eqiad.wmnet with reason: Rebooting for T273281 [13:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:21] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw1422.eqiad.wmnet [13:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:36] 10SRE, 10Patch-For-Review: Remove SSH 
key for samtar in ldap users - https://phabricator.wikimedia.org/T286714 (10Samtar) [13:04:10] (03PS2) 10Samtar: Remove SSH key for samtar [puppet] - 10https://gerrit.wikimedia.org/r/704779 (https://phabricator.wikimedia.org/T286714) [13:05:26] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: Set .Values.main_app.port to 8081 [deployment-charts] - 10https://gerrit.wikimedia.org/r/704781 (owner: 10DCausse) [13:05:47] !log mw1413 - pooling, was depooled but for unknown reason, dont see it in SAL, looks ok, scap pulled [13:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1413.eqiad.wmnet [13:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:54] (03Merged) 10jenkins-bot: flink-session-cluster: Set .Values.main_app.port to 8081 [deployment-charts] - 10https://gerrit.wikimedia.org/r/704781 (owner: 10DCausse) [13:10:19] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw130[0-1].eqiad.wmnet [13:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:11] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff) >>! In T275873#7215007, @fgiunchedi wrote: > Leaving this here for tracking, I'm seeing a permission error from node-exporter on a Bullseye host (thanos-fe2001). It... [13:12:08] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[13:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:57] (03PS4) 10Muehlenhoff: debian::autostart: function to prevent services autostarting on install [puppet] - 10https://gerrit.wikimedia.org/r/701538 (owner: 10Jbond) [13:16:36] (03CR) 10jerkins-bot: [V: 04-1] debian::autostart: function to prevent services autostarting on install [puppet] - 10https://gerrit.wikimedia.org/r/701538 (owner: 10Jbond) [13:17:04] !log jelto@cumin1001 START - Cookbook sre.dns.netbox [13:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:20] (03PS3) 10Dzahn: DHCP/conftool: remove mw1300, mw1301 [puppet] - 10https://gerrit.wikimedia.org/r/704289 (https://phabricator.wikimedia.org/T280203) [13:20:44] (03PS1) 10Jgiannelos: Maps: filter out non-administrative boundaries on OSM import [puppet] - 10https://gerrit.wikimedia.org/r/704784 [13:20:47] !log jelto@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1300.eqiad.wmnet [13:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:09] !log mw1300, mw1301 - jobrunners going out of service, decom [13:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:36] RECOVERY - Thanos compact has disappeared from Prometheus discovery on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [13:23:05] (03CR) 10Jgiannelos: "I have already tested this with some parts of the world and it looks like it fixes the border issues we have. 
For now we can filter out on" [puppet] - 10https://gerrit.wikimedia.org/r/704784 (owner: 10Jgiannelos) [13:23:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:26:12] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:43] (03PS1) 10Dzahn: httpbb: use https in tests for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/704786 (https://phabricator.wikimedia.org/T267607) [13:30:38] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.518e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [13:31:20] (03CR) 10Dzahn: [C: 03+2] httpbb: use https in tests for noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/704786 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [13:33:05] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw130[0-1].eqiad.wmnet [13:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:28] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01829 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [13:34:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1300.eqiad.wmnet [13:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:39] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) 
cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1300.eqiad.wmnet` - m... [13:34:57] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [13:35:08] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) p:05Medium→03High [13:36:49] (03CR) 10Dzahn: httpbb: add tests for noc.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704297 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [13:37:14] (03CR) 10Dzahn: "[deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/noc/* --hosts mwmaint1002.eqiad.wmnet,mwmaint2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/704786 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [13:38:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1301.eqiad.wmnet [13:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:52] 10SRE, 10Traffic: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Aklapper) @Vgutierrez: A #good_first_task is a self-contained, non-controversial task with a clear approach. It should be well-described with pointers to help a completely new contributo... [13:42:12] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:42:46] 10SRE, 10Traffic: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Vgutierrez) >>! In T286713#7215004, @Dzahn wrote: > Should I just remove those checks or adjust them to stop caring about cert expiry? 
Or should they be kept but with lower threshold?... [13:44:02] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:45:25] 10SRE, 10Traffic: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Dzahn) ACK! So.. we still want to monitor if TLS works on planet and phabricator, we just don't want to deal with cert expiry anymore. We need to create a new checkcommand probably. On... [13:45:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps200[7-9].codfw.wmnet [13:45:44] 10SRE, 10Traffic: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Dzahn) a:03Dzahn [13:45:44] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps20(0[1-6]|10).codfw.wmnet [13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2021.codfw.wmnet,es[1020-1022].eqiad.wmnet with reason: Rebooting es1021 (es4 eqiad primary) for kernel upgrade T273281 [13:46:03] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2021.codfw.wmnet,es[1020-1022].eqiad.wmnet with reason: Rebooting es1021 (es4 eqiad primary) for kernel upgrade T273281 [13:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:17] 10SRE, 10Traffic: (adjust cert monitoring on planet and phabricator) Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (10Dzahn) [13:46:25] !log kormat@cumin1001 START - Cookbook 
sre.hosts.downtime for 1:30:00 on es1021.eqiad.wmnet with reason: Rebooting for T273281 [13:46:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1021.eqiad.wmnet with reason: Rebooting for T273281 [13:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:32] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps20(0[1-6]|10).codfw.wmnet [13:47:34] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps200[7-9].codfw.wmnet [13:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:51] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps2009.codfw.wmnet [13:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:47] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Jelto) [13:59:14] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10MoritzMuehlenhoff) [13:59:18] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Jelto) [13:59:21] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [14:04:34] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:58] !log 
dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1301.eqiad.wmnet [14:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:06] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1301.eqiad.wmnet` - m... [14:11:08] (03PS5) 10Muehlenhoff: debian::autostart: function to prevent services autostarting on install [puppet] - 10https://gerrit.wikimedia.org/r/701538 (owner: 10Jbond) [14:11:36] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es2023.codfw.wmnet,es[1023-1025].eqiad.wmnet with reason: Rebooting es1024 (es5 eqiad primary) for kernel upgrade T273281 [14:11:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es2023.codfw.wmnet,es[1023-1025].eqiad.wmnet with reason: Rebooting es1024 (es5 eqiad primary) for kernel upgrade T273281 [14:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:50] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1024.eqiad.wmnet with reason: Rebooting for T273281 [14:11:50] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1024.eqiad.wmnet with reason: Rebooting for T273281 [14:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:43] (03CR) 10Dzahn: [C: 03+2] "decom cookbook finished for both hosts" [puppet] - 10https://gerrit.wikimedia.org/r/704289 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [14:26:10] PROBLEM - Check systemd state on planet1002 
is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:36] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps200[7-9].codfw.wmnet [14:33:38] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps20(0[1-6]|10).codfw.wmnet [14:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:16] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [14:40:09] (03PS1) 10Muehlenhoff: Enable debian::autostart on sretest* for some tests [puppet] - 10https://gerrit.wikimedia.org/r/704795 [14:40:14] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:40:18] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps20(0[1-6]|10).codfw.wmnet [14:40:19] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps200[7-9].codfw.wmnet [14:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:38] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers 
maps2008.codfw.wmnet, maps2009.codfw.wmnet, maps2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:41:48] PROBLEM - kartotherian endpoints health on maps2008 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:42:00] RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [14:42:07] ^ that is me, reverted and should recover in a second [14:43:36] RECOVERY - kartotherian endpoints health on maps2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:44:00] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:44:57] (03PS1) 10Effie Mouzeli: conftool-data: add mwdebug discovery [puppet] - 10https://gerrit.wikimedia.org/r/704799 (https://phabricator.wikimedia.org/T283056) [14:47:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/704795 (owner: 10Muehlenhoff) [14:47:51] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1026.eqiad.wmnet with reason: Rebooting for T273281 [14:47:51] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1026.eqiad.wmnet with reason: Rebooting for T273281 [14:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:54] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T286698 (10fgiunchedi) ooof, 
this host looks like it has another failed bbu? we've seen the same issue not even a month ago in the same host. what do you think @papaul ? [14:51:48] (03PS1) 10Jelto: site: assign role gitlab to gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/704801 (https://phabricator.wikimedia.org/T285870) [14:52:38] (03PS7) 10Juan90264: Adding and use square wordmark for trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) [14:53:18] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) >>! In T285835#7214769, @Volans wrote: > I've added 3 more downtime hours to the host as the original one from the reimage... [14:54:25] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835 (10fgiunchedi) The host remains depooled but thanos-compact is running fine. Despite the fact that s3api doesn't seem to work out of the b... [14:54:29] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30230/console" [puppet] - 10https://gerrit.wikimedia.org/r/704801 (https://phabricator.wikimedia.org/T285870) (owner: 10Jelto) [14:57:12] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10fgiunchedi) >>! In T275873#7215066, @MoritzMuehlenhoff wrote: >>>! In T275873#7215007, @fgiunchedi wrote: >> Leaving this here for tracking, I'm seeing a permission error from node-exp... 
[15:01:02] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:10] !log installing nginx security updates on ms-fe* [15:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:04] !log temporary becoming admin on idwiki to debug T268317 [15:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:10] T268317: Reconfigure FlaggedRevs at the Indonesian Wikipedia - https://phabricator.wikimedia.org/T268317 [15:06:22] (03PS8) 10Juan90264: Adding square logo and wordmark for Wikimania [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) [15:06:42] (03CR) 10Juan90264: "> Patch Set 7:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [15:16:03] !log ladsgroup@deploy1002 Synchronized wmf-config/flaggedrevs.php: Config: [[gerrit:704773|flaggedrevs: Allow admins of idwiki to change stablesettings (T268317)]], try II (duration: 01m 05s) [15:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:10] T268317: Reconfigure FlaggedRevs at the Indonesian Wikipedia - https://phabricator.wikimedia.org/T268317 [15:19:15] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1027.eqiad.wmnet with reason: Rebooting for T273281 [15:19:15] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1027.eqiad.wmnet with reason: Rebooting for T273281 [15:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:46] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:27:49] mutante: not sure if you were already looking at it or know more about the above ^^^ [15:28:05] seems to be flapping since this morning [15:29:31] failing to get some feed from what I can tell from a 10s look [15:35:38] was that maybe recently switched to a systemd timer so that such errors are now failing more visibly as compared to a past cron? [15:36:16] could be [15:37:28] (03PS1) 10Hnowlan: maps: disable tilerator in the new cluster [puppet] - 10https://gerrit.wikimedia.org/r/704827 (https://phabricator.wikimedia.org/T269582) [15:38:47] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30231/console" [puppet] - 10https://gerrit.wikimedia.org/r/704827 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [15:39:07] (03CR) 10Jgiannelos: [C: 03+1] maps: disable tilerator in the new cluster [puppet] - 10https://gerrit.wikimedia.org/r/704827 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [15:40:12] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] maps: disable tilerator in the new cluster [puppet] - 10https://gerrit.wikimedia.org/r/704827 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [15:41:52] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Kubernetes, 10Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (10elukey) 05Open→03Resolved a:03elukey istio bootstrapped, everything worked nicely, thanks a lot to all that... 
[15:47:01] (03CR) 10Filippo Giunchedi: "LGTM overall" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [15:50:01] (03PS1) 10Elukey: profile::kubernetes::master: add comments and improve hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/704831 (https://phabricator.wikimedia.org/T285927) [15:51:12] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1028.eqiad.wmnet with reason: Rebooting for T273281 [15:51:12] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1028.eqiad.wmnet with reason: Rebooting for T273281 [15:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30232/console" [puppet] - 10https://gerrit.wikimedia.org/r/704831 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [15:53:08] (03CR) 10Elukey: "Added also Brooke since this code is probably used in the cloud world, better safe than sorry and double check in any realm :)" [puppet] - 10https://gerrit.wikimedia.org/r/704831 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [15:56:18] (03PS1) 10JMeybohm: flink-session-cluster: Include discovery and kafka egress helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/704833 (https://phabricator.wikimedia.org/T265526) [15:56:30] (03CR) 10Jcrespo: [C: 04-1] "Thank you for the quick review! 
I was happy to find that no 3d party software was required for exporting (benefits of cloud-ready software" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704600 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [15:57:01] (03CR) 10jerkins-bot: [V: 04-1] flink-session-cluster: Include discovery and kafka egress helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/704833 (https://phabricator.wikimedia.org/T265526) (owner: 10JMeybohm) [15:58:18] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:50] PROBLEM - tilerator on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [16:00:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1029.eqiad.wmnet with reason: Rebooting for T273281 [16:00:03] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1029.eqiad.wmnet with reason: Rebooting for T273281 [16:00:05] jbond42 and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210715T1600). 
[16:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:56] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:16] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:17] (03PS1) 10DCausse: flink-session-cluster: add kafka & discovery to egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/704836 [16:15:20] PROBLEM - tilerator on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [16:17:33] (03PS2) 10DCausse: flink-session-cluster: add kafka & discovery to egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/704836 [16:19:29] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2007.codfw.wmnet [16:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:52] (03CR) 10Bstorm: "> Patch Set 1: -Verified" [puppet] - 10https://gerrit.wikimedia.org/r/704831 (https://phabricator.wikimedia.org/T285927) (owner: 10Elukey) [16:20:00] ACKNOWLEDGEMENT - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Tilerator intentionally disabled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:00] ACKNOWLEDGEMENT - tilerator on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 6534: Connection refused Hnowlan Tilerator intentionally disabled. 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [16:20:00] ACKNOWLEDGEMENT - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Tilerator intentionally disabled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:00] ACKNOWLEDGEMENT - tilerator on maps2008 is CRITICAL: connect to address 10.192.48.165 and port 6534: Connection refused Hnowlan Tilerator intentionally disabled. https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [16:20:01] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Tilerator intentionally disabled. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:02] ACKNOWLEDGEMENT - tilerator on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 6534: Connection refused Hnowlan Tilerator intentionally disabled. https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [16:20:07] (03CR) 10DCausse: "not sure if this will fully fix the swift issue but looking at other charts I think I missed that part of the network policies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/704836 (owner: 10DCausse) [16:21:17] (03Abandoned) 10DCausse: flink-session-cluster: add kafka & discovery to egress network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/704836 (owner: 10DCausse) [16:21:24] dcausse: I see that you are having fun with kubernetes :D [16:21:34] :P [16:21:58] ACKNOWLEDGEMENT - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Tilerator intentionally disabled via puppet. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:58] ACKNOWLEDGEMENT - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Tilerator intentionally disabled via puppet. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:58] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Tilerator intentionally disabled via puppet. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:40] (03PS1) 10Zabe: Add return value for the case of no exception [extensions/EventLogging] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/704811 (https://phabricator.wikimedia.org/T286611) [16:24:06] (03PS1) 10Zabe: Add return value for the case of no exception [extensions/EventLogging] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704812 (https://phabricator.wikimedia.org/T286611) [16:28:10] (03PS2) 10JMeybohm: flink-session-cluster: Include discovery and kafka egress helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/704833 (https://phabricator.wikimedia.org/T265526) [16:28:12] (03PS1) 10JMeybohm: Rakefile: Fix undefined error variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/704837 [16:29:05] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: fix network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/704838 [16:29:23] (03CR) 10jerkins-bot: [V: 04-1] flink-session-cluster: Include discovery and kafka egress helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/704833 (https://phabricator.wikimedia.org/T265526) (owner: 10JMeybohm) [16:36:01] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:39:44] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: 
name=maps200[7-9].codfw.wmnet [16:39:46] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps20(0[1-6]|10).codfw.wmnet [16:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:28] (03PS1) 10Ottomata: Bump refine refinery-job version to 0.1.15 [puppet] - 10https://gerrit.wikimedia.org/r/704842 (https://phabricator.wikimedia.org/T271232) [16:43:02] !log otto@deploy1002 Started deploy [analytics/refinery@7a673c9]: Deploy refinery-source 0.1.15 with fixes for Refine jobs [16:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:27] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:56] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/704638 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [16:45:06] (03CR) 10DCausse: "thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/704833 (https://phabricator.wikimedia.org/T265526) (owner: 10JMeybohm) [16:45:31] PROBLEM - kartotherian endpoints health on maps2008 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [16:45:59] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [16:46:03] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2007.codfw.wmnet, maps2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:46:14] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps20(0[1-6]|10).codfw.wmnet [16:46:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps200[7-9].codfw.wmnet [16:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:33] rolled back to resolve the above [16:46:51] PROBLEM - kartotherian endpoints health on maps2009 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [16:47:33] RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [16:47:51] PROBLEM - kartotherian endpoints health on maps2007 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a 
response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [16:48:29] RECOVERY - kartotherian endpoints health on maps2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [16:48:57] RECOVERY - kartotherian endpoints health on maps2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [16:49:27] RECOVERY - kartotherian endpoints health on maps2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [16:50:12] (03CR) 10Andrew Bogott: ">" [puppet] - 10https://gerrit.wikimedia.org/r/704638 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [16:52:33] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:54:43] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/704638 (https://phabricator.wikimedia.org/T286675) (owner: 10Bstorm) [16:56:21] (03PS3) 10JMeybohm: flink-session-cluster: Include discovery and kafka egress helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/704833 (https://phabricator.wikimedia.org/T265526) [16:56:23] (03PS1) 10JMeybohm: common_templates: Don't fail if kafka.allowed_clusters is not defined [deployment-charts] - 10https://gerrit.wikimedia.org/r/704843 [16:57:53] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:58:59] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:00:05] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210715T1700). [17:00:23] !log otto@deploy1002 Finished deploy [analytics/refinery@7a673c9]: Deploy refinery-source 0.1.15 with fixes for Refine jobs (duration: 17m 21s) [17:00:27] !log robh@cumin1001 START - Cookbook sre.dns.netbox [17:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:41] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:02:03] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:02:47] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:02:53] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:51] !log otto@deploy1002 Started deploy [analytics/refinery@7a673c9] (hadoop-test): Deploy refinery-source 0.1.15 to hadoop-test with fixes for Refine jobs [17:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:24] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:31] !log otto@deploy1002 Finished deploy [analytics/refinery@7a673c9] (hadoop-test): Deploy refinery-source 0.1.15 to hadoop-test with fixes for Refine jobs (duration: 05m 41s) [17:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:36] (03CR) 10JMeybohm: [C: 03+1] "LGTM. Adding Filippo for second opinion and awareness." 
[puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[17:21:05] (PS1) Bstorm: openstack galera: set monitor on failover [puppet] - https://gerrit.wikimedia.org/r/704846 (https://phabricator.wikimedia.org/T286675)
[17:24:26] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (RobH)
[17:25:45] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:28:06] (CR) Ottomata: [C: +2] Bump refine refinery-job version to 0.1.15 [puppet] - https://gerrit.wikimedia.org/r/704842 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[17:57:28] (CR) RLazarus: [C: +1] httpbb: use https in tests for noc.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/704786 (https://phabricator.wikimedia.org/T267607) (owner: Dzahn)
[18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning backport window (Your patch may or may not be deployed at the sole discretion of the deployer) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210715T1800).
[18:00:04] MatmaRex: A patch you scheduled for Morning backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:09] hi
[18:01:59] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:07:48] is anyone available for the backport deployment?
[18:08:09] MatmaRex: I can
[18:08:11] * Reedy looks
[18:08:23] thanks
[18:09:00] (CR) Reedy: [C: +2] Add return value for the case of no exception [extensions/EventLogging] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/704811 (https://phabricator.wikimedia.org/T286611) (owner: Zabe)
[18:09:02] (CR) Reedy: [C: +2] Add return value for the case of no exception [extensions/EventLogging] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704812 (https://phabricator.wikimedia.org/T286611) (owner: Zabe)
[18:09:08] time to wait for jerkins
[18:11:05] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (RobH)
[18:15:23] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:22:40] SRE, MW-on-K8s, serviceops, Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (dduvall) >>! In T285232#7199870, @Joe wrote: > So after some more scavenging, We need the following directories to...
[18:26:23] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:27:29] (Merged) jenkins-bot: Add return value for the case of no exception [extensions/EventLogging] (wmf/1.37.0-wmf.12) - https://gerrit.wikimedia.org/r/704811 (https://phabricator.wikimedia.org/T286611) (owner: Zabe)
[18:27:31] (Merged) jenkins-bot: Add return value for the case of no exception [extensions/EventLogging] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704812 (https://phabricator.wikimedia.org/T286611) (owner: Zabe)
[18:28:14] Reedy: ^
[18:30:35] (CR) Ottomata: [C: +1] Enable kerberos for btullis [puppet] - https://gerrit.wikimedia.org/r/704562 (https://phabricator.wikimedia.org/T285754) (owner: Btullis)
[18:31:12] MatmaRex: cheers for the ping
[18:32:06] Reedy: wasn't me, you fell victim to tab autocompletion again. there's a lot of us with "ma-" usernames.
[18:32:12] thanks for deploying
[18:32:36] !log robh@cumin1001 START - Cookbook sre.dns.netbox
[18:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:13] !log reedy@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/EventLogging/includes/JsonSchemaHooks.php: T286611 (duration: 01m 07s)
[18:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:20] T286611: TypeError: Return value of JsonSchemaHooks::onEditFilterMergedContent() must be of the type boolean, none returned - https://phabricator.wikimedia.org/T286611
[18:35:26] !log reedy@deploy1002 Synchronized php-1.37.0-wmf.14/extensions/EventLogging/includes/JsonSchemaHooks.php: T286611 (duration: 01m 06s)
[18:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:18] MatmaRex: fixed ;P
[18:37:01] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:37:03] thanks Reedy
[18:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:53] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (RobH) >>! In T282484#7205448, @Volans wrote: > P.S. the cable ID `23000064` looks like a potential typo. Not a typo, we just started ordering pre-serial-labeled network cables. Since...
[18:49:13] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:52:50] (Abandoned) Ottomata: eventgate-analytics - set num_workers: 0 [deployment-charts] - https://gerrit.wikimedia.org/r/704588 (https://phabricator.wikimedia.org/T272714) (owner: Ottomata)
[19:00:04] dancy and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210715T1900).
[19:01:07] (03PS1) 10Ottomata: eventgate-analytics bump to 2021-07-15-185242-production with prom-client fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/704853 (https://phabricator.wikimedia.org/T272714) [19:02:27] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:34] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics bump to 2021-07-15-185242-production with prom-client fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/704853 (https://phabricator.wikimedia.org/T272714) (owner: 10Ottomata) [19:05:52] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [19:05:53] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [19:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:43] (03PS1) 10RobH: pc101[1-4] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/704854 (https://phabricator.wikimedia.org/T282484) [19:09:14] (03CR) 10RobH: [C: 03+2] pc101[1-4] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/704854 (https://phabricator.wikimedia.org/T282484) (owner: 10RobH) [19:16:03] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:16:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10RobH) [19:16:39] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10RobH) [19:17:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - 
https://phabricator.wikimedia.org/T282484 (10RobH) [19:23:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Volans) >>! In T282484#7215815, @RobH wrote: >>>! In T282484#7205448, @Volans wrote: >> P.S. the cable ID `23000064` looks like a potential typo. > > Not a typo,... [19:24:09] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [19:24:28] (03PS1) 10Ottomata: eventgate-analytics - fix typo in image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/704855 [19:24:43] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - fix typo in image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/704855 (owner: 10Ottomata) [19:25:59] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:35] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [19:26:36] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[19:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:50] !log volker-e@deploy1002 Started deploy [design/style-guide@eebdc4d]: Deploy design/style-guide: eebdc4d “Visual style – Icons”: Add Figma colors & icons file as source of truth (#484) [19:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:56] !log volker-e@deploy1002 Finished deploy [design/style-guide@eebdc4d]: Deploy design/style-guide: eebdc4d “Visual style – Icons”: Add Figma colors & icons file as source of truth (#484) (duration: 00m 05s) [19:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['pc1011.eqiad.wmnet', 'pc1012.eqiad.wmnet', 'pc1013.eqiad.... [19:35:45] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [19:40:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:43:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:45:41] !log nskaggs@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [19:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:04] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1011.eqiad.wmnet with reason: REIMAGE [19:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:05] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1012.eqiad.wmnet with reason: REIMAGE [19:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:13] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1011.eqiad.wmnet with reason: REIMAGE [19:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:22] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1012.eqiad.wmnet with reason: REIMAGE [19:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:32] Rolling train forward to group2 [19:51:35] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1013.eqiad.wmnet with reason: REIMAGE [19:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:29] (03PS1) 10Ahmon Dancy: group2 wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704859 [19:52:31] (03CR) 10Ahmon Dancy: [C: 03+2] group2 wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704859 (owner: 10Ahmon Dancy) [19:53:11] (03Merged) 10jenkins-bot: group2 wikis to 1.37.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704859 (owner: 10Ahmon Dancy) [19:53:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1013.eqiad.wmnet with reason: REIMAGE 
[19:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:41] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.37.0-wmf.14 [19:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:32] RECOVERY - Ensure local MW versions match expected deployment on mw2384 is OK: OKAY: Not alerting due to fresh production wikiversions: 973 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:07:51] !log nskaggs@cumin1001 Added views for new wiki: shiwiki T284928 [20:07:51] !log nskaggs@cumin1001 END (PASS) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=0) [20:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:59] T284928: Prepare and check storage layer for shiwiki - https://phabricator.wikimedia.org/T284928 [20:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:49] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ time mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=cswiki # [20:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:00] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ time mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=cswiki # T285811 [20:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:06] T285811: Mentee overview module: Run updateMenteeData.php regularly - https://phabricator.wikimedia.org/T285811 [20:16:06] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:04] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: kubernetes2002, kubernetes2001, cloudmetrics1002, thanos-be1003, labstore1006, 
kubernetes1012 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [20:17:04] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: kubernetes2001, cloudmetrics1002, kubernetes1012, thanos-be1003, kubernetes2002, labstore1006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [20:26:13] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ time mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=bnwiki # T285811 [20:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:20] T285811: Mentee overview module: Run updateMenteeData.php regularly - https://phabricator.wikimedia.org/T285811 [20:33:46] (03PS1) 10RLazarus: puppetmaster: Stop commits to the private repo with empty messages [puppet] - 10https://gerrit.wikimedia.org/r/704861 [20:41:32] (03CR) 10DannyS712: puppetmaster: Stop commits to the private repo with empty messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704861 (owner: 10RLazarus) [20:41:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc1014.eqiad.wmnet'] ` Of which those **FAILED**: ` ['pc1014.eqiad.wmnet'] ` [20:44:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` pc1014.eqiad.wmnet ` The log can be found in `/var/log/wmf-... 
[20:44:15] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ time mwscript extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki=viwiki # T285811 [20:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:23] T285811: Mentee overview module: Run updateMenteeData.php regularly - https://phabricator.wikimedia.org/T285811 [20:47:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10RobH) [20:51:06] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:34] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (14) node(s) change every puppet run: kubernetes2010, kubernetes1012, labstore1006, kubernetes1007, thanos-be1003, kubernetes2007, kubernetes2008, kubernetes1013, kubernetes1016, kubernetes2014, kubernetes1015, kubernetes2016, cloudmetrics1002, kubernetes2012 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run [20:53:09] (03CR) 10RLazarus: puppetmaster: Stop commits to the private repo with empty messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704861 (owner: 10RLazarus) [20:59:01] (03CR) 10DannyS712: puppetmaster: Stop commits to the private repo with empty messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/704861 (owner: 10RLazarus) [21:02:14] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:22] PROBLEM - Ensure local MW versions match expected deployment on mw2384 is CRITICAL: CRITICAL: 973 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [21:17:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) 
rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (RobH)
[21:19:05] SRE, ops-eqiad, DC-Ops, Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (RobH) a:RobH→Cmjohnson Assigning this to Chris to check the network connection for pc1014. checklist has been updated for that host, and it should be goo...
[21:22:38] SRE, LDAP, Patch-For-Review: Remove SSH key for samtar in ldap users - https://phabricator.wikimedia.org/T286714 (Peachey88)
[21:26:30] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:30:16] (CR) RLazarus: puppetmaster: Stop commits to the private repo with empty messages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/704861 (owner: RLazarus)
[21:44:53] SRE, ops-eqiad, DC-Ops, Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc1014.eqiad.wmnet'] ` Of which those **FAILED**: ` ['pc1014.eqiad.wmnet'] `
[21:45:15] ops-eqiad, DC-Ops: hw troubleshooting: Raid battery stuck in recharging for cloudvirt1012.eqiad.wmnet - https://phabricator.wikimedia.org/T286748 (nskaggs) a:Cmjohnson
[21:55:18] (PS1) Clare Ming: Update config for language switching on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/704867 (https://phabricator.wikimedia.org/T286459)
[21:59:54] (CR) Jdlrobson: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/704867 (https://phabricator.wikimedia.org/T286459) (owner: Clare Ming)
[22:02:04] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:26:32] PROBLEM - Check systemd state on
planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:56:58] SRE, Services, Toolhub, Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (bd808)
[23:00:04] brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for US Backport and Config training (Your patch may or may not be deployed at the sole discretion of the deployer). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210715T2300).
[23:02:10] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:03:00] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:05:25] no deployment requests for this training window and i think everyone has been through the demo workflow, calling it early.
[23:05:37] ^ cc: thcipriani.
[23:05:45] brennen wait wait can I add a deployment request
[23:05:59] backport of two GlobalWatchlist patches
[23:06:10] DannyS712: ...kk.
[23:06:28] i will be in the training call in case anyone wants to follow along.
[23:06:54] (PS1) DannyS712: Clean up to watchlistUtils.makeUserLinks [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.14) - https://gerrit.wikimedia.org/r/704814 (https://phabricator.wikimedia.org/T286385)
[23:08:35] DannyS712: any testing to be done for this one?
[23:08:42] (03CR) 10Brennen Bearnes: [C: 03+2] Clean up to watchlistUtils.makeUserLinks [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704814 (https://phabricator.wikimedia.org/T286385) (owner: 10DannyS712) [23:09:34] it needs to be deployed together with the other one, the state in between is broken (accidentally) - they modify the same file, and yes I can test - https://phabricator.wikimedia.org/T286385#7215453 [23:10:14] (03CR) 10DannyS712: "This change is ready for review." [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704815 (https://phabricator.wikimedia.org/T286385) (owner: 10DannyS712) [23:10:28] ^ thats the other one [23:10:39] DannyS712: cool, so merge both [23:11:01] yes please - merge both, sync once afterwards [23:11:07] I'll add them to the deployment calendar [23:11:15] k, +2ing [23:11:20] (03CR) 10Brennen Bearnes: [C: 03+2] Fix creation of mw.Message objects [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704815 (https://phabricator.wikimedia.org/T286385) (owner: 10DannyS712) [23:13:29] (03CR) 10jerkins-bot: [V: 04-1] Fix creation of mw.Message objects [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704815 (https://phabricator.wikimedia.org/T286385) (owner: 10DannyS712) [23:13:41] (03PS1) 10RLazarus: icinga: Add type hints to icinga-status [puppet] - 10https://gerrit.wikimedia.org/r/704875 [23:14:16] (03Merged) 10jenkins-bot: Clean up to watchlistUtils.makeUserLinks [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704814 (https://phabricator.wikimedia.org/T286385) (owner: 10DannyS712) [23:14:20] ^ can ignore that V-1 failure, from a prior patchset [23:14:59] thx [23:15:15] between the first one and the second there was a big refactor that caused merge conflicts for the second cherry pick, PS2 was before I fixed those [23:15:24] * brennen waits on zuul. 
[23:17:07] (03Merged) 10jenkins-bot: Fix creation of mw.Message objects [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/704815 (https://phabricator.wikimedia.org/T286385) (owner: 10DannyS712) [23:19:23] (03CR) 10RLazarus: "Type hints as requested! Some notes inline as pre-resolved comments." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/704875 (owner: 10RLazarus) [23:21:15] DannyS712: patch on mwdebug2001.codfw.wmnet [23:21:55] {{testing}} [23:24:25] hmm, doesn't seem to be working, but I confirmed that they fixed it on the beta cluster, one sec [23:24:49] is mwdebug2001 correct? [23:24:58] DannyS712: let me double check [23:25:37] DannyS712: yeah, confirmed [23:26:19] I did a shift refresh and it works now (I guess the old version of the ResourceLoader module was cached?...) [23:26:27] confirmed to fix the issue [23:26:46] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:20] DannyS712: cool, syncing [23:28:54] !log brennen@deploy1002 Synchronized php-1.37.0-wmf.14/extensions/GlobalWatchlist/modules/watchlistUtils.js: Backport: [[gerrit:704815|Fix creation of mw.Message objects (T286385)]] (duration: 00m 57s) [23:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:04] T286385: XSS in GlobalWatchlist - https://phabricator.wikimedia.org/T286385 [23:29:50] oh, ugg, I didn't realize that allowing stashbot to see the task so that it could post comments also meant that it would read the title and post it here... 
[23:29:59] lessons learned [23:32:44] !log checking stashbot: T286756 [23:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:45] for the record, fix confirmed to work (after not being on mwdebug2001) (took a while because I couldn't get my cache to clear, eventually just logged out and back in and that fixed it) [23:35:29] cool, thx. calling it for this window. [23:36:29] thanks for deploying! [23:36:39] sure thing. [23:36:50] on that note, i retreat to the hammock. :) [23:41:38] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:57:40] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook