[00:00:05] brennen and thcipriani: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T0000). [00:01:33] RECOVERY - Check systemd state on miscweb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:49] RECOVERY - Check systemd state on miscweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:08] (03CR) 10Krinkle: [C: 03+2] rdbms: move mysql isQuotedIdentifier() override to SQLPlatform [core] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803909 (https://phabricator.wikimedia.org/T310214) (owner: 10Krinkle) [00:09:15] * Krinkle staging on mwdebug1002 [00:10:37] (03CR) 10Krinkle: [C: 03+2] Profiler: Inject 'statsd' option from PhpAutoPrepend.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803956 (owner: 10Krinkle) [00:11:40] (03Merged) 10jenkins-bot: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803956 (owner: 10Krinkle) [00:12:49] (03CR) 10Krinkle: [C: 03+2] Profiler: Inject 'statsd' option from PhpAutoPrepend.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803957 (owner: 10Krinkle) [00:14:03] (03Merged) 10jenkins-bot: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803957 (owner: 10Krinkle) [00:15:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:06] !log krinkle@deploy1002 Synchronized wmf-config/PhpAutoPrepend.php: I5810472ae (duration: 03m 20s) [00:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:39] (03PS2) 10Krinkle: Profiler: Remove 
unused mongodb 'xhgui' option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803979 (https://phabricator.wikimedia.org/T180761) [00:17:43] (03CR) 10Krinkle: [C: 03+2] Profiler: Remove unused mongodb 'xhgui' option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803979 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [00:19:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:19:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:26] (03Merged) 10jenkins-bot: Profiler: Remove unused mongodb 'xhgui' option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803979 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [00:21:03] !log krinkle@deploy1002 Synchronized src/Profiler.php: I14ebd2e93ad (duration: 03m 31s) [00:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:17] (03Merged) 10jenkins-bot: rdbms: move mysql isQuotedIdentifier() override to SQLPlatform [core] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803909 (https://phabricator.wikimedia.org/T310214) (owner: 10Krinkle) [00:27:26] (03PS1) 10Cwhite: logstash: truncate labels.normalized_message [puppet] - 10https://gerrit.wikimedia.org/r/804010 (https://phabricator.wikimedia.org/T234565) [00:28:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:29:13] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:14] !log krinkle@deploy1002 Synchronized src/Profiler.php: I43a9e838c287 (1/4) (duration: 03m 32s) [00:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:34:22] !log krinkle@deploy1002 Synchronized wmf-config/PhpAutoPrepend.php: I43a9e838c28745906 (2/4) (duration: 03m 37s) [00:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:40] !log krinkle@deploy1002 Synchronized wmf-config/: I43a9e838c28745906 Labs+ProductionServices (3+4/4) (duration: 03m 36s) [00:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:56] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:51] !log krinkle@deploy1002 Synchronized php-1.39.0-wmf.15/includes/libs/rdbms/: I99b817b3d50ffcdf56, T310214 (duration: 03m 23s) [00:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:54] T310214: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'enwikinews.`categorylinks`' doesn't exist - https://phabricator.wikimedia.org/T310214 [00:51:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:52:57] 10SRE, 10Observability-Alerting: Aggregate check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (10Krinkle) Ref {T310225}. Ref .... [00:53:55] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.647 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:04:29] (03PS2) 10Krinkle: multiversion: Simplify code and improve documentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785308 [01:05:21] (03CR) 10Krinkle: "We can probably cut a fair bit of this down in future patches, but this is mostly a dump of my prior knowledge and whatever I could find i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785308 (owner: 10Krinkle) [01:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:20:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:33:12] (03PS1) 10Catrope: Remove unused setting wgQuickSurveysUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804014 (https://phabricator.wikimedia.org/T285890) [01:51:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:53:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:50:33] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:50:51] (03PS1) 10Samwilson: [beta cluster] Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804017 (https://phabricator.wikimedia.org/T307725) [03:05:18] (03PS5) 10Tim Starling: Switch wgMainStash to db-mainstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129) [03:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:38:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:38:45] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:44:49] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:53:39] (03PS2) 10KartikMistry: Update cxserver to 2022-06-08-124326-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/803901 (https://phabricator.wikimedia.org/T306995) [04:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:34:41] (03PS1) 10DLynch: Sync sampling rates at 9 wikis DiscussionTools is testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804022 (https://phabricator.wikimedia.org/T309260) [04:45:52] * kart_ deploying cxserver.. 
[04:46:09] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-06-08-124326-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/803901 (https://phabricator.wikimedia.org/T306995) (owner: 10KartikMistry) [04:49:18] (03Merged) 10jenkins-bot: Update cxserver to 2022-06-08-124326-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/803901 (https://phabricator.wikimedia.org/T306995) (owner: 10KartikMistry) [04:50:02] (03PS1) 10Tim Starling: Switch wgMainStash back to Redis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804024 (https://phabricator.wikimedia.org/T212129) [04:54:08] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [04:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:45] .. and staging upgrade seems stuck.. [05:03:19] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:04:15] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:06:35] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:08:53] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:09:34] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:55] 10SRE, 10MediaWiki-General, 10Performance-Team, 
10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10tstarling) Mark asked me to prepare a rollback plan which can be used to switch back to Redis if something goes w... [05:12:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:13:01] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:14:03] https://phabricator.wikimedia.org/P29564 is log of failure, looks like timeout. Will surely need service SREs to debug further. [05:19:47] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:01] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:23:31] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:24:39] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:26:07] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance 
cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:32:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [05:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [05:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T310011)', diff saved to https://phabricator.wikimedia.org/P29565 and previous config saved to /var/cache/conftool/dbconfig/20220609-053253-marostegui.json [05:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:57] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [05:36:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:41:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:43:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T310011)', diff saved to https://phabricator.wikimedia.org/P29566 and previous config saved to /var/cache/conftool/dbconfig/20220609-054306-marostegui.json [05:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:10] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and 
remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [05:58:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29567 and previous config saved to /var/cache/conftool/dbconfig/20220609-055811-marostegui.json [05:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:05] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T0600). [06:02:21] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for [06:02:21] ource sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [06:04:29] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:07:15] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:08:39] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] 
https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:10:59] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:13:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29568 and previous config saved to /var/cache/conftool/dbconfig/20220609-061316-marostegui.json [06:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:43] (03PS1) 10Marostegui: db2087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/804190 [06:15:10] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Marostegui) Thank you Tim! I will bring this up on our Team meeting on Monday. [06:16:29] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:17:16] (03CR) 10Marostegui: [C: 03+2] db2087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/804190 (owner: 10Marostegui) [06:18:47] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:11] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:24:53] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:27:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:28:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T310011)', diff saved to https://phabricator.wikimedia.org/P29569 and previous config saved to /var/cache/conftool/dbconfig/20220609-062821-marostegui.json [06:28:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [06:28:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [06:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:27] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [06:28:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T310011)', diff saved to https://phabricator.wikimedia.org/P29570 and previous config saved to /var/cache/conftool/dbconfig/20220609-062829-marostegui.json [06:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:57] (03CR) 10Muehlenhoff: [C: 03+2] wikistats: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803944 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [06:30:03] (03PS2) 10Muehlenhoff: wikistats: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803944 (https://phabricator.wikimedia.org/T308013) [06:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:34:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T310011)', diff saved to https://phabricator.wikimedia.org/P29571 and previous config saved to /var/cache/conftool/dbconfig/20220609-063443-marostegui.json [06:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:48] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [06:42:33] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:49:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to 
https://phabricator.wikimedia.org/P29572 and previous config saved to /var/cache/conftool/dbconfig/20220609-064948-marostegui.json [06:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3003.esams.wmnet [06:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:05] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [06:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:34] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [06:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:51] 10SRE, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) - https://phabricator.wikimedia.org/T209182 (10akosiaris) Hi everyone, Since the last comment is from 2 years ago from a person no longer with t... [07:00:04] Amir1 and apergos: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T0700). [07:00:04] samwilson: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:16] good morning! 
[07:00:19] we have a trainee signed up [07:01:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3003.esams.wmnet [07:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:55] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb={CREATE,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:02:17] hey samwilson: you about? [07:02:41] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [07:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P29573 and previous config saved to /var/cache/conftool/dbconfig/20220609-070453-marostegui.json [07:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:48] apergos: yep, here now! sorry, was eating an apple. [07:06:26] we have a trainee today [07:06:32] I forget if you do self deploys or not [07:07:04] no, very happy for you to do it, or whoever's learning :) [07:07:05] samwilson: [07:07:08] ok! [07:07:43] Amir1: do you happen to be about? I'd prefer not to train and deploy at the same time, though if you can't be here, that's ok [07:10:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3003.esams.wmnet to ganeti01.svc.esams.wmnet [07:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:19] I'm going to assume Amir is not here and we'll proceed. this will be slower than normal because I'm talking through the procedure with our trainee :-) [07:11:42] no worries. I'm around for the next hours. 
[07:12:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:12:03] apergos: Amir1 is on vacation [07:12:10] ok. no worries, thanks for the info [07:12:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3003.esams.wmnet to ganeti01.svc.esams.wmnet [07:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:46] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [07:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:32] !log drain ganeti3002 for firmware update/reimage T308238 [07:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:36] T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 [07:18:31] (03CR) 10Jaime Nuche: [C: 03+2] [beta cluster] Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804017 (https://phabricator.wikimedia.org/T307725) (owner: 10Samwilson) [07:19:18] (03Merged) 10jenkins-bot: [beta cluster] Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804017 (https://phabricator.wikimedia.org/T307725) (owner: 10Samwilson) [07:19:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T310011)', diff saved to https://phabricator.wikimedia.org/P29574 and previous config saved to /var/cache/conftool/dbconfig/20220609-071958-marostegui.json [07:20:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:20:01] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:20:03] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T310011)', diff saved to https://phabricator.wikimedia.org/P29575 and previous config saved to /var/cache/conftool/dbconfig/20220609-072006-marostegui.json [07:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:23:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:21] 10SRE, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) - https://phabricator.wikimedia.org/T209182 (10Volans) 05Open→03Resolved a:03Volans This can be solved, was just forgotten AFAICT. We do us... 
[07:26:14] samwilson: your change is now live on mwdebug1002, feel free to test ;-) [07:26:30] apergos: thanks, testing now [07:28:42] apergos: hmm, I may have got something wrong with the config. I'm expecting to see a change at e.g. https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Test&action=edit but it's not working. [07:28:56] (with debug1002) [07:29:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:45] well you can't really route changes to mwdebug (prod) for beta cluster requests [07:29:58] that's a bit of a trick "testing" request :-) [07:30:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:30:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:12] oh hehe right! [07:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:31] now the +2 should have made your changes show up immediately in beta, I think. that's certainly true for mw core and extensions [07:30:40] so.... when's config go live for beta cluster? [07:30:41] I think it's true for wmf-config [07:30:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:59] can you check on a random app server over there or should I? [07:31:33] to see if the config has been updated? [07:31:36] samwilson: it seems the config change should be in beta -> https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/ [07:31:44] I'm not actually sure where to check; can you do it? [07:31:54] thanks jnuche for that! 
[07:32:01] in particular: https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/396/ [07:32:10] ( jnuche is our trainee doing the deployment today.) [07:32:43] jnuche: thanks! hmm I think I've got the config wrong then [07:34:38] If I quickly make a follow-up patch now can you deploy it? [07:34:41] I do see the config on a random mw instance in deployment-prep [07:34:49] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:35:12] (03PS1) 10Samwilson: [beta cluster] Fix $wgVectorMaxWidthOptions array depth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804255 (https://phabricator.wikimedia.org/T307725) [07:35:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:35:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T310011)', diff saved to https://phabricator.wikimedia.org/P29576 and previous config saved to /var/cache/conftool/dbconfig/20220609-073546-marostegui.json [07:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:51] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:35:51] (03PS2) 10Samwilson: [beta cluster] Fix $wgVectorMaxWidthOptions array depth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804255 (https://phabricator.wikimedia.org/T307725) [07:36:00] samwilson: should we deploy the current one around first, then the new one too, or revert ? 
[07:36:58] hmm is it bad to deploy and then follow-up? or would you rather have a revert? [07:37:05] (03PS1) 10Marostegui: Revert "db2087: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/803912 [07:37:51] apergos: the new patch is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/804255 [07:38:08] (03CR) 10Marostegui: [C: 03+2] Revert "db2087: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/803912 (owner: 10Marostegui) [07:38:29] samwilson: in this case since it's beta, I think it's fine to just deploy the first one fully and then follow up. [07:38:59] okay, cool. yeah that sounds good. I'm sorry for the messiness! [07:39:31] no worries, we'll get it sorted :-) [07:39:51] so for the next patch, we'll do this in three steps [07:39:54] 1) merge [07:39:58] jnuche: I'm just trying to give you more practice! :P [07:40:00] 2) you test in beta then, once it's available [07:40:07] :D [07:40:14] 3) once that checks out, continue with regular deploy. sound good? 
[07:40:29] apergos: yep that makes sense [07:43:09] !log jnuche@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:804017|[beta cluster] Update $wgVectorMaxWidthOptions to include action=edit (T307725)]] (duration: 03m 41s) [07:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:13] T307725: Make action=edit with 2010 wikitext editor a full-width page in Vector-2022 - https://phabricator.wikimedia.org/T307725 [07:43:39] ordinarily we would ask you to test again but in this case, skipping that step, samwilson :-) [07:43:59] !log depool cp5006 for trouble shooting instance state unknown [07:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:02] (03CR) 10Jaime Nuche: [C: 03+2] [beta cluster] Fix $wgVectorMaxWidthOptions array depth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804255 (https://phabricator.wikimedia.org/T307725) (owner: 10Samwilson) [07:44:09] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:44:20] sure! 
[07:44:41] samwilson: please do put your second patch in the calendar [07:44:47] (03Merged) 10jenkins-bot: [beta cluster] Fix $wgVectorMaxWidthOptions array depth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804255 (https://phabricator.wikimedia.org/T307725) (owner: 10Samwilson) [07:44:56] oh yep, good point; doing now [07:45:00] both for the record and for easy access to the deployment commands link :- [07:45:17] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:45:43] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10elukey) a:05calbon→03None [07:45:54] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10elukey) Chris approved, I think that we can proceed! [07:45:58] apergos: done https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_June_9 [07:46:12] awesome! [07:47:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10elukey) [07:47:35] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:48:09] samwilson: the second patch is in beta, please go ahead and take a look :) [07:49:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10elukey) My understanding is that we are requiring the same level of access to ORES nodes to a new team member, refactoring a bit how groups are related to each o... 
[07:49:15] jnuche: hooray yep it works now :-) [07:49:32] great! [07:50:40] stick around please, we'll want to have you here all the way through the official deployment [07:50:40] thanks for putting up with my confusions! [07:50:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P29577 and previous config saved to /var/cache/conftool/dbconfig/20220609-075051-marostegui.json [07:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:27] yep, no worries I'm still here [07:51:36] samwilson: no worries! the change is now on mwdebug1001, but there's no testing to be done so we'll move on to syncing the rest of prod [07:52:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:29] I've checked anyway, and all is well :) [07:53:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:38] !log drop DRDB disk template from ml-etcd2* nodes - T310073 [07:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:42] T310073: Investigate high latencies registered by the ml-serve api control plane - https://phabricator.wikimedia.org/T310073 [07:55:40] !log jnuche@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:804255|[beta cluster] Fix $wgVectorMaxWidthOptions array depth 
(T307725)]] (duration: 03m 40s) [07:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:45] T307725: Make action=edit with 2010 wikitext editor a full-width page in Vector-2022 - https://phabricator.wikimedia.org/T307725 [07:56:29] samwilson: done! [07:56:45] both changes have been sync'ed out everywhere [07:58:09] jnuche: terrific, thanks :-) [07:58:10] (03PS1) 10KartikMistry: Update nodejs -> node command [deployment-charts] - 10https://gerrit.wikimedia.org/r/804256 [07:58:29] !log UTC morning backport and config training window done [07:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:01] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:01:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1061.eqiad.wmnet with OS bullseye [08:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:32] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1061.eqiad.wmnet with OS bullseye [08:01:35] (03CR) 10Filippo Giunchedi: [C: 03+2] Merge tag 'upstream/0.0.7' into debian [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803870 (owner: 10Filippo Giunchedi) [08:01:39] (03CR) 10Filippo Giunchedi: [C: 03+2] New release 0.0.7-1 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803871 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:02:19] RECOVERY - etcd request latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:02:25] (03Merged) 10jenkins-bot: Merge tag 'upstream/0.0.7' into debian [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803870 (owner: 10Filippo Giunchedi) [08:02:27] (03Merged) 10jenkins-bot: New release 0.0.7-1 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803871 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:02:55] (03PS2) 10KartikMistry: Update nodejs -> node command [deployment-charts] - 10https://gerrit.wikimedia.org/r/804256 [08:05:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P29578 and previous config saved to /var/cache/conftool/dbconfig/20220609-080556-marostegui.json [08:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:09] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:09:45] (03PS8) 10Slyngshede: class:apt Add new private repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/803512 [08:12:13] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:20:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1061.eqiad.wmnet with reason: host reimage [08:21:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T310011)', diff saved to https://phabricator.wikimedia.org/P29579 and previous config saved to /var/cache/conftool/dbconfig/20220609-082102-marostegui.json [08:21:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:21:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:22:43] (03CR) 10Filippo Giunchedi: [C: 03+2] "Merging so we can (PCC) test at least" [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [08:22:50] (03PS23) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [08:23:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1061.eqiad.wmnet with reason: host reimage [08:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:32:12] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:32:26] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:32:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29580 and previous config saved to /var/cache/conftool/dbconfig/20220609-083232-marostegui.json [08:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:37] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [08:34:53] (03CR) 10Filippo Giunchedi: Add role::netmon to the netmon1003 instance. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [08:38:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1061.eqiad.wmnet with OS bullseye [08:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:26] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1061.eqiad.wmnet with OS bullseye completed: - ms-be1061 (**PASS**) - Downtim... [08:39:26] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:40:21] !log mmandere@cumin1001 conftool action : set/pooled=no; selector: name=cp5006.* [08:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:04] (03PS5) 10Jaime Nuche: scap: bootstrap freshly provisioned scap targets [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) [08:41:24] (03CR) 10Jaime Nuche: scap: bootstrap freshly provisioned scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche) [08:42:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede) [08:46:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:46:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29581 and previous config saved to /var/cache/conftool/dbconfig/20220609-084620-marostegui.json [08:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:25] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [08:46:29] (03CR) 10Muehlenhoff: [C: 03+2] scap: bootstrap freshly provisioned scap targets [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche) [08:48:48] (03PS1) 10Filippo Giunchedi: Fix zookeeper typo [puppet] - 10https://gerrit.wikimedia.org/r/804263 [08:51:18] (ProbeDown) 
resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:01:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29583 and previous config saved to /var/cache/conftool/dbconfig/20220609-090125-marostegui.json [09:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:36] (03PS1) 10Filippo Giunchedi: phabricator: add blackbox http check [puppet] - 10https://gerrit.wikimedia.org/r/804266 (https://phabricator.wikimedia.org/T305847) [09:03:18] (03PS1) 10Volans: reports.coherence: exclude patch panels [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/804267 [09:04:25] (03PS1) 10Jaime Nuche: scap: fix bootstrap exec command [puppet] - 10https://gerrit.wikimedia.org/r/804268 (https://phabricator.wikimedia.org/T309713) [09:05:27] (03CR) 10Btullis: [C: 03+1] "Thanks. Yes I think it's fine to start the new metrics based on the fixed typo." [puppet] - 10https://gerrit.wikimedia.org/r/804263 (owner: 10Filippo Giunchedi) [09:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:05:59] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:07:06] (03CR) 10Filippo Giunchedi: [C: 03+2] "Cheers Ben!" 
[puppet] - 10https://gerrit.wikimedia.org/r/804263 (owner: 10Filippo Giunchedi) [09:08:02] (03CR) 10Ayounsi: [C: 03+1] reports.coherence: exclude patch panels [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/804267 (owner: 10Volans) [09:08:10] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35802/console" [puppet] - 10https://gerrit.wikimedia.org/r/804266 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:08:29] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:10:21] (03CR) 10Muehlenhoff: [C: 03+2] scap: fix bootstrap exec command [puppet] - 10https://gerrit.wikimedia.org/r/804268 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche) [09:10:40] (03CR) 10Volans: [C: 03+2] reports.coherence: exclude patch panels [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/804267 (owner: 10Volans) [09:11:22] (03Merged) 10jenkins-bot: reports.coherence: exclude patch panels [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/804267 (owner: 10Volans) [09:11:53] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:12:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1143 on s4 with small weight after installing 10.6 T310114', diff saved to https://phabricator.wikimedia.org/P29584 and previous config saved to /var/cache/conftool/dbconfig/20220609-091224-root.json [09:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:29] T310114: Migrate a s4 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T310114 [09:14:18] (03CR) 10Filippo Giunchedi: [V: 03+1] "See PCC, AFAIK it isn't possible to preview the changes on prometheus hosts due to exported resources usage (i.e. 
we see diff on phab host" [puppet] - 10https://gerrit.wikimedia.org/r/804266 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:15:47] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [09:16:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29585 and previous config saved to /var/cache/conftool/dbconfig/20220609-091630-marostegui.json [09:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:23] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:19:23] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:24:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [09:24:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [09:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:24:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:24:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298560)', diff saved to https://phabricator.wikimedia.org/P29586 and previous config saved to /var/cache/conftool/dbconfig/20220609-092413-ladsgroup.json [09:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:21] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [09:25:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:26:45] !log killed enwiki's refreshlinksrecommandations (T299021) [09:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:51] T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021 [09:28:01] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:30:15] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:31:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29587 and previous config saved to /var/cache/conftool/dbconfig/20220609-093135-marostegui.json [09:31:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [09:31:39] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [09:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:31:41] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [09:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T310011)', diff saved to https://phabricator.wikimedia.org/P29588 and previous config saved to /var/cache/conftool/dbconfig/20220609-093148-marostegui.json [09:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:28] (03PS1) 10Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847) [09:36:49] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:37:27] (03PS1) 10Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 [09:38:05] (03PS2) 10Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - 
10https://gerrit.wikimedia.org/r/804276 [09:39:01] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:45:47] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:47:49] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:50:03] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:56:34] (03CR) 10MarcoAurelio: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803916 (https://phabricator.wikimedia.org/T309866) (owner: 10MarcoAurelio) [09:58:14] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons. [09:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] mvolz: Your horoscope predicts another unfortunate Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1000). 
[10:07:03] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:12:46] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [10:16:01] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [10:16:17] looking [10:16:41] checking too [10:16:50] looks like a spike for france [10:17:46] indeed, doesn't seem actionable right now to me ? [10:17:58] 👀 [10:18:18] looks like it's alredy going down too [10:18:56] sorry I forgot to ack, doing so now [10:19:04] checking librenms [10:19:50] we can see a drop of inbound traffic [10:20:09] on telia mostly [10:20:32] and drop of outbound, most likely as a consequence on telia and orange [10:20:33] https://librenms.wikimedia.org/graphs/to=1654769700/id=23135/type=port_bits/from=1654748100/ [10:21:01] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [10:21:52] indeed, and looks like it is coming back now [10:22:00] "now" as in, the next datapoint in librenms [10:22:18] indeed [10:22:25] can see the same here https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=now-1h&to=now&var-site=drmrs&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4 [10:22:26] so looks like a telia "blip" [10:23:05] agreed [10:23:25] going back to my lunch, happy to discuss tuning NEL too later [10:28:33] RECOVERY - SSH on aqs1008.mgmt is OK: SSH 
OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:32:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T310011)', diff saved to https://phabricator.wikimedia.org/P29590 and previous config saved to /var/cache/conftool/dbconfig/20220609-103204-marostegui.json [10:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:09] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [10:47:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29591 and previous config saved to /var/cache/conftool/dbconfig/20220609-104709-marostegui.json [10:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:38] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) @nskaggs anyone with access to Netbox and ability to run homer (which I believe should be most of SRE) shoul... [10:50:52] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10Tobi_WMDE_SW) >>! In T310055#7989747, @KFrancis wrote: > @MoritzMuehlenhoff Thanks for checking in. Because Goran is no longer an employee of WMDE, I should process... 
[10:54:37] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:08] !log restart cp5006 [10:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:23] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 238.79 ms [11:02:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29592 and previous config saved to /var/cache/conftool/dbconfig/20220609-110214-marostegui.json [11:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T310011)', diff saved to https://phabricator.wikimedia.org/P29593 and previous config saved to /var/cache/conftool/dbconfig/20220609-111719-marostegui.json [11:17:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:17:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:25] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:32] !log pool cp5006 after restart [11:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:05] !log mmandere@cumin1001 conftool action : set/pooled=yes; selector: name=cp5006.* [11:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1101.eqiad.wmnet with 
reason: Maintenance [11:29:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29594 and previous config saved to /var/cache/conftool/dbconfig/20220609-112945-marostegui.json [11:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:48] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:35:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:38:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons. [11:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:18] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. 
[11:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29595 and previous config saved to /var/cache/conftool/dbconfig/20220609-114740-marostegui.json [11:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:46] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:52:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. [11:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29596 and previous config saved to /var/cache/conftool/dbconfig/20220609-120245-marostegui.json [12:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti3002.esams.wmnet with reason: Remove from cluster for firmware update and eventual reimage [12:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti3002.esams.wmnet with reason: Remove from cluster for firmware update and eventual reimage [12:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:28] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. 
[12:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29597 and previous config saved to /var/cache/conftool/dbconfig/20220609-121750-marostegui.json [12:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:46] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10MoritzMuehlenhoff) [12:23:48] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH ganeti3002 is removed from the cluster, downtimed and needs the same firmware/NIC updates to enable the reimage to Bullseye. [12:32:06] 10SRE, 10ops-eqiad: Failed PSU on ganeti1023 - https://phabricator.wikimedia.org/T310041 (10Jclark-ctr) 05Open→03Resolved Reseated power cable [12:32:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29598 and previous config saved to /var/cache/conftool/dbconfig/20220609-123256-marostegui.json [12:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [12:33:01] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [12:33:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [12:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 
hosts with reason: Maintenance [12:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: Maintenance [12:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:39:04] 10SRE, 10Analytics: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10BTullis) @hashar Is this latency with Archiva still apparent? I guess it probably is, since you had to increase the timeouts again in February of this year. [12:43:00] 10SRE, 10Analytics, 10Data-Engineering, 10Traffic-Icebox: varnishkafka / ATSkafka should support setting the kafka message timestamp - https://phabricator.wikimedia.org/T277553 (10BTullis) Adding the #data-engineering tag so that this ticket does not get dropped when we deprecate #analytics. 
[12:45:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:45:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29599 and previous config saved to /var/cache/conftool/dbconfig/20220609-124529-marostegui.json [12:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:34] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [12:47:37] RECOVERY - IPMI Sensor Status on ganeti1023 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:47:59] (03PS1) 10Filippo Giunchedi: netops: add PingUnavailable alert [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) [12:48:00] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [12:49:23] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:49:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. 
[12:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:38] !log installing xen security updates (client-side libs only) [12:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:35] !log installing libjpeg-turbo security updates [13:00:36] 10SRE, 10Analytics, 10Traffic-Icebox: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10BTullis) Should we decline this ticket? [13:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29600 and previous config saved to /var/cache/conftool/dbconfig/20220609-130042-marostegui.json [13:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:47] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [13:01:06] (03PS1) 10Jaime Nuche: scap: switch over from Debian package to self-installed scap [puppet] - 10https://gerrit.wikimedia.org/r/804306 (https://phabricator.wikimedia.org/T303559) [13:01:21] (03CR) 10Filippo Giunchedi: ""experimental" for now to have sth to put out there and iterate on" [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [13:03:32] 10SRE, 10ops-codfw: (Need By:TBD) 
rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [13:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:10:07] (03CR) 10Hnowlan: [C: 03+2] Re-enable OSM sync in codfw [puppet] - 10https://gerrit.wikimedia.org/r/803893 (owner: 10MSantos) [13:15:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1062.eqiad.wmnet with OS bullseye [13:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:13] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1062.eqiad.wmnet with OS bullseye [13:15:35] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[13:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P29601 and previous config saved to /var/cache/conftool/dbconfig/20220609-131548-marostegui.json [13:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:20] (03PS1) 10Jaime Nuche: scap: remove scap Debian package from targets [puppet] - 10https://gerrit.wikimedia.org/r/804311 (https://phabricator.wikimedia.org/T303559) [13:17:49] (03PS4) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [13:19:07] (03CR) 10JMeybohm: black format cookbooks/sre/__init__.py (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/802810 (owner: 10JMeybohm) [13:19:11] (03Abandoned) 10JMeybohm: black format cookbooks/sre/__init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/802810 (owner: 10JMeybohm) [13:20:22] (03PS5) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [13:20:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:22:00] 10SRE, 10Data-Engineering, 10Traffic-Icebox: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10JAllemandou) >>! In T232795#7991992, @BT... 
[13:23:27] (03CR) 10CI reject: [V: 04-1] Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [13:26:06] (03CR) 10JMeybohm: Make SREBatchBase operate on host groups (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [13:27:46] (03PS6) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [13:30:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P29602 and previous config saved to /var/cache/conftool/dbconfig/20220609-133053-marostegui.json [13:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:59] (03CR) 10Muehlenhoff: [C: 03+2] profile::mariadb::ferm_misc: Remove old buster IDP nodes [puppet] - 10https://gerrit.wikimedia.org/r/803883 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [13:34:39] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1062.eqiad.wmnet with reason: host reimage [13:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:08] (03CR) 10Ayounsi: netops: add PingUnavailable alert (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [13:37:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1062.eqiad.wmnet with reason: host reimage [13:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29603 and previous config saved to /var/cache/conftool/dbconfig/20220609-134558-marostegui.json [13:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:46:06] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [13:47:56] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. [13:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:50] (03PS2) 10Filippo Giunchedi: netops: add PingUnavailable alert [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) [13:50:56] (03CR) 10Filippo Giunchedi: "Thank you for the feedback!" [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [13:54:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1062.eqiad.wmnet with OS bullseye [13:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:24] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1062.eqiad.wmnet with OS bullseye completed: - ms-be1062 (**PASS**) - Downtim... 
[13:57:34] (03CR) 10Ayounsi: netops: add PingUnavailable alert (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [13:59:29] (03PS2) 10Muehlenhoff: Enable webperf1004/2004 as new Arclamp hosts [puppet] - 10https://gerrit.wikimedia.org/r/802749 [14:07:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1063.eqiad.wmnet with OS bullseye [14:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:00] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1063.eqiad.wmnet with OS bullseye [14:09:31] !log masking Excimer/Arclamp services/timers on webperf1002/2002 T305460 [14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:35] T305460: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 [14:11:03] (03CR) 10Filippo Giunchedi: netops: add PingUnavailable alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [14:12:16] (03PS3) 10Filippo Giunchedi: netops: add PingUnavailable alert [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) [14:14:18] (03CR) 10Ayounsi: netops: add PingUnavailable alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [14:15:29] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:43] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The following units failed: 
arclamp_compress_logs.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:52] (03PS4) 10Filippo Giunchedi: netops: add PingUnreachable alert [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) [14:16:54] ^this is expected, I'll downtime/ack [14:17:10] (03CR) 10Filippo Giunchedi: netops: add PingUnreachable alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [14:17:29] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on webperf1002.eqiad.wmnet with reason: Migration to new Bullseye nodes [14:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on webperf1002.eqiad.wmnet with reason: Migration to new Bullseye nodes [14:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on webperf2002.codfw.wmnet with reason: Migration to new Bullseye nodes [14:17:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on webperf2002.codfw.wmnet with reason: Migration to new Bullseye nodes [14:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:01] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:25] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [14:26:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1063.eqiad.wmnet with reason: host 
reimage [14:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1063.eqiad.wmnet with reason: host reimage [14:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:32:41] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1064 - https://phabricator.wikimedia.org/T310181 (10Cmjohnson) 05Open→03Declined duplicate [14:35:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host aqs1018.mgmt.eqiad.wmnet with reboot policy FORCED [14:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:54] (03CR) 10Muehlenhoff: [C: 03+2] Enable webperf1004/2004 as new Arclamp hosts [puppet] - 10https://gerrit.wikimedia.org/r/802749 (owner: 10Muehlenhoff) [14:36:21] (03CR) 10Herron: [C: 03+1] sre: include tcp probes in alerts [alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:43:56] (03PS1) 10MSantos: mobileapps: bump to 2022-06-06-111800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/804330 [14:44:24] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:28] (03PS2) 10Cwhite: logstash: truncate labels.normalized_message [puppet] - 10https://gerrit.wikimedia.org/r/804010 (https://phabricator.wikimedia.org/T234565) [14:45:35] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - 
https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [14:45:47] (03PS2) 10Muehlenhoff: Point active arclamp host to webperf1004 and update dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/802750 (https://phabricator.wikimedia.org/T305460) [14:48:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1018.mgmt.eqiad.wmnet with reboot policy FORCED [14:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:06] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Cmjohnson) [14:52:08] (03CR) 10Muehlenhoff: [C: 03+2] Point active arclamp host to webperf1004 and update dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/802750 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [14:52:30] (03CR) 10Cwhite: [C: 03+2] logstash: truncate labels.normalized_message [puppet] - 10https://gerrit.wikimedia.org/r/804010 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [14:53:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:22] (03CR) 10BCornwall: Traffic: Add PyBal BGP sessions (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [14:55:02] (03CR) 10Ahmon Dancy: "LGTM overall." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/785308 (owner: 10Krinkle) [14:55:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:19] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [14:56:04] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster [14:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster [14:57:38] (03PS1) 10Muehlenhoff: ALso point codfw to the new node [puppet] - 10https://gerrit.wikimedia.org/r/804333 (https://phabricator.wikimedia.org/T305460) [14:58:39] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10ayounsi) 05Declined→03Open The topic came back again Today as hosts requests in T307641 got provisioned without the additional IPs requiring heavy manual work to get it fixed. And it's not t... 
[14:59:31] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:37] (03CR) 10Muehlenhoff: [C: 03+2] ALso point codfw to the new node [puppet] - 10https://gerrit.wikimedia.org/r/804333 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [15:00:00] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Volans) 05Resolved→03Open Re-opening as those were not provisioned as cassandra hosts and the additional DNS records where not generated by the provisioning scr... [15:00:10] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [15:01:13] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [15:01:37] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:03:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:20] !log volans@cumin1001 START - Cookbook sre.dns.netbox [15:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:37] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.
returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [15:07:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1063.eqiad.wmnet with OS bullseye [15:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:37] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1063.eqiad.wmnet with OS bullseye completed: - ms-be1063 (**PASS**) - Downtim... [15:10:07] (03PS3) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) [15:10:19] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:23] (03PS1) 10Muehlenhoff: Remove webperf1002/webperf2002 from Kafka firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/804334 (https://phabricator.wikimedia.org/T305460) [15:15:08] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: bullseye upgrade - bking@cumin1001 - T289135 [15:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:13] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - 
https://phabricator.wikimedia.org/T289135 [15:16:19] (03PS1) 10Muehlenhoff: Remove rsync config only needed for stretch->bullseye migration [puppet] - 10https://gerrit.wikimedia.org/r/804339 (https://phabricator.wikimedia.org/T305460) [15:18:01] (03PS1) 10Muehlenhoff: coal: Remove support for pre Bullseye installs [puppet] - 10https://gerrit.wikimedia.org/r/804340 (https://phabricator.wikimedia.org/T305460) [15:19:39] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2036.codfw.wmnet with OS bullseye [15:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:49] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:20:30] (03PS1) 10Muehlenhoff: Switch old Stretch arclamp nodes to role::insetup until eventual decom [puppet] - 10https://gerrit.wikimedia.org/r/804341 (https://phabricator.wikimedia.org/T305460) [15:25:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:33:51] (03PS1) 10Ayounsi: CDN: disable caching for netbox-exports [puppet] - 10https://gerrit.wikimedia.org/r/804345 (https://phabricator.wikimedia.org/T296452) [15:35:13] (03PS2) 10DCausse: [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) [15:35:43] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2036.codfw.wmnet with reason: host reimage [15:35:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/804345 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:38:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2036.codfw.wmnet with reason: host reimage [15:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:12] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:11] (03CR) 10Volans: [C: 03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/804345 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:44:50] (03CR) 10Ayounsi: [C: 03+2] CDN: disable caching for netbox-exports [puppet] - 10https://gerrit.wikimedia.org/r/804345 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:45:18] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Volans) I've created the records for 2 cassandra instances (`-a` and `-b`) in Netbox. Changelog: https://netbox.wikimedia.org/extras/changelog/?request_id=774eba10-... 
[15:46:37] !log set cache "pass" to netbox-exports [15:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:43] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:29] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:48:13] PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:36] ^^ these are getting reimaged ATM, not sure why alerts are still on [15:48:42] will try and silence [15:52:21] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1142.eqiad.wmnet with OS buster [15:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster execute... [15:53:08] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Volans) 05Open→03Resolved Re-resolving. 
[15:53:12] PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [15:53:21] (03CR) 10Jdlrobson: [C: 03+1] Remove unused setting wgQuickSurveysUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804014 (https://phabricator.wikimedia.org/T285890) (owner: 10Catrope) [15:53:30] !log installing curl security updates [15:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:41] (03PS3) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [15:54:12] RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [15:56:45] 10SRE, 10Observability-Alerting: Aggregate check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (10Krinkle) [15:57:52] !log btullis@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs: Rolling AQS Cassandra cluster to pick up new encryption settings - btullis@cumin1001 [15:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:09] (MXQueueHigh) firing: MX host mx1001:9100 has many queued messages: 4066 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [15:58:40] mhhh checking [15:58:43] sweet :) [15:58:58] looking as well [15:59:16] thank you jhathaway ! [16:00:05] jbond and rzl: How many deployers does it take to do Puppet request window deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:36] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2036.codfw.wmnet with OS bullseye [16:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:47] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2036.codfw.wmnet with OS bullseye completed... [16:00:50] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye [16:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:59] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye [16:04:59] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet [16:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:43] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: 10Eevans) [16:09:08] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1039.eqiad.wmnet [16:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:55] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye [16:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:03] !log 
bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2054.codfw.wmnet with OS bullseye [16:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:07] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed... [16:10:11] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2054.codfw.wmnet with OS bullseye [16:13:09] (MXQueueHigh) resolved: MX host mx1001:9100 has many queued messages: 4074 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [16:14:56] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2054.codfw.wmnet with OS bullseye [16:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:04] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2054.codfw.wmnet with OS bullseye executed... [16:16:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Papaul) I tested the pxe boot on an-worker1142 and server was not getting anything from dhcp server after debug , I found out that the server... 
[16:17:04] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: bullseye upgrade - bking@cumin1001 - T289135 [16:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:09] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [16:21:57] (03PS3) 10Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 [16:29:07] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:32:59] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:36:17] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2022-06-06-111800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/804330 (owner: 10MSantos) [16:38:03] (03CR) 10Cathal Mooney: netops: add PingUnreachable alert (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [16:39:51] (03Merged) 10jenkins-bot: mobileapps: bump to 2022-06-06-111800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/804330 (owner: 10MSantos) [16:43:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs: Rolling AQS Cassandra 
cluster to pick up new encryption settings - btullis@cumin1001 [16:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:56] jouncebot nowandnext [16:46:56] For the next 0 hour(s) and 13 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1600) [16:46:56] In 1 hour(s) and 13 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1800) [16:52:18] dancy: nothing happening in the puppet window [16:52:29] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: testing [16:52:30] 👍🏾 Thanks Reuven [16:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:49] (03PS1) 10Esanders: Disable DiscussionTools' visualenhancements feature in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804395 [17:01:38] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti3002.esams.wmnet with OS bullseye [17:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:42] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti3002.esams.wmnet with OS bullseye [17:03:41] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) elastic2053.codfw.wmnet and elastic2054.codfw.wmnet both failed to reimage, this could be the same outdated firmware issue we saw when r... [17:04:14] 10SRE, 10Analytics: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10hashar) @BTullis yes archiva is still rather slow. 
From the verbose curl commands above T273086#6783722, there is a large delay (1+ seconds) before the transfer start and t... [17:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:10:26] (03CR) 10AOkoth: [C: 03+1] vrts: rename systemd timer to train spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:11:27] (03CR) 10Dzahn: [C: 03+2] vrts: rename systemd timer to train spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:11:36] (03PS2) 10Dzahn: vrts: rename systemd timer to train spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) [17:11:39] (03CR) 10Dzahn: [V: 03+2] vrts: rename systemd timer to train spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:12:12] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [17:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:43] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [17:12:46] (03CR) 10Ottomata: airflow:manifests:instance.pp: Bump up number of DAG processors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns) [17:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:16] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:35] (03PS1) 10Dduvall: Truncate failed requests errors to 4kB 
[extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 [17:14:06] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:21] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:06] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:57] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3002.esams.wmnet with reason: host reimage [17:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:14] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "typo in here, Arnold following up" [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:22:22] (03PS1) 10AOkoth: vrts: fix type in timer name [puppet] - 10https://gerrit.wikimedia.org/r/804397 (https://phabricator.wikimedia.org/T293942) [17:23:14] (03CR) 10Dzahn: [C: 03+2] vrts: fix type in timer name [puppet] - 10https://gerrit.wikimedia.org/r/804397 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:23:42] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3002.esams.wmnet with reason: host reimage [17:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:53] (03PS1) 10AOkoth: vrts: rename cleanup cache service [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) [17:29:08] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:29:23] !log aokoth@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host mc1040.eqiad.wmnet [17:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:31:12] (03CR) 10Dzahn: "let's not change the user name yet" [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:31:16] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: vtrs_train_spamassassin.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:09] ^ on that [17:32:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1064.eqiad.wmnet with OS bullseye [17:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:23] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1064.eqiad.wmnet with OS bullseye [17:32:31] (03PS2) 10AOkoth: vrts: rename cleanup cache service [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) [17:33:04] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:33:52] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:18] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
mc1040.eqiad.wmnet [17:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:27] (03CR) 10Dzahn: [C: 03+2] vrts: rename cleanup cache service [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:39:29] i think i'll re-roll group0 a little early unless someone objects. we're doing group0/group1/all today if all goes well with the former two [17:39:43] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye [17:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:51] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye [17:40:06] (03CR) 10AOkoth: [C: 03+1] vrts: rename TicketExport2Mbox file [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:40:14] (03CR) 10Dzahn: [C: 03+2] vrts: rename TicketExport2Mbox file [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:40:19] (03PS2) 10Dzahn: vrts: rename TicketExport2Mbox file [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) [17:40:21] (03CR) 10CI reject: [V: 04-1] Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall) [17:40:24] (03CR) 10Dzahn: [V: 03+2] vrts: rename TicketExport2Mbox file [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:40:43] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3002.esams.wmnet with OS bullseye 
[17:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:47] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti3002.esams.wmnet with OS bullseye completed: - ganeti3002 (**PASS**) - Downtimed on Icinga/Ale... [17:41:42] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:42:21] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10RobH) a:05RobH→03MoritzMuehlenhoff ganeti3002 firmware updated for nic, bios, and idrac. reimaged and ready for next one after you juggle this back into service =] [17:42:51] (03CR) 10AOkoth: [C: 03+1] vrts: rename ferm services from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802853 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:43:39] (03CR) 10Dzahn: [C: 03+2] vrts: rename ferm services from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802853 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:44:17] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye [17:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:25] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed... 
[17:44:34] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye [17:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:41] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye [17:46:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1064.eqiad.wmnet with reason: host reimage [17:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:14] (03PS1) 10Dduvall: group0 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804403 (https://phabricator.wikimedia.org/T308068) [17:48:16] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye [17:48:16] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804403 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [17:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:24] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed... 
[17:48:28] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye [17:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:36] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye [17:49:39] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804403 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [17:49:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1064.eqiad.wmnet with reason: host reimage [17:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:34] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.15 refs T308068 [17:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:38] T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068 [17:54:27] (03Abandoned) 10Dduvall: Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall) [17:54:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:55:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:56:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:34] (03CR) 10AOkoth: [C: 03+1] mx: rename OTRS database related variables [puppet] - 10https://gerrit.wikimedia.org/r/802854 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [17:57:42] (03CR) 10Dzahn: [C: 03+2] mx: rename OTRS database related variables [puppet] - 10https://gerrit.wikimedia.org/r/802854 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [18:00:04] dduvall and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1800). [18:01:11] (03Restored) 10Dduvall: Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall) [18:01:35] (03PS2) 10Dduvall: Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 [18:03:04] (03CR) 10Dduvall: [C: 03+2] Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall) [18:04:15] !log dduvall@deploy1002 backport aborted: (duration: 00m 08s) [18:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:16] jouncebot: nowandnext [18:06:16] For the next 1 hour(s) and 53 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1800) [18:06:16] In 1 hour(s) and 53 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T2000) [18:10:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1064.eqiad.wmnet with OS bullseye [18:10:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:52] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1064.eqiad.wmnet with OS bullseye completed: - ms-be1064 (**PASS**) - Downtim... [18:13:00] (03PS1) 10Dzahn: prometheus:ops: rename otrs references to vrts [puppet] - 10https://gerrit.wikimedia.org/r/804416 (https://phabricator.wikimedia.org/T293942) [18:13:43] (03CR) 10Dduvall: [C: 03+2] "Approved via scap backport" [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall) [18:20:50] (03Merged) 10jenkins-bot: Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall) [18:21:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:22] !log dduvall@deploy1002 Started scap: Backport for [[gerrit:803922]] Truncate failed requests errors to 4kB [18:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:00] (03PS1) 10Andrew Bogott: Magnum: switch config to expect rabbit over TLS [puppet] - 10https://gerrit.wikimedia.org/r/804419 [18:26:09] (03CR) 10Andrew Bogott: [C: 03+2] Magnum: switch config to expect rabbit over TLS [puppet] - 10https://gerrit.wikimedia.org/r/804419 (owner: 10Andrew Bogott) [18:26:30] !log dduvall@deploy1002 Finished scap: Backport for [[gerrit:803922]] Truncate failed requests errors to 4kB (duration: 04m 08s) [18:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:55] (03PS1) 10Dduvall: group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804420 (https://phabricator.wikimedia.org/T308068) 
[18:26:57] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804420 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [18:27:44] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804420 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [18:28:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:28:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:31:33] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.15 refs T308068 [18:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:36] T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068 [18:34:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:07] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.15 refs T308068 (duration: 03m 34s) [18:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:42:40] (03PS1) 10Andrew Bogott: magnum: fill in the [keystone_auth] section [puppet] - 10https://gerrit.wikimedia.org/r/804423 [18:46:19] (03CR) 10Andrew Bogott: [C: 03+2] magnum: fill in the [keystone_auth] section [puppet] - 10https://gerrit.wikimedia.org/r/804423 (owner: 10Andrew Bogott) [18:46:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:46:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:36] (03PS1) 10Dduvall: all wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804425 (https://phabricator.wikimedia.org/T308068) [18:50:42] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804425 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [18:51:59] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804425 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [18:52:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:26] !log T309648 Copied newly built `wmf-elasticsearch-search-plugins` from stretch to bullseye (`root@apt1001:/home/ryankemper# reprepro copy bullseye-wikimedia stretch-wikimedia wmf-elasticsearch-search-plugins`); then ran `apt update` on `relforge*`; new plugin package showing as available now: `6.8.23-3~stretch 1001` [18:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:30] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [18:54:08] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation 
Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648 [18:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:00] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.15 refs T308068 [18:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:07] T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068 [18:57:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:27] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648 [18:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:30] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [19:02:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:02:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:50] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 137 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 140, active_shards: 140, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 137, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, num [19:02:50] n_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.54151624548736 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:04:52] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, [19:04:52] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:06:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:47] (03CR) 10Daniel Kinzler: "James said on Slack that it's fine. But it needs manual deployment. We can do that together next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793837 (owner: 10D3r1ck01) [19:17:25] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye [19:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:33] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed...
[19:21:32] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648 [19:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:39] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [19:21:41] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648 [19:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:22] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [19:24:22] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:24:28] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 140, active_shards: 280, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [19:24:28] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:25:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch 
instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:32:28] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:14] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:36:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:20] (03PS1) 10Andrew Bogott: keystone_auth: remove domain setting [puppet] - 10https://gerrit.wikimedia.org/r/804431 [19:39:14] (03PS2) 10Andrew Bogott: magnum keystone_auth: remove domain setting [puppet] - 10https://gerrit.wikimedia.org/r/804431 [19:40:37] (03CR) 10Andrew Bogott: [C: 03+2] magnum keystone_auth: remove domain setting [puppet] - 10https://gerrit.wikimedia.org/r/804431 (owner: 10Andrew Bogott) [19:43:52] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:40] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is OK: 0 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [19:46:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1014.mgmt.eqiad.wmnet with reboot policy FORCED [19:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:30] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is OK: 0 snaps in the admin project 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [19:46:44] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) @Krinkle Yep, that summary sounds right to me. That's wha... [19:47:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:34] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:38] RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is OK: 0 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [19:51:10] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:34] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:53] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:12] (03CR) 10Dave Pifke: [C: 03+1] Remove webperf1002/webperf2002 from Kafka firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/804334 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [19:58:26] (03CR) 10Dave Pifke: [C: 03+1] Switch old Stretch arclamp nodes to role::insetup until eventual decom 
[puppet] - 10https://gerrit.wikimedia.org/r/804341 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [19:58:37] (03CR) 10Dave Pifke: [C: 03+1] Remove rsync config only needed for stretch->bullseye migration [puppet] - 10https://gerrit.wikimedia.org/r/804339 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [19:58:58] (03CR) 10Dave Pifke: [C: 03+1] coal: Remove support for pre Bullseye installs [puppet] - 10https://gerrit.wikimedia.org/r/804340 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [19:59:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] brennen: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T2000). [20:00:05] mewoph and hauskatze: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] o/ [20:00:23] 👋 [20:00:27] howdy! I can deploy today. 
[20:01:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1014.mgmt.eqiad.wmnet with reboot policy FORCED [20:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:19] train the trainers [20:01:31] (03CR) 10Thcipriani: [C: 03+2] Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse) [20:01:41] somebody has to :) [20:01:45] :) [20:02:54] (03CR) 10Thcipriani: [C: 03+2] kywiki: Add $wgSitename, $wgMetaNamespace & $wgMetaNamespaceTalk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803916 (https://phabricator.wikimedia.org/T309866) (owner: 10MarcoAurelio) [20:03:30] (03PS1) 10Nskaggs: Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437 [20:03:37] hauskatze: do I need to run namespacedupes after I sync your change? [20:03:39] Who needs training, if they do it for you? :) [20:03:42] (03Merged) 10jenkins-bot: kywiki: Add $wgSitename, $wgMetaNamespace & $wgMetaNamespaceTalk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803916 (https://phabricator.wikimedia.org/T309866) (owner: 10MarcoAurelio) [20:03:48] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic2053.codfw.wmnet [20:03:49] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host elastic2053.codfw.wmnet [20:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:02] thcipriani: let me see if it applies clean on mwdebug first [20:04:11] 1 or 2 by the way? 
[20:04:31] (03CR) 10CI reject: [V: 04-1] Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437 (owner: 10Nskaggs) [20:04:34] hauskatze: I only see one patch? [20:04:47] thcipriani: yep, but which mwdebug server are we using? [20:05:00] 1001 or 1002 ? [20:05:25] hauskatze: should be on mwdebug1002 now [20:05:32] checking [20:05:39] at least wgSiteName should be checkable [20:06:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:01] thcipriani: lgtm; namespaceDupes is to be run first *without* --fix (dryrun) [20:07:16] :) on it. [20:07:35] if you can Paste the output it'd be nice [20:07:51] I do try to be nice [20:08:30] maybe phaste can do it for you? [20:08:47] I never remember how to do that from the servers (IIRC there's a way [20:08:49] ) [20:08:52] I'll just copy and paste [20:08:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1015.mgmt.eqiad.wmnet with reboot policy FORCED [20:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:13] (03PS2) 10Nskaggs: Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437 [20:09:32] never used it myself but it might be something like mwscript [blah blah] | phaste [20:10:11] (03PS3) 10Nskaggs: Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437 [20:10:13] (03CR) 10jenkins-bot: Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437 (owner: 10Nskaggs) [20:10:37] hauskatze: https://phabricator.wikimedia.org/P29607 [20:10:44] checking [20:11:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:11:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:11:06] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:20] thcipriani: looks great, all links can be fixed by the script [20:11:28] looks like it [20:11:32] I think we can deploy and run the script afterwards with --fix [20:11:40] perfect, doing now [20:11:56] (03PS1) 10Ahmon Dancy: Add new dsh groups for beta [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) [20:11:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:11:58] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) [20:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:20] (03PS2) 10Ahmon Dancy: Add new dsh groups for beta [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) [20:13:22] (03PS2) 10Ahmon Dancy: scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) [20:13:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10nskaggs) Thanks for the explanation. I just want to make sure if not a cookbook, then a runbook at least to make it v...
[20:14:06] (03CR) 10Dzahn: Add new dsh groups for beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [20:14:27] (03CR) 10Ahmon Dancy: Add new dsh groups for beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [20:14:45] (03CR) 10Andrew Bogott: [C: 03+2] Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437 (owner: 10Nskaggs) [20:15:34] hrm, php-fpm check-and-restart taking a while [20:16:12] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:803916|kywiki: Add $wgSitename, $wgMetaNamespace & $wgMetaNamespaceTalk (T309866)]] (duration: 03m 36s) [20:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:17] T309866: Localisation of the namespaces in the Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T309866 [20:17:30] (03PS3) 10Ahmon Dancy: Add new dsh groups for beta [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) [20:17:32] (03PS3) 10Ahmon Dancy: scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) [20:18:01] there we go, running namespacedupes --fix now hauskatze [20:18:13] !log mwmaint1002:mwscript namespaceDupes.php kywiki --fix [20:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:27] (03CR) 10Ahmon Dancy: Add new dsh groups for beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [20:18:27] thanks :) [20:20:33] updated the same paste with the output: tl;dr: 276 fixed [20:21:24] dancy: are we now forcing restarts? 
[20:21:32] checking [20:21:59] thcipriani: Yes, always restart is enabled now [20:22:08] so it takes about 3 minutes to complete restarts. [20:22:30] thcipriani: output looks good to me [20:22:39] hauskatze: nice, thanks for checking :) [20:22:41] unless you feel otherwise? [20:22:50] nope, lgtm, too [20:22:55] great [20:23:22] I'll make a note in the task that the other translations for Scribunto and Gadgets and MWCore will take up to one week to display over there [20:23:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1015.mgmt.eqiad.wmnet with reboot policy FORCED [20:23:27] dancy: I note there was a spike in errors with the last deploy, looks like jobrunners at a glance—expected? [20:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:39] and the l10nupdate script will take care of updating everything iirc [20:23:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1016.mgmt.eqiad.wmnet with reboot policy FORCED [20:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:02] thcipriani: jobrunners are excluded from restarts, [20:24:03] (03CR) 10Dzahn: [C: 03+1] "+1 to the key names. this matches prod and they were missing. Can't speak for the actual machine names but ok to merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [20:25:41] (03Merged) 10jenkins-bot: Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse) [20:26:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster [20:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster [20:26:52] hrm, no, it's not jobrunners, it's mw hosts, but it's doing something with RunSingleJob [20:27:25] seems to have stopped, but that was a prolonged wave :\ [20:28:13] (03Abandoned) 10Dzahn: delete expired ldap-corp certificates [puppet] - 10https://gerrit.wikimedia.org/r/791677 (owner: 10Dzahn) [20:28:41] (03PS2) 10Dzahn: Revert "Revert "phabricator: allow disabling ssh-phab service except on one host"" [puppet] - 10https://gerrit.wikimedia.org/r/778243 [20:29:31] (03CR) 10Ahmon Dancy: Add new dsh groups for beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [20:29:33] (03CR) 10Krinkle: [C: 03+1] "Tested in beta" [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [20:30:24] mewoph: still around? (sorry for delay) [20:30:40] thcipriani: yes! 
no problem [20:31:41] mewoph: your change is on mwdebug1002, check please [20:32:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:46] (03CR) 10Dzahn: [C: 03+2] Add new dsh groups for beta [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [20:32:50] thcipriani: lgtm [20:33:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:33:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:34] thanks for deploying my patch thcipriani - always a pleasure :) [20:34:50] hauskatze: sure thing, and likewise :) [20:35:04] mewoph: cool, thanks for checking, syncing [20:35:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [20:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [20:36:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1144.eqiad.wmnet with OS buster [20:36:05] (03CR) 10Dzahn: "You might need 'profile::swift::accounts_keys' too. 
see https://puppet-compiler.wmflabs.org/pcc-worker1003/35803/deployment-webperf22.dep" [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [20:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster [20:36:22] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:31] thcipriani: thank you! [20:37:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS buster [20:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster [20:38:21] (03Abandoned) 10MewOphaswongse: Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803971 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse) [20:38:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster [20:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install 
an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster [20:39:15] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) [20:39:31] (03CR) 10Dzahn: [C: 03+1] "works on deployment-deploy03.deployment-prep and deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud. error above just affects w" [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [20:39:52] !log thcipriani@deploy1002 Synchronized php-1.39.0-wmf.15/extensions/GrowthExperiments/modules: Backport: [[gerrit:803969|Suggested edits: Fix loading states when fetching additional tasks (T309926)]] (duration: 03m 37s) [20:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:55] T309926: Suggested edits: edits browsing bug - https://phabricator.wikimedia.org/T309926 [20:39:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1016.mgmt.eqiad.wmnet with reboot policy FORCED [20:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:06] mewoph: ^ should be live now, and you're welcome :) [20:40:10] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1142.eqiad.wmnet with OS buster [20:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster exec... 
[20:40:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:43:52] !log end utc late backport window [20:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:04] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) >>! In T269328#7992477, @ayounsi wrote: > The topic came back again Today as hosts requests in T307641 got provisioned without the additional IPs requiring heavy manual work to get it fi... [20:44:46] (03CR) 10Ahmon Dancy: [V: 03+1] "Tested in beta. Works now that the groups are set up." [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [20:44:58] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye [20:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:04] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster [20:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:07] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye [20:45:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was 
started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster [20:46:38] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1144.eqiad.wmnet with OS buster [20:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster exec... [20:46:50] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1146.eqiad.wmnet with OS buster [20:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster exec... [20:47:00] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1145.eqiad.wmnet with OS buster [20:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster exec... 
[20:49:40] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye [20:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:48] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed... [20:52:06] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) Also, via https://cassandra.apache.org/_/blog/Configurable-Storage-Ports-and-Why-We-Need-Them.html: > ### How Do My Other Cassandra Nodes Know About Different storage_port Settings? > !... [20:52:16] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1143.eqiad.wmnet with OS buster [20:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster exec... 
[20:56:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1142.eqiad.wmnet with reason: host reimage [20:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:57] (03PS4) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) [20:59:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1142.eqiad.wmnet with reason: host reimage [20:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:49] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) And to address @elukey's observation about what the docs say (updated url for that is now [[ https://cassandra.apache.org/doc/trunk/cassandra/configuration/cass_yaml_file.html#seed_provi... [21:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:09:06] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye [21:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:14] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye [21:09:57] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:10:58] 10SRE, 10Discovery-Search (Current work), 
10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) On elastic2053.codfw.wmnet: - Updated iDRAC to firmware 5.10.10.00 (took 2 updates, first to 3.30.30) - Updated NIC firmware to... [21:12:24] (03PS1) 10BCornwall: Traffic Add alert for Varnish child restart [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) [21:12:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1142.eqiad.wmnet with OS buster [21:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster comp... [21:13:25] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye [21:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:32] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed... [21:13:59] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) Reimaging failed again, will try again with a different host when work resumes (maybe next week?) 
[21:16:37] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:25:05] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:15] (03CR) 10Dzahn: [C: 03+2] scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [21:34:04] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) 
exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) 05Stalled→03Open [21:34:09] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1041.eqiad.wmnet [21:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:11] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:36:41] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:38:09] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1041.eqiad.wmnet [21:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:29] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:25] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) I talked with Jesse about all this. We agreed I will follow-up about the last few things, you Faidon, also mentioned in our mail. cpt-leads@, techchom... [21:50:45] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) 05Open→03In progress [22:03:45] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) Ok, so I do need to walk some of this back. It //is// now possible in 4.x (the documentation is correct in that context), thanks to [[ https://issues.apache.org/jira/browse/CASSANDRA-75... 
[22:11:03] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:20:47] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:36:07] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:02:06] (03PS1) 10Zabe: httpbb: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804465 (https://phabricator.wikimedia.org/T308013) [23:02:08] (03PS1) 10Zabe: galera: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804466 (https://phabricator.wikimedia.org/T308013) [23:02:10] (03PS1) 10Zabe: fifo_log_demux: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804467 (https://phabricator.wikimedia.org/T308013) [23:02:12] (03PS1) 10Zabe: external_proxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804468 (https://phabricator.wikimedia.org/T308013) [23:02:14] (03PS1) 10Zabe: external_clouds_vendors: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804469 (https://phabricator.wikimedia.org/T308013) [23:02:16] (03PS1) 10Zabe: eventschemas: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804470 (https://phabricator.wikimedia.org/T308013) [23:02:18] (03PS1) 10Zabe: etcdmirror: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804471 (https://phabricator.wikimedia.org/T308013) [23:02:20] (03PS1) 10Zabe: envoyproxy: Add 
SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804472 (https://phabricator.wikimedia.org/T308013) [23:02:22] (03PS1) 10Zabe: dumpsuser: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804473 (https://phabricator.wikimedia.org/T308013) [23:02:24] (03PS1) 10Zabe: docker_registry_ha: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804474 (https://phabricator.wikimedia.org/T308013) [23:02:27] (03PS1) 10Zabe: docker_pusher: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804475 (https://phabricator.wikimedia.org/T308013) [23:02:29] (03PS1) 10Zabe: docker_pkg: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804476 (https://phabricator.wikimedia.org/T308013) [23:02:31] (03PS1) 10Zabe: cpufrequtils: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804477 (https://phabricator.wikimedia.org/T308013) [23:02:33] (03PS1) 10Zabe: conntrackd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/804478 (https://phabricator.wikimedia.org/T308013) [23:16:27] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [23:22:08] (03CR) 10Legoktm: [C: 03+1] "I used cumin to verify that all hosts already have cgroup-tools installed, so this is a no-op." 
[puppet] - 10https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: 10Samtar) [23:22:15] (03PS4) 10Legoktm: mediawiki: Use non-transitional cgroups package for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: 10Samtar) [23:25:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:30:10] (03CR) 10Legoktm: [C: 03+2] mediawiki: Use non-transitional cgroups package for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: 10Samtar) [23:31:04] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: Package 'cgroup-bin' has no installation candidate on Debian 11 (modules/mediawiki/manifests/cgroup.pp) - https://phabricator.wikimedia.org/T309449 (10Legoktm) 05Open→03Resolved [23:35:29] (03CR) 10Legoktm: [C: 04-1] docker_registry_ha: Add SPDX headers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/804474 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [23:35:33] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [23:38:53] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:43:43] !issync [23:43:44] Syncing #wikimedia-operations (requested by legoktm) [23:43:46] Set /cs flags #wikimedia-operations topranks +Aiotv [23:43:48] Set /cs flags #wikimedia-operations rzl +Aiotv [23:44:12] 
thanks! [23:44:51] yw :) [23:45:21] thanks for the +2 legoktm :) [23:48:55] yw too! [23:54:18] (03PS6) 10Eevans: Configure AQS Cassandra hosts (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)