[00:01:24] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [00:01:44] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [00:05:22] PROBLEM - Check systemd state on miscweb2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:42] PROBLEM - Check systemd state on miscweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:59] (03PS1) 10Brennen Bearnes: tag-release.sh: remove submodule force-push [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/803621 (https://phabricator.wikimedia.org/T309910) [00:29:48] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-05-31 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:30:47] (03PS7) 10Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) [00:30:49] (03PS6) 10Tim Starling: Clean up scap sequencing workaround [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 [00:30:52] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-05-31 00:00:01 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:31:39] (03CR) 10CI reject: [V: 04-1] Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [00:32:30] (03CR) 10Tim Starling: "In PS7 I made it so that in the secondary data center, connecting to the x2 local master will not use SSL. Configuration overhead increase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [00:33:59] (03PS8) 10Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) [00:34:01] (03PS7) 10Tim Starling: Clean up scap sequencing workaround [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 [00:36:33] (03CR) 10Tim Starling: "PS8: phpcs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [00:56:43] (03CR) 10Brennen Bearnes: "Pushed as part of cleanup of tagging." 
[phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/803621 (https://phabricator.wikimedia.org/T309910) (owner: 10Brennen Bearnes) [01:01:06] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-06-07 00:00:01 (3105 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:06:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:15:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:15:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:20:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:20:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48248 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:33:56] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host clouddumps1001.wikimedia.org [01:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:54:06] 10SRE-Access-Requests, 10Research: Requesting access to 
analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (10leila) [01:55:32] 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (10leila) This request is approved on my end. (Please note that I'm not sure if other than `analytics-privatedata-users` whether Bruno needs access to another group... [01:57:54] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:04:48] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-06-07 00:00:01 (3105 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:09:28] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:12:38] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:41:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:16:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:22:32] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:27:02] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.069 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:28:00] 10SRE, 10ops-eqiad, 10DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (10wiki_willy) Looks like rising gas prices contributed to the higher freight charges. The 3x freight options would be with OSF: $3716.76, Pegasus: $5777.78, and Hollander: $4235.89, so we'll just go ahead... 
[03:31:26] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:52:14] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:36:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:40:59] (03PS1) 10Ebernhardson: Use a custom repository-s3 snapshot [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/803628 (https://phabricator.wikimedia.org/T309648) [05:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:16:22] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:20:38] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:21:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:27:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance [05:27:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance [05:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T310011)', diff saved to https://phabricator.wikimedia.org/P29490 and previous config saved to /var/cache/conftool/dbconfig/20220608-052745-marostegui.json [05:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:49] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [05:32:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T310011)', diff saved to https://phabricator.wikimedia.org/P29491 and previous config saved to /var/cache/conftool/dbconfig/20220608-053201-marostegui.json [05:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P29492 and previous config saved to /var/cache/conftool/dbconfig/20220608-054706-marostegui.json [05:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:13] (03PS1) 10Marostegui: db1143: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/803742 (https://phabricator.wikimedia.org/T310114) [05:47:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143 for migration to 10.6 T310114', diff saved to 
https://phabricator.wikimedia.org/P29493 and previous config saved to /var/cache/conftool/dbconfig/20220608-054718-root.json [05:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:22] T310114: Migrate a s4 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T310114 [05:48:46] (03CR) 10Marostegui: [C: 03+2] db1143: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/803742 (https://phabricator.wikimedia.org/T310114) (owner: 10Marostegui) [06:00:08] (03PS1) 10Marostegui: db1143: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/803743 (https://phabricator.wikimedia.org/T310114) [06:01:11] (03CR) 10Marostegui: [C: 03+2] db1143: Install mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/803743 (https://phabricator.wikimedia.org/T310114) (owner: 10Marostegui) [06:02:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P29494 and previous config saved to /var/cache/conftool/dbconfig/20220608-060211-marostegui.json [06:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:58] (03PS1) 10KartikMistry: Add explicit dependency to oojs RL module [extensions/UniversalLanguageSelector] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803536 (https://phabricator.wikimedia.org/T309793) [06:17:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T310011)', diff saved to https://phabricator.wikimedia.org/P29495 and previous config saved to /var/cache/conftool/dbconfig/20220608-061717-marostegui.json [06:17:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [06:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet 
with reason: Maintenance [06:17:21] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [06:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29496 and previous config saved to /var/cache/conftool/dbconfig/20220608-061724-marostegui.json [06:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:18] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:20:22] * kart_ updating cxserver. [06:21:06] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-05-31-123738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [06:22:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29497 and previous config saved to /var/cache/conftool/dbconfig/20220608-062245-marostegui.json [06:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:49] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [06:25:13] (03Merged) 10jenkins-bot: Update cxserver to 2022-05-31-123738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [06:27:22] !log kartik@deploy1002 helmfile [staging] START 
helmfile.d/services/cxserver: apply [06:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:59] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:23] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:22] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:33] OK. Looks like that fails.. [06:36:10] marostegui: What can be reason for `{"status":500,"type":"internal_error","title":"Error","detail":"connect ECONNREFUSED 127.0.0.1:3306","method":"GET","uri":"/v2/suggest/sections/Gujarat/en/gu"}` Ref: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663 [06:37:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P29498 and previous config saved to /var/cache/conftool/dbconfig/20220608-063751-marostegui.json [06:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:59] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. 
returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [06:39:11] ah. I'll revert. [06:39:19] kart_: yeah not sure about that [06:39:48] (03PS1) 10KartikMistry: Revert "Update cxserver to 2022-05-31-123738-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/803537 [06:40:52] marostegui: m5 not accessible by cxserver ie something with network policy in my patch. [06:42:10] (03PS1) 10Ayounsi: Homer: add REQUESTS_CA_BUNDLE for new Netbox endpoint [puppet] - 10https://gerrit.wikimedia.org/r/803858 (https://phabricator.wikimedia.org/T296452) [06:43:36] (03CR) 10Ayounsi: [C: 03+2] Homer: add REQUESTS_CA_BUNDLE for new Netbox endpoint [puppet] - 10https://gerrit.wikimedia.org/r/803858 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [06:44:35] (03CR) 10KartikMistry: [C: 03+2] Revert "Update cxserver to 2022-05-31-123738-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/803537 (owner: 10KartikMistry) [06:44:38] kart_: I guess firewalls? [06:46:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:46:46] I'm not sure how to handle that. akosiaris can you look when around? 
[06:46:59] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [06:47:37] (03Merged) 10jenkins-bot: Revert "Update cxserver to 2022-05-31-123738-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/803537 (owner: 10KartikMistry) [06:47:44] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/803253 (owner: 10L10n-bot) [06:48:26] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:57] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:39] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:54] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P29499 and previous config saved to /var/cache/conftool/dbconfig/20220608-065256-marostegui.json [06:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:07] Reverted patch and deployed, but seems now sqlite DB can't be open by cxserver. That's strange! [06:55:32] `SQLITE_CANTOPEN: unable to open database file` at: https://cxserver.wikimedia.org/v2/suggest/sections/Zakir_Hussain_(musician)/en/ml [07:00:04] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
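[editor's note] The `SQLITE_CANTOPEN: unable to open database file` error reported at 06:55:32 is what SQLite returns when it cannot open or create the file at the configured path, for instance when the parent directory does not exist in the container. The log does not establish that this is the cause here. The following is only a minimal Python sketch of that failure mode; `try_open` and the paths are hypothetical and are not part of cxserver, which is a Node.js service.

```python
import os
import sqlite3
import tempfile


def try_open(path: str) -> str:
    """Open a SQLite database at `path` and report whether it worked."""
    try:
        conn = sqlite3.connect(path)
        conn.execute("SELECT 1")  # force the database to actually be used
        conn.close()
        return "ok"
    except sqlite3.OperationalError as exc:
        # A missing parent directory or missing permissions ends up here.
        return f"failed: {exc}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Existing directory: SQLite creates the file on demand.
        print(try_open(os.path.join(tmp, "suggestions.db")))
        # Hypothetical missing directory: SQLite cannot open the file.
        print(try_open(os.path.join(tmp, "missing", "suggestions.db")))
```

Running the equivalent check from inside the affected pod against the configured database path would show whether the file is present and readable after the revert.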
[07:02:03] I'm here and I'll need sticker from previous deployment of cxserver :D [07:05:13] (03CR) 10KartikMistry: [C: 03+2] Add explicit dependency to oojs RL module [extensions/UniversalLanguageSelector] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803536 (https://phabricator.wikimedia.org/T309793) (owner: 10KartikMistry) [07:05:22] ^ will deploy this. [07:07:12] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [07:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:15] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [07:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:37] ^ was testing if I've deployed properly or not. [07:08:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29500 and previous config saved to /var/cache/conftool/dbconfig/20220608-070801-marostegui.json [07:08:02] akosiaris: I've reverted patch, but seems few config is not updated. What can be reason(s)? 
[07:08:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance [07:08:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance [07:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:07] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T310011)', diff saved to https://phabricator.wikimedia.org/P29501 and previous config saved to /var/cache/conftool/dbconfig/20220608-070809-marostegui.json [07:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:49] kart_: From which server would you connect from? 
[07:14:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T310011)', diff saved to https://phabricator.wikimedia.org/P29502 and previous config saved to /var/cache/conftool/dbconfig/20220608-071430-marostegui.json [07:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:35] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:16:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:16:18] marostegui: cxserver from Production: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663/8/helmfile.d/services/cxserver/values.yaml#85 was updated there. [07:17:09] kart_: Yeah I mean if you have a hostname for me to test the connection manually [07:18:31] !log imported openjdk 8u332-ga-1~deb11u1 to apt.wikimedia.org/bullseye-wikimedia (rebuild of latest Java security fixes for Bullseye) [07:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:59] marostegui: no idea if cxserver pods can be tested. [07:19:18] marostegui: as service runs on deployment-charts with docker [07:19:57] kart_: Ah ok, let's see if akosiaris can help here then :) [07:20:23] My another issue is - why config revert is not reflected after deployment :/ [07:20:33] marostegui: yeah, will wait for him. [07:20:44] kart_: I am reviewing the DB and the grants just in case [07:20:58] marostegui: OK. Thanks! 
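[editor's note] The manual connection test discussed above (07:17:09) can be sketched as a plain TCP connect that separates "refused" from "timed out". This is a hedged illustration only: `can_connect` is a hypothetical helper, the host and port in the usage line are taken from the `ECONNREFUSED 127.0.0.1:3306` error quoted at 06:36:10, and the real m5 hostname does not appear in this log. As a general rule, a refusal means something answered and rejected the connection, while a timeout is more typical of traffic being silently dropped by a firewall or network policy.

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> Tuple[bool, str]:
    """Attempt a TCP connection and return (success, short reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on that port.
        return False, "refused"
    except socket.timeout:
        # No answer at all within `timeout` seconds.
        return False, "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return False, f"error: {exc}"


if __name__ == "__main__":
    # Address taken from the error message in the log; adjust as needed.
    print(can_connect("127.0.0.1", 3306))
```

Note that the quoted error shows cxserver connecting to 127.0.0.1, so it is also worth confirming which database host the deployed configuration actually points at before looking at firewall rules.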
[07:21:04] (03Merged) 10jenkins-bot: Add explicit dependency to oojs RL module [extensions/UniversalLanguageSelector] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803536 (https://phabricator.wikimedia.org/T309793) (owner: 10KartikMistry) [07:21:16] OK. Time to deploy another fix! [07:21:30] kart_: The grants are pretty wide in terms of allowed networks, so it is probably as you said, some firewall rules missing I guess [07:21:57] !log imported cassandra 3.11.13 to component/cassandradev T309878 [07:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:01] T309878: Import Debian package of Cassandra 3.11.13 as 'dev' version - https://phabricator.wikimedia.org/T309878 [07:27:46] marostegui: Thanks for checking! I'll need help for firewalls then.. Looking at other examples with m5 access. [07:29:16] kart_: Maybe moritzm can help, I recall he helped someone else with firewall accesses to misc clusters :) [07:29:23] Morning moritzm :p [07:29:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P29503 and previous config saved to /var/cache/conftool/dbconfig/20220608-072935-marostegui.json [07:29:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:20] !log kartik@deploy1002 Synchronized php-1.39.0-wmf.15/extensions/UniversalLanguageSelector/extension.json: Backport: [[gerrit:803536|Add explicit dependency to oojs RL module (T309793)]] (duration: 03m 31s) [07:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:23] T309793: Unexpected OOUI payload on page views (+70KB JS transfer size since 2022-04-14) - https://phabricator.wikimedia.org/T309793 [07:31:02] (03PS6) 10Slyngshede: logster::job 
migrate cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) [07:31:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1143 on s4 with small weight after installing 10.6 T310114', diff saved to https://phabricator.wikimedia.org/P29504 and previous config saved to /var/cache/conftool/dbconfig/20220608-073132-root.json [07:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:37] T310114: Migrate a s4 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T310114 [07:32:54] (03PS1) 10Marostegui: db1143: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/803862 (https://phabricator.wikimedia.org/T310114) [07:33:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:33:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:59] (03CR) 10Marostegui: [C: 03+2] db1143: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/803862 (https://phabricator.wikimedia.org/T310114) (owner: 10Marostegui) [07:35:22] kart_: sure, what needs access to where? [07:36:54] moritzm: cxserver access to m5 hosted cxserverdb. [07:37:09] moritzm: see: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663 [07:37:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:39] moritzm: specially: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663/8/helmfile.d/services/cxserver/values.yaml sets network policy. 
[07:39:14] that's some k8s specific configuration knob, not familiar with it, this will need someone from service SRE to have a look into [07:40:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1055.eqiad.wmnet with OS bullseye [07:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:33] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1055.eqiad.wmnet with OS bullseye [07:40:35] moritzm: OK! [07:41:34] moritzm: Also, any idea why my revert of deployment-charts patch not reflected yet? Config still can't find sqlite DB, while configuration is reverted and deployed. [07:42:39] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:44:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2004.codfw.wmnet [07:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P29505 and previous config saved to /var/cache/conftool/dbconfig/20220608-074440-marostegui.json [07:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:00] likewise, this will need some help from service SRE [07:46:06] OK! 
[07:46:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2004.codfw.wmnet [07:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:43] !log adding additional disk for /srv to webperf1004 T305460 [07:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:46] T305460: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 [07:52:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1055.eqiad.wmnet with reason: host reimage [07:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "Agreed re: running in codfw, LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/803586 (https://phabricator.wikimedia.org/T301017) (owner: 10Cwhite) [07:56:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1055.eqiad.wmnet with reason: host reimage [07:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:43] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35790/console" [puppet] - 10https://gerrit.wikimedia.org/r/803553 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:59:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T310011)', diff saved to https://phabricator.wikimedia.org/P29506 and previous config saved to /var/cache/conftool/dbconfig/20220608-075947-marostegui.json [07:59:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:59:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:59:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:53] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:00] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: generate per-service TCP blackbox module [puppet] - 10https://gerrit.wikimedia.org/r/803553 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:01:34] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: set SNI for ldap-ro [puppet] - 10https://gerrit.wikimedia.org/r/803554 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:03:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance [08:03:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance [08:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29507 and previous config saved to /var/cache/conftool/dbconfig/20220608-080358-marostegui.json [08:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:00] (03PS1) 10Ayounsi: Monitoring: don't check for BGP on cloudsw2 [puppet] - 10https://gerrit.wikimedia.org/r/803866 [08:09:26] (03PS2) 10Ayounsi: Monitoring: don't check for BGP on cloudsw2 [puppet] - 10https://gerrit.wikimedia.org/r/803866 [08:10:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T310011)', diff saved to 
https://phabricator.wikimedia.org/P29508 and previous config saved to /var/cache/conftool/dbconfig/20220608-081025-marostegui.json [08:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:29] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [08:13:28] (03CR) 10Ayounsi: "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloudsw2-c8-eqiad.mgmt&service=BGP+status" [puppet] - 10https://gerrit.wikimedia.org/r/803866 (owner: 10Ayounsi) [08:13:58] ACKNOWLEDGEMENT - BGP status on cloudsw2-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist ayounsi https://gerrit.wikimedia.org/r/c/operations/puppet/+/803866 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:13:58] ACKNOWLEDGEMENT - BGP status on cloudsw2-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist ayounsi https://gerrit.wikimedia.org/r/c/operations/puppet/+/803866 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:14:22] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [08:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:31] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [08:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:01] ^ That was me checking status on the eqiad for release. [08:21:04] (03CR) 10Filippo Giunchedi: "LGTM overall, thanks for tackling this! 
See inline" [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [08:22:49] (03CR) 10Filippo Giunchedi: [C: 03+2] tox: add formattercheck [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801643 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:22:52] (03CR) 10Filippo Giunchedi: [C: 03+2] Export lag as a Gauge metric [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801645 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:22:55] (03CR) 10Filippo Giunchedi: [C: 03+2] Run isort/black on the codebase [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801642 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:22:58] (03CR) 10Filippo Giunchedi: [C: 03+2] Use etcdmirror namespace for metrics [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801644 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:23:34] (03Merged) 10jenkins-bot: Run isort/black on the codebase [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801642 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:23:39] (03Merged) 10jenkins-bot: tox: add formattercheck [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801643 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:23:46] (03Merged) 10jenkins-bot: Use etcdmirror namespace for metrics [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801644 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:23:48] (03Merged) 10jenkins-bot: Export lag as a Gauge metric [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801645 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [08:25:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P29509 and previous config 
saved to /var/cache/conftool/dbconfig/20220608-082531-marostegui.json [08:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P29510 and previous config saved to /var/cache/conftool/dbconfig/20220608-084036-marostegui.json [08:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:41] !log jnuche@deploy1002 Installing scap version "4.9.0" for 540 hosts [08:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:01] !log jnuche@deploy1002 Installation of scap version "4.9.0" completed for 540 hosts [08:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1055.eqiad.wmnet with OS bullseye [08:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:49] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1055.eqiad.wmnet with OS bullseye completed: - ms-be1055 (**PASS**) - Downtim... 
[08:49:19] (03PS1) 10KartikMistry: Update cxserver to 2022-05-31-045829-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/803869 (https://phabricator.wikimedia.org/T273505) [08:50:10] (03PS1) 10Filippo Giunchedi: Merge tag 'upstream/0.0.7' into debian [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803870 [08:50:12] (03PS1) 10Filippo Giunchedi: New release 0.0.7-1 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803871 (https://phabricator.wikimedia.org/T309546) [08:50:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1056.eqiad.wmnet with OS bullseye [08:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:54] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1056.eqiad.wmnet with OS bullseye [08:55:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29511 and previous config saved to /var/cache/conftool/dbconfig/20220608-085541-marostegui.json [08:55:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [08:55:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [08:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:46] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [08:55:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:55:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T310011)', diff saved to https://phabricator.wikimedia.org/P29512 and previous config saved to /var/cache/conftool/dbconfig/20220608-085554-marostegui.json [08:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:20] (03PS3) 10Muehlenhoff: Failover idp.w.o to idp1002 (new Bullseye node) [dns] - 10https://gerrit.wikimedia.org/r/802541 (https://phabricator.wikimedia.org/T308214) [08:59:40] * kart_ deploying cxserver to test old config issue. Let's see how it goes now.. 
[09:00:23] (03CR) 10Muehlenhoff: [C: 03+2] Failover active IDP nodes to idp1002/idp2002 [puppet] - 10https://gerrit.wikimedia.org/r/802542 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [09:01:31] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-05-31-045829-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/803869 (https://phabricator.wikimedia.org/T273505) (owner: 10KartikMistry) [09:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:02:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T310011)', diff saved to https://phabricator.wikimedia.org/P29513 and previous config saved to /var/cache/conftool/dbconfig/20220608-090201-marostegui.json [09:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:06] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [09:03:27] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1056.eqiad.wmnet with reason: host reimage [09:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:12] (03CR) 10Elukey: "Ah snap sorry! Thanks for the follow up!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/803549 (https://phabricator.wikimedia.org/T302198) (owner: 10Cathal Mooney) [09:04:47] (03Merged) 10jenkins-bot: Update cxserver to 2022-05-31-045829-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/803869 (https://phabricator.wikimedia.org/T273505) (owner: 10KartikMistry) [09:06:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1056.eqiad.wmnet with reason: host reimage [09:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:30] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:05] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:04] RECOVERY - DPKG on deneb is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:09:50] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [09:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:37] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [09:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:48] (03PS1) 10Slyngshede: aptrepo::repo allow notification subject to be changed. [puppet] - 10https://gerrit.wikimedia.org/r/803872 [09:12:55] ah eqiad diff still shows 2 patches behind! 
[09:13:02] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35793/console" [puppet] - 10https://gerrit.wikimedia.org/r/803872 (owner: 10Slyngshede) [09:13:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1143 on s4 with small weight after installing 10.6 T310114', diff saved to https://phabricator.wikimedia.org/P29514 and previous config saved to /var/cache/conftool/dbconfig/20220608-091331-root.json [09:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:36] T310114: Migrate a s4 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T310114 [09:13:36] (03CR) 10Filippo Giunchedi: [C: 04-1] "Blocked on I425d869085" [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:17:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P29515 and previous config saved to /var/cache/conftool/dbconfig/20220608-091706-marostegui.json [09:17:07] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [09:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:57] kart_: o/ [09:18:03] I am around now, what is the issue? 
[09:18:05] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [09:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:10] akosiaris: weird issues :) [09:19:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1056.eqiad.wmnet with OS bullseye [09:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:34] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1056.eqiad.wmnet with OS bullseye completed: - ms-be1056 (**PASS**) - Downtim... [09:19:45] akosiaris: Can you run `helmfile -e eqiad status` and then `helmfile -e codfw status` and see why two show differences for release? [09:20:20] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:20:33] kart_: the revision you mean? 16 vs 18? [09:20:40] akosiaris: yes. [09:20:47] akosiaris: both should be on 18. [09:20:55] every deployment is a revision, so there have just been more deployments in codfw [09:20:57] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp.w.o to idp1002 (new Bullseye node) [dns] - 10https://gerrit.wikimedia.org/r/802541 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [09:21:05] it's normal that they diverge [09:21:33] akosiaris: ok. That solves first doubt. [09:22:19] what's the next one? [09:22:57] akosiaris: I deployed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663 and it couldn't connect to Database. So, I reverted it with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/803537 and deployed it. [09:23:17] akosiaris: although, now cxserver can't find old config it seems. [09:23:31] akosiaris: see: eg. 
https://cxserver.wikimedia.org/v2/suggest/sections/Gujarat/en/gu [09:23:36] ah, it's still using the new chart version, 0.1.2 [09:24:12] you can pin the chart version in helmfile to bypass that issue for now, but more importantly, why was it not able to connect to the database? [09:24:17] what was the error? [09:25:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/803872 (owner: 10Slyngshede) [09:25:32] akosiaris: Earlier error: `{"status":500,"type":"internal_error","title":"Error","detail":"connect ECONNREFUSED 127.0.0.1:3306","method":"GET","uri":"/v2/suggest/sections/Gujarat/en/gu"}` [09:26:16] akosiaris: how do I fix chart version as of now? Section Translation is broken without Sqlite DB as of now. [09:26:27] (Will note this down for future ref!) [09:27:35] akosiaris: marostegui checked for grant etc and it was OK. [09:29:49] (03PS1) 10Alexandros Kosiaris: cxserver: Pin chart version to stop the bleeding [deployment-charts] - 10https://gerrit.wikimedia.org/r/803873 [09:29:54] kart_: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/803873 [09:30:01] review, merge and deploy please :-) [09:30:09] Sure! [09:30:10] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the [09:30:10] ted status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [09:31:04] Else ^^ :/ [09:31:59] akosiaris: did I miss anything in network policy in earlier patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663 [09:32:09] (03PS1) 10Slyngshede: profile::aptrepo::wikimedia Wrapper script for reprepro. 
[puppet] - 10https://gerrit.wikimedia.org/r/803874 [09:32:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P29516 and previous config saved to /var/cache/conftool/dbconfig/20220608-093211-marostegui.json [09:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:17] kart_: the error is "connect ECONNREFUSED 127.0.0.1:3306" [09:33:31] so the config is probably wrong, it's trying to connect to localhost for some reason [09:33:48] they policy hadn't even begun to matter [09:33:50] the* [09:33:53] (03CR) 10Muehlenhoff: [C: 03+2] Update spec file to use new bullseye nodes [puppet] - 10https://gerrit.wikimedia.org/r/802543 (owner: 10Muehlenhoff) [09:34:20] (03CR) 10KartikMistry: [C: 03+2] cxserver: Pin chart version to stop the bleeding [deployment-charts] - 10https://gerrit.wikimedia.org/r/803873 (owner: 10Alexandros Kosiaris) [09:35:43] kart_: deploy to codfw and eqiad to stop the bleeding and then let's use the staging environment to figure out what happened. We we can debug with less pressure there [09:36:06] (03PS1) 10Jbond: C:apereo_cas: fix whitespace in config file [puppet] - 10https://gerrit.wikimedia.org/r/803875 [09:36:33] akosiaris: sure! [09:36:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35795/console" [puppet] - 10https://gerrit.wikimedia.org/r/803875 (owner: 10Jbond) [09:37:08] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35796/console" [puppet] - 10https://gerrit.wikimedia.org/r/803874 (owner: 10Slyngshede) [09:37:21] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[09:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:38] (03Merged) 10jenkins-bot: cxserver: Pin chart version to stop the bleeding [deployment-charts] - 10https://gerrit.wikimedia.org/r/803873 (owner: 10Alexandros Kosiaris) [09:38:59] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [09:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:34] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [09:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:16] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [09:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:45] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [09:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:48] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [09:40:52] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [09:41:16] nice [09:41:29] akosiaris: deployed. Thanks!! [09:41:51] akosiaris: Wondering why chart wasn't reverted with deployment-charts patch? Any specific reason? [09:43:10] kart_: helmfile will always pick the highest version that exists. So revert a chart version requires the pinning that we did above [09:43:19] reverting* [09:43:32] OK. Noting this down! 
[09:43:34] now more to the debugging aspect of it [09:44:28] in staging I see [09:44:30] sectionmapping: [09:44:30] database: cxserverdb [09:44:30] type: mysql [09:44:35] staging still runs 0.1.2 btw [09:44:48] so the config for some reason hasn't picked up the databases needed [09:46:52] !log btullis@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs: T309526 - btullis@cumin1001 [09:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T310011)', diff saved to https://phabricator.wikimedia.org/P29517 and previous config saved to /var/cache/conftool/dbconfig/20220608-094716-marostegui.json [09:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:20] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [09:47:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [09:47:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [09:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: Maintenance [09:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: Maintenance [09:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:33] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] aptrepo::repo allow notification subject to be 
changed. [puppet] - 10https://gerrit.wikimedia.org/r/803872 (owner: 10Slyngshede) [09:47:38] (03CR) 10Samtar: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/803877 (https://phabricator.wikimedia.org/T274359) (owner: 10Samtar) [09:47:50] (03PS2) 10Samtar: changeprop: Modify page denylist [deployment-charts] - 10https://gerrit.wikimedia.org/r/803877 (https://phabricator.wikimedia.org/T274359) [09:49:15] akosiaris: Also, host is set in per environments ie https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663/8/helmfile.d/services/cxserver/values-codfw.yaml [09:49:22] Is this OK? [09:49:33] yeah, that was always the idea [09:49:43] but note that we haven't set a password, have we? [09:49:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:49:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29518 and previous config saved to /var/cache/conftool/dbconfig/20220608-094952-marostegui.json [09:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:00] akosiaris: That's done by Amir1. [09:51:45] kart_: ah I see it on deploy1002, but it's the wrong section in the yaml file I think [09:51:58] !log btullis@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:aqs: T309526 - btullis@cumin1001 [09:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:29] akosiaris: ouch and probably not done for staging also? Is that OK? 
[09:53:30] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:53:30] yup, not done for staging and no, it's not ok. Let me fix that [09:54:02] ah. Yaml spacing :/ [09:55:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [09:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:44] (03PS1) 10Hnowlan: service: configure image-suggestion probes [puppet] - 10https://gerrit.wikimedia.org/r/803878 [09:56:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29519 and previous config saved to /var/cache/conftool/dbconfig/20220608-095635-marostegui.json [09:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:38] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [09:57:52] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:58:03] kart_: gonna create a new chart version to accomodate for all that, I 'll post a patch in ~10m [09:59:08] akosiaris: cool. Thanks! 
[10:02:53] !log btullis@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs100*: T309526 - btullis@cumin1001 [10:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:59] (03PS1) 10Alexandros Kosiaris: cxserver: Support sectionmapping config [deployment-charts] - 10https://gerrit.wikimedia.org/r/803882 [10:09:04] (03CR) 10JMeybohm: [C: 03+1] Merge tag 'upstream/0.0.7' into debian [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803870 (owner: 10Filippo Giunchedi) [10:09:15] (03CR) 10JMeybohm: [C: 03+1] New release 0.0.7-1 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803871 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [10:11:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P29520 and previous config saved to /var/cache/conftool/dbconfig/20220608-101140-marostegui.json [10:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:28] !log btullis@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching aqs100*: T309526 - btullis@cumin1001 [10:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:16] (03PS1) 10Muehlenhoff: profile::mariadb::ferm_misc: Remove old buster IDP nodes [puppet] - 10https://gerrit.wikimedia.org/r/803883 (https://phabricator.wikimedia.org/T308214) [10:16:57] (03PS2) 10Jbond: C:apereo_cas: Disable u2f by default [puppet] - 10https://gerrit.wikimedia.org/r/803875 (https://phabricator.wikimedia.org/T296629) [10:18:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35797/console" [puppet] - 10https://gerrit.wikimedia.org/r/803875 (https://phabricator.wikimedia.org/T296629) (owner: 10Jbond) [10:18:51] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, 
and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10hnowlan) This is pretty much done. We currently only have two main metrics for the service so there's a very ba... [10:20:00] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:20:40] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10hnowlan) 05Open→03Resolved a:03hnowlan [10:20:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803875 (https://phabricator.wikimedia.org/T296629) (owner: 10Jbond) [10:21:32] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:21:44] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:23:50] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [10:26:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P29521 and previous config saved to /var/cache/conftool/dbconfig/20220608-102645-marostegui.json [10:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:04] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 59359 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [10:28:12] akosiaris: Thanks. Looking at the patch.. [10:32:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM webperf1004.eqiad.wmnet [10:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:07] (03PS2) 10Alexandros Kosiaris: cxserver: Support sectionmapping config [deployment-charts] - 10https://gerrit.wikimedia.org/r/803882 [10:37:24] kart_: found a bug, fixed. patchset #2 look ok to me though [10:38:01] (03CR) 10Samtar: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/803886 (https://phabricator.wikimedia.org/T310133) (owner: 10Samtar) [10:38:22] (03PS1) 10Alexandros Kosiaris: cxserver: Remove the chart version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/803887 [10:38:42] I 've also uploaded the chart version pinning ^ [10:38:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM webperf1004.eqiad.wmnet [10:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:24] akosiaris: cool. [10:40:57] kart_: wanna give a +1 and try it out in staging ? [10:41:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:41:09] actually a +2, not a +1 [10:41:24] (03CR) 10KartikMistry: [C: 03+1] cxserver: Support sectionmapping config [deployment-charts] - 10https://gerrit.wikimedia.org/r/803882 (owner: 10Alexandros Kosiaris) [10:41:28] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:41:35] ah. 
[10:41:37] :) [10:41:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29522 and previous config saved to /var/cache/conftool/dbconfig/20220608-104150-marostegui.json [10:41:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:41:54] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [10:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:01] Should I also +2 on chart pinning? [10:42:20] (03CR) 10KartikMistry: [C: 03+2] cxserver: Support sectionmapping config [deployment-charts] - 10https://gerrit.wikimedia.org/r/803882 (owner: 10Alexandros Kosiaris) [10:42:26] kart_: leave that for after we 've tested in staging and deem my change ok [10:42:37] OK! [10:42:42] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:43:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1004.eqiad.wmnet [10:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:44] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:44:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1004.eqiad.wmnet [10:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:22] (03Merged) 10jenkins-bot: cxserver: Support sectionmapping config [deployment-charts] - 10https://gerrit.wikimedia.org/r/803882 (owner: 10Alexandros Kosiaris) [10:46:16] And, now I should deploy in staging, akosiaris? [10:46:46] kart_: 👍 [10:48:06] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [10:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:26] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [10:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:apereo_cas: Disable u2f by default [puppet] - 10https://gerrit.wikimedia.org/r/803875 (https://phabricator.wikimedia.org/T296629) (owner: 10Jbond) [10:50:44] akosiaris: done. give me few minutes, will be brb. 
[10:52:44] (03PS1) 10Volans: sre.swift.convert-ssds: fix logic to skip disks [cookbooks] - 10https://gerrit.wikimedia.org/r/803888 [10:53:45] kart_: I see curl https://staging.svc.eqiad.wmnet:4002/v2/suggest/sections/Gujarat/en/gu from deploy1002 works fine [10:54:15] so, I 'd say +2 the revert of the chart version pinning and proceed with eqiad/codfw [10:54:31] * akosiaris off for ~1h [10:55:33] (03CR) 10MVernon: [C: 03+1] "LGTM thanks :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/803888 (owner: 10Volans) [10:55:50] (03CR) 10Volans: [C: 03+2] sre.swift.convert-ssds: fix logic to skip disks [cookbooks] - 10https://gerrit.wikimedia.org/r/803888 (owner: 10Volans) [10:57:42] (03CR) 10Jbond: "lgtm couple of minor nits" [puppet] - 10https://gerrit.wikimedia.org/r/803874 (owner: 10Slyngshede) [10:58:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803883 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [10:59:28] (03Merged) 10jenkins-bot: sre.swift.convert-ssds: fix logic to skip disks [cookbooks] - 10https://gerrit.wikimedia.org/r/803888 (owner: 10Volans) [11:01:45] akosiaris: nice!! [11:02:40] PROBLEM - Check systemd state on ms-be1060 is CRITICAL: CRITICAL - degraded: The following units failed: rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:32] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-ssds for host ms-be1060 [11:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:56] (03CR) 10KartikMistry: [C: 03+2] cxserver: Remove the chart version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/803887 (owner: 10Alexandros Kosiaris) [11:04:32] (03PS2) 10Slyngshede: profile::aptrepo::wikimedia Wrapper script for reprepro. [puppet] - 10https://gerrit.wikimedia.org/r/803874 [11:04:36] akosiaris: and, I should deploy in staging also? [11:04:45] (03CR) 10Slyngshede: profile::aptrepo::wikimedia Wrapper script for reprepro. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/803874 (owner: 10Slyngshede) [11:06:50] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803874 (owner: 10Slyngshede) [11:07:07] (03Merged) 10jenkins-bot: cxserver: Remove the chart version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/803887 (owner: 10Alexandros Kosiaris) [11:07:51] (03CR) 10Slyngshede: [C: 03+2] profile::aptrepo::wikimedia Wrapper script for reprepro. [puppet] - 10https://gerrit.wikimedia.org/r/803874 (owner: 10Slyngshede) [11:10:09] oh that's facepalm :D [11:11:14] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:43] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:30] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:23] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:17:11] akosiaris: Thanks a lot. 
Main issue is solved, now API result is coming with unrelated data, that's separate issue to solve for developers I guess :) [11:20:23] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-ssds (exit_code=99) for host ms-be1060 [11:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:22:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29524 and previous config saved to /var/cache/conftool/dbconfig/20220608-112233-marostegui.json [11:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:37] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:23:44] (03PS7) 10Jbond: WIP: Early start on firmware cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [11:25:28] (03CR) 10Filippo Giunchedi: [C: 04-1] "Idea LGTM, see inline tho" [puppet] - 10https://gerrit.wikimedia.org/r/803878 (owner: 10Hnowlan) [11:26:34] (03CR) 10CI reject: [V: 04-1] WIP: Early start on firmware cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [11:26:59] (03PS2) 10Hnowlan: service: configure image-suggestion probes [puppet] - 10https://gerrit.wikimedia.org/r/803878 [11:27:32] (03CR) 10Hnowlan: service: configure image-suggestion probes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803878 (owner: 10Hnowlan) [11:27:52] (03CR) 10Filippo 
Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/803878 (owner: 10Hnowlan) [11:28:40] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:30:58] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:31:56] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/803866 (owner: 10Ayounsi) [11:33:39] (03CR) 10Hnowlan: [C: 03+2] service: configure image-suggestion probes [puppet] - 10https://gerrit.wikimedia.org/r/803878 (owner: 10Hnowlan) [11:33:49] 10SRE, 10ops-eqiad, 10DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (10faidon) I see on the list a EX4300-48T-AFI. That's likely a mistake -- it should not be that old (= old, but not 8 years old) and we have dozens of these still in production, so keeping it in our spares ma... 
[11:34:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29525 and previous config saved to /var/cache/conftool/dbconfig/20220608-113419-marostegui.json [11:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:25] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:36:44] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:38:18] (03PS2) 10Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 [11:39:15] (03PS1) 10Muehlenhoff: Switch idp1001/idp2001 to role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214) [11:43:40] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:45:20] (03PS1) 10MSantos: Re-enable OSM sync in codfw [puppet] - 10https://gerrit.wikimedia.org/r/803893 [11:46:34] marostegui: How can I access the m5-master database? It doesn't seem to be accessible via mwmaint and sql.php.
[11:47:11] marostegui: I need to know the data types of the columns for cxserverdb [11:49:14] kart_: mmm I don't think you can access it with those scripts [11:49:21] As those are MW related as far as I know [11:49:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P29526 and previous config saved to /var/cache/conftool/dbconfig/20220608-114924-marostegui.json [11:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:32] kart_: I can provide those though [11:50:12] marostegui: i.e. the result of `SHOW COLUMNS FROM titles;` on m5-master. You can DM me. [11:50:57] (03CR) 10Cathal Mooney: [C: 03+2] Monitoring: don't check for BGP on cloudsw2 [puppet] - 10https://gerrit.wikimedia.org/r/803866 (owner: 10Ayounsi) [11:51:01] !log installing django security updates [11:51:26] kart_: https://phabricator.wikimedia.org/P29527 [11:51:34] Let me see though if you can access the host yourself in some other way [11:51:55] !log jnuche@deploy1002 install-world aborted: (duration: 00m 02s) [11:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:08] !log jnuche@deploy1002 Installing scap version "4.9.1" for 540 hosts [11:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:27] !log jnuche@deploy1002 Installation of scap version "4.9.1" completed for 540 hosts [11:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:30] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [11:52:48] marostegui: Thanks!!
[11:52:54] RECOVERY - Check systemd state on ms-be1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:54] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done. [11:57:09] (03PS1) 10MSantos: add maps beta to dsh targets [puppet] - 10https://gerrit.wikimedia.org/r/803894 [11:59:33] (03CR) 10Muehlenhoff: class:apt Add new private repo. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede) [12:00:28] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:01:36] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-ssds for host ms-be1064 [12:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P29528 and previous config saved to /var/cache/conftool/dbconfig/20220608-120429-marostegui.json [12:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:07] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.swift.convert-ssds (exit_code=97) for host ms-be1064 [12:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:51] (03PS3) 10Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 [12:14:58] (03CR) 10CI reject: [V: 04-1] WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 (owner: 10Slyngshede) [12:19:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T310011)', diff saved to 
https://phabricator.wikimedia.org/P29529 and previous config saved to /var/cache/conftool/dbconfig/20220608-121934-marostegui.json [12:19:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:19:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:40] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [12:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29530 and previous config saved to /var/cache/conftool/dbconfig/20220608-121942-marostegui.json [12:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:36] (03PS4) 10Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 [12:22:30] (03CR) 10CI reject: [V: 04-1] WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 (owner: 10Slyngshede) [12:23:26] (03PS5) 10Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 [12:28:19] !log installing rsyslog security updates on Buster [12:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:29] (03PS6) 10Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 [12:29:17] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35799/console" [puppet] - 10https://gerrit.wikimedia.org/r/803506 (owner: 10Slyngshede) [12:33:19] (03PS7) 10Slyngshede: profile::aptrepo::wikimedia test public apt repo on Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 [12:33:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29531 and previous config saved to /var/cache/conftool/dbconfig/20220608-123320-marostegui.json [12:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:25] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [12:36:12] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10Papaul) [12:37:58] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:41:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:42:07] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Papaul) [12:47:40] (03PS3) 10Slyngshede: class:apt Add new private repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/803512 [12:48:12] (03PS1) 10KartikMistry: Update cxserver to 2022-06-08-124326-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/803901 (https://phabricator.wikimedia.org/T306995) [12:48:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P29532 and previous config saved to /var/cache/conftool/dbconfig/20220608-124825-marostegui.json [12:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:59] (03CR) 10Slyngshede: class:apt Add new private repo. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede) [12:56:32] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10ItamarWMDE) @Addshore Does it mean we need to then de-abandon that change, or should we just create a new patch to r... [12:59:31] (03CR) 10Jelto: "thanks for preparing the cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/802846 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1300). [13:00:05] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:35] * urbanecm waves [13:00:40] Lucas_WMDE: i guess you'll self-serve? 
[13:00:45] yup [13:00:54] * Lucas_WMDE looks up what I had scheduled ^^ [13:01:04] ah yes [13:01:12] the big scary ’un [13:01:37] (socially scary, not technically scary – little risk of the site going down :D) [13:01:38] (03PS22) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [13:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:02:34] (03CR) 10CI reject: [V: 04-1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [13:02:41] (03CR) 10Filippo Giunchedi: "I think the latest PS is good to merge, thank you John for your patience and assistance!" [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [13:03:18] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [13:03:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P29533 and previous config saved to /var/cache/conftool/dbconfig/20220608-130330-marostegui.json [13:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:48] (03PS4) 10Lucas Werkmeister (WMDE): Refresh English Wikipedia logo file (enwiki.png) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) [13:06:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "one +1, no complaints here or on Phabricator, should be good to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) (owner: 10Lucas Werkmeister (WMDE)) [13:07:00] (03Merged) 10jenkins-bot: 
Refresh English Wikipedia logo file (enwiki.png) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) (owner: 10Lucas Werkmeister (WMDE)) [13:08:09] new enwiki.png looks good on mwdebug1001, syncing [13:09:25] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [13:10:43] is there a scap option to skip php-fpm-restart? [13:10:57] I doubt these restarts are actually needed when I’m syncing a YAML or PNG file [13:11:07] (03PS1) 10Filippo Giunchedi: sre: include tcp probes in alerts [alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946) [13:11:08] * urbanecm doesn't know of any [13:11:19] * urbanecm is also not happy that sync-file takes 3 times more than it used to be [13:11:27] ok [13:11:41] k8s will fix all of that ;) [13:12:35] !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:801405|Refresh English Wikipedia logo file (enwiki.png) (T309544)]] (1/3, no-op) (duration: 03m 32s) [13:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:40] T309544: enwiki.png slightly inconsistent with dewiki.png, enwiki-2x.png, dewiki-2x.png - https://phabricator.wikimedia.org/T309544 [13:12:48] (03CR) 10CI reject: [V: 04-1] sre: include tcp probes in alerts [alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:12:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [13:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:00] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:21] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2029.codfw.wmnet [13:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:16:32] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:801405|Refresh English Wikipedia logo file (enwiki.png) (T309544)]] (2/3, no-op) (duration: 03m 35s) [13:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:26] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:18:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29534 and previous config saved to /var/cache/conftool/dbconfig/20220608-131836-marostegui.json [13:18:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [13:18:40] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [13:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:42] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [13:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29535 and previous config saved to /var/cache/conftool/dbconfig/20220608-131844-marostegui.json [13:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:09] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2029.codfw.wmnet [13:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:26] !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/project-logos/enwiki.png: Config: [[gerrit:801405|Refresh English Wikipedia logo file (enwiki.png) (T309544)]] (3/3, needs subsequent purge) (duration: 03m 44s) [13:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:31] T309544: enwiki.png slightly inconsistent with dewiki.png, enwiki-2x.png, dewiki-2x.png - https://phabricator.wikimedia.org/T309544 [13:22:00] !log lucaswerkmeister-wmde@mwmaint1002:~$ echo 'https://en.wikipedia.org/static/images/project-logos/enwiki.png' | mwscript purgeList.php # T309544 [13:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:40] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:24:55] !log installing rsyslog security updates on Bullseye [13:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:44] !log UTC afternoon backport+config window done [13:25:44] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:11] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2030.codfw.wmnet [13:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:36] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:56] (03PS2) 10Filippo Giunchedi: sre: include tcp probes in alerts [alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946) [13:29:30] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:31:24] (03CR) 10Muehlenhoff: class:apt Add new private repo. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede) [13:32:07] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2030.codfw.wmnet [13:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29536 and previous config saved to /var/cache/conftool/dbconfig/20220608-133420-marostegui.json [13:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:26] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [13:35:56] 10SRE, 10ops-eqiad, 10DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (10Jclark-ctr) >>! In T307140#7988395, @faidon wrote: > I see on the list a EX4300-48T-AFI. That's likely a mistake -- it should not be that old (= old, but not 8 years old) and we have dozens of these still... [13:37:09] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2031.codfw.wmnet [13:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:23] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [13:37:54] !log installing apache-log4j1.2 security updates [13:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:50] (03PS5) 10Eevans: WIP: Configure AQS Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) [13:41:16] (03PS4) 10Slyngshede: class:apt Add new private repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/803512 [13:41:19] (03PS1) 10Eevans: Pin Cassandra 3.11.13 as 'dev' [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) [13:41:21] (03PS1) 10Lucas Werkmeister (WMDE): Use absolute namespace in Profiler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803904 (https://phabricator.wikimedia.org/T310155) [13:41:50] (03CR) 10Eevans: [C: 03+1] Pin Cassandra 3.11.13 as 'dev' [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [13:42:11] (03CR) 10CI reject: [V: 04-1] class:apt Add new private repo. [puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede) [13:42:40] (03PS5) 10Slyngshede: class:apt Add new private repo. [puppet] - 10https://gerrit.wikimedia.org/r/803512 [13:43:56] (03CR) 10Lucas Werkmeister (WMDE): "Or it could just be ServiceConfig::class, I suppose, since it’s the same namespace." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803904 (https://phabricator.wikimedia.org/T310155) (owner: 10Lucas Werkmeister (WMDE)) [13:44:53] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2031.codfw.wmnet [13:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:24] !log btullis@deploy1002 Started deploy [analytics/refinery@64ddb08]: Regular analytics weekly train [analytics/refinery@64ddb08] [13:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:40] (03CR) 10Jbond: [C: 03+1] Switch idp1001/idp2001 to role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [13:49:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P29537 and previous config saved to /var/cache/conftool/dbconfig/20220608-134925-marostegui.json [13:49:36] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:54] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2032.codfw.wmnet [13:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:43] !log volans@cumin2002 START - Cookbook sre.swift.convert-ssds for host ms-be1064 [13:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:15] (03CR) 10Muehlenhoff: [C: 03+2] backup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801631 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:54:50] !log volans@cumin2002 END (FAIL) - Cookbook sre.swift.convert-ssds (exit_code=99) for host ms-be1064 [13:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:48] (03PS2) 10Muehlenhoff: exim4: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803548 (https://phabricator.wikimedia.org/T308013) [13:56:07] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2032.codfw.wmnet [13:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:45] Hi all, is it ok if I make a last-minute addition to the backport window?
[13:57:25] I'm just going to be deploying a portal update for some fundraising banners https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/803552 [13:57:27] 10SRE, 10Traffic, 10Patch-For-Review: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 (10ssingh) [13:57:53] looks like there's nothing happening deployment wise right now [13:58:37] jan_drewniak: feel free to deploy [13:58:52] k thanks [13:58:59] (03PS2) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803552 (https://phabricator.wikimedia.org/T128546) [13:59:31] (03PS3) 10Ssingh: dnsdist: add support for retaining capabilites after startup [puppet] - 10https://gerrit.wikimedia.org/r/784270 [14:00:03] (03CR) 10Btullis: [C: 03+1] "I'm happy with this. Would you like me to +2 and merge?" [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [14:00:33] ACKNOWLEDGEMENT - MD RAID on ms-be1064 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T310160 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:00:37] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1064 - https://phabricator.wikimedia.org/T310160 (10ops-monitoring-bot) [14:00:54] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803552 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:01:08] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2033.codfw.wmnet [14:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:10] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803552 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:02:30] RECOVERY - SSH on 
wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:03:00] PROBLEM - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is CRITICAL: connect to address 10.64.32.189 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:03:28] PROBLEM - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is CRITICAL: connect to address 10.64.32.190 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:04:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P29538 and previous config saved to /var/cache/conftool/dbconfig/20220608-140430-marostegui.json [14:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:06] RECOVERY - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is OK: TCP OK - 0.000 second response time on 10.64.32.189 port 9042 https://phabricator.wikimedia.org/T93886 [14:05:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:06:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:17] !log volans@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet [14:06:18] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ms-be1064.eqiad.wmnet [14:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:07] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:09] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:803552| Bumping portals to master (T128546)]] (duration: 03m 30s) [14:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:12] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [14:07:12] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2033.codfw.wmnet [14:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:40] RECOVERY - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is OK: TCP OK - 0.000 second response time on 10.64.32.190 port 9042 https://phabricator.wikimedia.org/T93886 [14:09:23] !log btullis@deploy1002 Finished deploy [analytics/refinery@64ddb08]: Regular analytics weekly train [analytics/refinery@64ddb08] (duration: 22m 59s) [14:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:32] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:10:22] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:803552| Bumping portals to master (T128546)]] (duration: 03m 12s) [14:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: 
apply [14:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:14] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2034.codfw.wmnet [14:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:04] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:13:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:13:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:14] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:13:50] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:16:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:10] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2034.codfw.wmnet [14:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29539 and previous config saved to /var/cache/conftool/dbconfig/20220608-141936-marostegui.json [14:19:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:19:40] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [14:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:09] (03CR) 10Jbond: [C: 03+1] gdnsd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799307 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:22:48] !log btullis@deploy1002 Started deploy [analytics/refinery@64ddb08] (thin): Regular analytics weekly train THIN [analytics/refinery@64ddb08] [14:22:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:57] !log btullis@deploy1002 Finished deploy [analytics/refinery@64ddb08] (thin): Regular analytics weekly train THIN [analytics/refinery@64ddb08] (duration: 00m 09s) [14:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:12] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2035.codfw.wmnet [14:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:47] !log btullis@deploy1002 Started deploy [analytics/refinery@64ddb08] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@64ddb08] [14:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:50] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:28:56] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2035.codfw.wmnet [14:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:01] (03CR) 10MVernon: [V: 03+2 C: 03+2] Dummy keys and certificates for cassandra (aqs) [labs/private] - 10https://gerrit.wikimedia.org/r/802631 (https://phabricator.wikimedia.org/T307801) (owner: 10Eevans) [14:30:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:30:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: Maintenance [14:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: Maintenance [14:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:00] !log btullis@deploy1002 Finished deploy [analytics/refinery@64ddb08] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@64ddb08] (duration: 07m 12s) [14:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:57] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2036.codfw.wmnet [14:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:34:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T310011)', diff saved to https://phabricator.wikimedia.org/P29541 and previous config saved to /var/cache/conftool/dbconfig/20220608-143450-marostegui.json [14:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:56] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [14:36:32] (03PS4) 10Ori: New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) [14:37:08] (03CR) 10Muehlenhoff: [C: 03+2] exim4: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803548 
(https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:39:52] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2036.codfw.wmnet [14:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:41:09] (03CR) 10Ahmon Dancy: mediawiki: disable revalidation everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [14:42:03] (03CR) 10Ahmon Dancy: [C: 03+1] mediawiki: disable revalidation everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [14:42:55] ACKNOWLEDGEMENT - MD RAID on ms-be1064 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T310181 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:43:00] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1064 - https://phabricator.wikimedia.org/T310181 (10ops-monitoring-bot) [14:44:54] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2037.codfw.wmnet [14:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:12] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803932 (https://phabricator.wikimedia.org/T310150) (owner: 10Awight) [14:46:03] I'll do a beta cluster config deployment now. 
[14:46:06] (03CR) 10Bking: [V: 03+1] Use a custom repository-s3 snapshot [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/803628 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [14:47:25] (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803932 (https://phabricator.wikimedia.org/T310150) (owner: 10Awight) [14:47:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T310011)', diff saved to https://phabricator.wikimedia.org/P29542 and previous config saved to /var/cache/conftool/dbconfig/20220608-144725-marostegui.json [14:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:30] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [14:47:59] (03CR) 10Bking: [V: 03+1 C: 03+2] Use a custom repository-s3 snapshot [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/803628 (https://phabricator.wikimedia.org/T309648) (owner: 10Ebernhardson) [14:48:09] (03Merged) 10jenkins-bot: [beta] Switch maps rendering to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803932 (https://phabricator.wikimedia.org/T310150) (owner: 10Awight) [14:49:24] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2037.codfw.wmnet [14:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:49] (03CR) 10Muehlenhoff: [C: 03+2] gdnsd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799307 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:50:59] (03PS1) 10Filippo Giunchedi: Bring in pingthing [puppet] - 10https://gerrit.wikimedia.org/r/803935 [14:51:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:51:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:52:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:42] (03CR) 10Eevans: [C: 03+1] Pin Cassandra 3.11.13 as 'dev' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [14:53:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:07] (03PS1) 10Filippo Giunchedi: Bring in pingthing alerts [alerts] - 10https://gerrit.wikimedia.org/r/803936 [14:54:25] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet [14:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:14] (03CR) 10MSantos: [C: 03+1] [beta] Switch maps rendering to the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803932 (https://phabricator.wikimedia.org/T310150) (owner: 10Awight) [14:56:17] (03CR) 10Herron: [C: 03+1] Bring in pingthing [puppet] - 10https://gerrit.wikimedia.org/r/803935 (owner: 10Filippo Giunchedi) [14:57:02] (03CR) 10Herron: [C: 03+1] Bring in pingthing alerts [alerts] - 10https://gerrit.wikimedia.org/r/803936 (owner: 10Filippo Giunchedi) [14:58:14] I want to set up simple monitoring for the function-* services on the beta cluster (deployment-prep), to alert on #wikipedia-abstract-tech when the service is down. Is there an existing setup I can use as reference? 
[14:58:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:59:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:28] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet [15:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. Ping me on IRC tomorrow and then we can deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche) [15:01:53] (03CR) 10Filippo Giunchedi: [C: 03+2] Bring in pingthing [puppet] - 10https://gerrit.wikimedia.org/r/803935 (owner: 10Filippo Giunchedi) [15:02:18] (03CR) 10Filippo Giunchedi: [C: 03+2] Bring in pingthing alerts [alerts] - 10https://gerrit.wikimedia.org/r/803936 (owner: 10Filippo Giunchedi) [15:02:22] (03PS2) 10Filippo Giunchedi: Bring in pingthing alerts [alerts] - 10https://gerrit.wikimedia.org/r/803936 [15:02:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P29543 and previous config saved to /var/cache/conftool/dbconfig/20220608-150230-marostegui.json [15:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:25] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:05:29] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet [15:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:53] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:09:34] (03CR) 10Ahmon Dancy: scap: boostrap freshly provisioned scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche) [15:10:30] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet [15:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:27] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Check access rights for 
GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10CDanis) a:03KFrancis [15:13:01] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (10CDanis) a:03MMiller_WMF [15:13:42] !log trim swift logs older than 30d from centrallog2002 - T309171 [15:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:47] T309171: syslog / centrallog log volume growth - https://phabricator.wikimedia.org/T309171 [15:14:02] ori: I'm not aware of anything similar off the top of my head, no [15:14:30] ack, thanks (and hello) [15:14:45] (03PS1) 10Muehlenhoff: noc: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803943 (https://phabricator.wikimedia.org/T308013) [15:14:47] (03PS1) 10Muehlenhoff: wikistats: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803944 (https://phabricator.wikimedia.org/T308013) [15:14:48] ori: hi! :D [15:15:16] the good news is that for production, monitoring is available via the "probes" option in service::catalog [15:15:46] not sure if that's the case for function-* though [15:16:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:17:00] !log trim swift logs older than 30d from centrallog1001 - T309171 [15:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:19] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet [15:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to
https://phabricator.wikimedia.org/P29544 and previous config saved to /var/cache/conftool/dbconfig/20220608-151735-marostegui.json [15:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1060.eqiad.wmnet with OS bullseye [15:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:39] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1060.eqiad.wmnet with OS bullseye [15:23:41] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [15:24:31] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet [15:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:30] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (10CDanis) a:03CDanis For now I'll grant `analytics-privatedata-users` and if later it turns out more access is needed, @EBernhardson or @bscarone can re-... [15:28:35] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:29:33] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [15:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:53] (03CR) 10Btullis: [C: 03+2] Pin Cassandra 3.11.13 as 'dev' [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [15:32:19] (03PS1) 10Cwhite: logstash: add php7.2-fpm to mediawiki error,exception processing [puppet] - 10https://gerrit.wikimedia.org/r/803947 (https://phabricator.wikimedia.org/T234565) [15:32:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T310011)', diff saved to https://phabricator.wikimedia.org/P29545 and previous config saved to /var/cache/conftool/dbconfig/20220608-153240-marostegui.json [15:32:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance [15:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance [15:32:44] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [15:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T310011)', diff saved to https://phabricator.wikimedia.org/P29546 and previous config saved to /var/cache/conftool/dbconfig/20220608-153248-marostegui.json [15:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:15] !log aokoth@cumin1001 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [15:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:11] (03CR) 10Btullis: [C: 03+2] "Oh, there is an error from puppet following merge." [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [15:37:37] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1060.eqiad.wmnet with reason: host reimage [15:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:21] PROBLEM - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is CRITICAL: connect to address 10.64.48.148 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:38:41] (03CR) 10Cwhite: [C: 03+2] logstash: add php7.2-fpm to mediawiki error,exception processing [puppet] - 10https://gerrit.wikimedia.org/r/803947 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:38:56] 10SRE, 10Phabricator, 10serviceops-radar: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644 (10Dzahn) 05Open→03Declined something between resolved and declined. please feel free to reopen though if you feel differently about it. 
[15:38:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T310011)', diff saved to https://phabricator.wikimedia.org/P29547 and previous config saved to /var/cache/conftool/dbconfig/20220608-153858-marostegui.json [15:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:02] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [15:39:06] 10SRE, 10Beta-Cluster-Infrastructure, 10Scap, 10serviceops, 10Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (10dancy) Noting the following settings from the deployment-prep horizon project puppet config page: ` profile:... [15:40:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1060.eqiad.wmnet with reason: host reimage [15:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:12] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:42:13] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:42:28] PROBLEM - cassandra-a CQL 10.64.0.213:9042 on aqs1007 is CRITICAL: connect to address 10.64.0.213 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:42:34] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: 
/analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:42:40] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:42:44] PROBLEM - Checks that the airflow database for airflow research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:43:38] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:43:46] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:44:16] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:44:16] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:44:42] PROBLEM - Checks that the airflow database for airflow analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:45:25] jouncebot nowandnext [15:45:25] No deployments scheduled for the next 2 hour(s) and 14 minute(s) [15:45:25] In 2 hour(s) and 14 minute(s): Train log triage with CPT 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800) [15:45:26] In 2 hour(s) and 14 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800) [15:45:48] RECOVERY - Checks that the airflow database for airflow analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:45:53] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: T309526 btullis [15:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:57] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: T309526 btullis [15:45:58] RECOVERY - Checks that the airflow database for airflow research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:58] RECOVERY - cassandra-a CQL 10.64.0.213:9042 on aqs1007 is OK: TCP OK - 0.000 second response time on 10.64.0.213 port 9042 https://phabricator.wikimedia.org/T93886 [15:48:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10calbon) Approved! 
[15:49:47] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided) [15:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P29548 and previous config saved to /var/cache/conftool/dbconfig/20220608-155403-marostegui.json [15:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:52] 10SRE, 10Beta-Cluster-Infrastructure, 10Scap, 10serviceops, 10Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (10dancy) I'm going to change profile::mediawiki::php::restarts::ensure to true and see how things go. [15:54:54] RECOVERY - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is OK: TCP OK - 0.000 second response time on 10.64.48.148 port 9042 https://phabricator.wikimedia.org/T93886 [15:55:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1060.eqiad.wmnet with OS bullseye [15:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:06] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1060.eqiad.wmnet with OS bullseye completed: - ms-be1060 (**PASS**) - Downtim... 
[16:07:06] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:09:08] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet [16:09:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P29549 and previous config saved to /var/cache/conftool/dbconfig/20220608-160908-marostegui.json [16:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:03] (03PS6) 10Slyngshede: class:apt Add new private repo. [puppet] - 10https://gerrit.wikimedia.org/r/803512 [16:12:50] (03PS7) 10Slyngshede: class:apt Add new private repo. [puppet] - 10https://gerrit.wikimedia.org/r/803512 [16:13:06] (03CR) 10Slyngshede: class:apt Add new private repo. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede) [16:13:10] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet [16:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:16] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [16:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:25] (03CR) 10Eevans: [C: 03+1] Pin Cassandra 3.11.13 as 'dev' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [16:13:46] (03CR) 10CI reject: [V: 04-1] class:apt Add new private repo. 
[puppet] - https://gerrit.wikimedia.org/r/803512 (owner: Slyngshede) [16:18:48] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [16:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:54] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet [16:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:14] (CR) Hashar: "I am dropping myself from the reviewers in favor of Jeena. She wrote that script as part of T255835 and knows about ruamel.yaml :)" [deployment-charts] - https://gerrit.wikimedia.org/r/803886 (https://phabricator.wikimedia.org/T310133) (owner: Samtar) [16:23:18] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet [16:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:24] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet [16:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T310011)', diff saved to https://phabricator.wikimedia.org/P29550 and previous config saved to /var/cache/conftool/dbconfig/20220608-162413-marostegui.json [16:24:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance [16:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance [16:24:17] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [16:24:20] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T310011)', diff saved to https://phabricator.wikimedia.org/P29551 and previous config saved to /var/cache/conftool/dbconfig/20220608-162422-marostegui.json [16:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:32] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet [16:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:38] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet [16:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:53] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet [16:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:59] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet [16:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:25] PROBLEM - Host doh5002 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:27] PROBLEM - Host ganeti5003 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:33] PROBLEM - Host netflow5002 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:41] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:32:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:33:17] PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:39] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:33:43] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:34:01] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:35:37] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48250 bytes in 1.782 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:35:43] RECOVERY - Host ganeti5003 is UP: PING OK - Packet loss = 0%, RTA = 223.20 ms [16:35:53] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.373 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:35:59] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet [16:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:06] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2046.codfw.wmnet [16:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T310011)', diff saved to https://phabricator.wikimedia.org/P29552 and previous config saved to /var/cache/conftool/dbconfig/20220608-163737-marostegui.json [16:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:42] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [16:40:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:40:31] RECOVERY - Host netflow5002 is UP: PING OK - Packet loss = 0%, RTA = 224.85 ms [16:40:33] RECOVERY - Host doh5002 is UP: PING OK - Packet loss = 0%, RTA = 223.35 ms [16:40:33] 10SRE, 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Allow idrac ftp fetching of firmware updates (either to existing ftp or new solution) - https://phabricator.wikimedia.org/T283771 (10RobH) [16:40:55] RECOVERY - Host prometheus5001 is UP: PING OK - Packet loss = 0%, RTA = 246.01 ms [16:41:23] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet [16:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:29] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet [16:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:49] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:43:41] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:43:47] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:43:49] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 98, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:44:11] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 347, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:45:41] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in 
Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10nskaggs) @cmooney , for the manual override, https://wikitech.wikimedia.org/wiki/Network_design_-_Eqiad_WMCS_Network_... [16:46:50] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2047.codfw.wmnet [16:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:57] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet [16:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:19] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:50:50] let me know if you need any help [16:51:38] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet [16:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:45] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet [16:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:14] (03CR) 10Dzahn: utils: Add small script to set up bundler (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803341 (owner: 10Jbond) [16:52:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:52:39] Huh? [16:52:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P29553 and previous config saved to /var/cache/conftool/dbconfig/20220608-165242-marostegui.json [16:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:47] I'm around [16:53:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:57:29] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet [16:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:36] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet [16:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:47] PROBLEM - Check systemd state on netflow5002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:10] legoktm: jayme: see #-sre [17:01:25] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10KFrancis) @MoritzMuehlenhoff Thanks for checking in. Because Goran is no longer an employee of WMDE, I should process a new NDA. Would you please provide Goran's pe... 
[17:02:03] (PS1) RLazarus: shellbox: Double the replicas due to overload [deployment-charts] - https://gerrit.wikimedia.org/r/803953 [17:02:17] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:02:20] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet [17:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:27] (CR) Herron: [C: +1] shellbox: Double the replicas due to overload [deployment-charts] - https://gerrit.wikimedia.org/r/803953 (owner: RLazarus) [17:02:49] (CR) JMeybohm: [C: +1] shellbox: Double the replicas due to overload [deployment-charts] - https://gerrit.wikimedia.org/r/803953 (owner: RLazarus) [17:03:15] (CR) JHathaway: [C: +1] "looks good" [deployment-charts] - https://gerrit.wikimedia.org/r/803953 (owner: RLazarus) [17:03:37] (CR) Krinkle: mediawiki: disable revalidation everywhere (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto) [17:04:45] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:05:28] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:06:07] !log dancy@deploy1002 prep aborted: (duration: 24m 59s) [17:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:25] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:06:51] (CR) RLazarus: [C: +2] shellbox: Double the replicas due to overload [deployment-charts] - https://gerrit.wikimedia.org/r/803953 (owner: RLazarus) [17:07:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P29554 and previous config saved to /var/cache/conftool/dbconfig/20220608-170747-marostegui.json [17:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:11] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:08:21] (PS2) Krinkle: Profiler: Use absolute namespace in Excimer flush error handler [mediawiki-config] - https://gerrit.wikimedia.org/r/803904 (https://phabricator.wikimedia.org/T310155) (owner: Lucas Werkmeister (WMDE)) [17:08:52] (CR) Krinkle: [C: +2] Profiler: Use absolute namespace in Excimer flush error handler [mediawiki-config] - https://gerrit.wikimedia.org/r/803904 (https://phabricator.wikimedia.org/T310155) (owner: Lucas Werkmeister (WMDE)) [17:09:09] PROBLEM - Host ganeti3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:08] (Merged) jenkins-bot: Profiler: Use absolute namespace in Excimer flush error handler [mediawiki-config] - https://gerrit.wikimedia.org/r/803904 (https://phabricator.wikimedia.org/T310155) (owner: Lucas Werkmeister (WMDE)) [17:10:12] (Merged) jenkins-bot: shellbox: Double the replicas due to overload [deployment-charts] - https://gerrit.wikimedia.org/r/803953 (owner: RLazarus) [17:10:49] RECOVERY - Host ganeti3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.73 ms [17:11:26] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:41] (PS1) Ahmon Dancy: scap.cfg.erb: Define php_fpm restart settings for beta
cluster [puppet] - https://gerrit.wikimedia.org/r/803955 (https://phabricator.wikimedia.org/T237033) [17:12:41] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:11] !log the above "helmfile -i apply" was canceled [17:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:20] (PS2) Ahmon Dancy: scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - https://gerrit.wikimedia.org/r/803955 (https://phabricator.wikimedia.org/T237033) [17:13:20] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [17:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:56] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [17:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:39] !log krinkle@deploy1002 Synchronized src/Profiler.php: I534fb954c359c29a3f018eec75f62b4c4bfcd23f (duration: 03m 35s) [17:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:45] SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (RobH) ganeti3003 firmware updates bios 2.2.11 to 2.14.2 nic 21.40.22.20 to 21.85.21.92 idrac 3.34.34.34 to 5.10.10.00 [17:15:23] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti3003.esams.wmnet with OS bullseye [17:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:15:28] SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti3003.esams.wmnet with OS bullseye [17:17:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:34] (PS1) Krinkle: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803956 [17:18:36] (PS1) Krinkle: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803957 [17:19:32] (CR) CI reject: [V: -1] Profiler: Inject 'statsd' option from PhpAutoPrepend.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803956 (owner: Krinkle) [17:21:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:21:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:35] (CR) Ori: "This change is ready for review."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/803958 (https://phabricator.wikimedia.org/T300911) (owner: 10Ori) [17:21:51] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:39] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T310011)', diff saved to https://phabricator.wikimedia.org/P29555 and previous config saved to /var/cache/conftool/dbconfig/20220608-172252-marostegui.json [17:22:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [17:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [17:22:57] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [17:22:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:06] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T310011)', diff saved to https://phabricator.wikimedia.org/P29556 and previous config saved to /var/cache/conftool/dbconfig/20220608-172305-marostegui.json [17:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:24:11] !log hashar@deploy1002 Started deploy [integration/docroot@e810fc7]: Update Wikibase section [17:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:20] !log hashar@deploy1002 Finished deploy [integration/docroot@e810fc7]: Update Wikibase section (duration: 00m 08s) [17:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:44] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is CRITICAL: 12 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [17:25:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:24] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is CRITICAL: 12 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [17:25:49] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 12 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [17:26:02] jouncebot: 
nowandnext [17:26:02] No deployments scheduled for the next 0 hour(s) and 33 minute(s) [17:26:03] In 0 hour(s) and 33 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800) [17:26:03] In 0 hour(s) and 33 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800) [17:26:14] Cool, I'll sling out a Beta-Cluster-only one. [17:26:17] (CR) Jforrester: [C: +2] "Oh, oops, yes." [mediawiki-config] - https://gerrit.wikimedia.org/r/803958 (https://phabricator.wikimedia.org/T300911) (owner: Ori) [17:26:45] James_F: we're still working an issue with shellbox but I think you can proceed, as long as you're not doing anything Score-related [17:26:55] rzl: Yeah, just a `git pull` in /srv [17:27:01] Not even a scap. [17:27:01] (Merged) jenkins-bot: [BETA CLUSTER] Add wikifunctions to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - https://gerrit.wikimedia.org/r/803958 (https://phabricator.wikimedia.org/T300911) (owner: Ori) [17:27:07] (Done.)
[17:27:18] ACKNOWLEDGEMENT - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is CRITICAL: 12 snaps in the admin project Andrew Bogott nicholas is investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [17:27:18] ACKNOWLEDGEMENT - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is CRITICAL: 12 snaps in the admin project Andrew Bogott nicholas is investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [17:27:19] ACKNOWLEDGEMENT - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 12 snaps in the admin project Andrew Bogott nicholas is investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [17:27:35] thanks [17:28:00] Now we just have to wait for Beta Cluster's update to verify. [17:28:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:20] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:29:49] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:56] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [17:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:30:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:31:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:09] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:14] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [17:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:33] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3003.esams.wmnet with reason: host reimage [17:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T310011)', diff saved to https://phabricator.wikimedia.org/P29558 and previous config saved to /var/cache/conftool/dbconfig/20220608-173536-marostegui.json [17:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:39] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [17:36:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3003.esams.wmnet with reason: host reimage [17:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [17:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:48] (03CR) 10Dzahn: utils: Add small script to set up bundler (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803341 (owner: 10Jbond) [17:37:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:38:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:51] (03PS2) 10Krinkle: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803956 [17:39:53] (03PS2) 10Krinkle: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803957 [17:43:56] !log rolled back shellbox main to revision 2 on eqiad, to unstick a stuck upgrade [17:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:52] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:33] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [17:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P29559 and previous config saved to /var/cache/conftool/dbconfig/20220608-175041-marostegui.json [17:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:51:48] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (10MMiller_WMF) I approve -- @KStoller-WMF needs access to these tools to analyze data as part of her product management role. [17:52:04] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (10MMiller_WMF) a:05MMiller_WMF→03CDanis [17:54:12] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3003.esams.wmnet with OS bullseye [17:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:16] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti3003.esams.wmnet with OS bullseye completed: - ganeti3003 (**PASS**) - Downtimed on Icinga/Ale... [17:57:46] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10RobH) a:05RobH→03MoritzMuehlenhoff ganeti3003 firmware updated and reimaged to bullseye (easy enough to fire the cookbook to reimage post firmware update to ensure the firmware update fixes... [18:00:05] dduvall and jeena: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800). [18:00:05] dduvall and jeena: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800). 
[18:02:23] !log joal@deploy1002 Started deploy [airflow-dags/analytics@6b368f4]: Update more jobs to spark3 [18:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:37] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@6b368f4]: Update more jobs to spark3 (duration: 00m 13s) [18:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:23] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:05:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P29560 and previous config saved to /var/cache/conftool/dbconfig/20220608-180546-marostegui.json [18:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:31] (03CR) 10Krinkle: [C: 04-1] Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [18:12:33] (03PS1) 10Dduvall: group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803967 (https://phabricator.wikimedia.org/T308068) [18:12:35] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803967 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [18:13:35] (03CR) 10Dzahn: [C: 03+2] sre: update renamed otrs role to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802579 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [18:13:40] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803967 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [18:13:43] (03CR) 10Dzahn: [C: 03+2] "ACK, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/802579 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [18:15:18] (03CR) 10Dzahn: [C: 03+2] vrts: delete idle_agent_report [puppet] - 10https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [18:15:25] (03PS3) 10Dzahn: vrts: delete idle_agent_report [puppet] - 10https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942) [18:17:01] (03CR) 10Dzahn: "ACK, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [18:17:27] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.15 refs T308068 [18:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:31] T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068 [18:19:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:26] (03CR) 10RLazarus: "Two questions:" [puppet] - 10https://gerrit.wikimedia.org/r/803560 (owner: 10Jbond) [18:20:51] (03PS1) 10MewOphaswongse: Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) [18:20:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T310011)', diff saved to https://phabricator.wikimedia.org/P29561 and previous config saved to /var/cache/conftool/dbconfig/20220608-182051-marostegui.json [18:20:52] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.15 refs T308068 (duration: 03m 25s) [18:20:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [18:20:55] !log marostegui@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [18:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:56] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [18:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:29] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) Entered https://wikimedia.coupahost.com/easy_form_responses/3234 into coupa for this work, Jin will coordinate with Arzhel via email and hangout for the actual work window. [18:21:59] (03CR) 10Eevans: [C: 03+1] "For posterity sake: This has now been fixed." [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [18:22:40] see a very large spike of errors on jsonTruncated channel [18:22:44] rolling back [18:23:21] also a handful of db errors for wikinews sites, "Error 1146: Table 'enwikinews.`categorylinks`' doesn't exist" [18:23:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:23:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:36] (03PS1) 10MewOphaswongse: Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803971 (https://phabricator.wikimedia.org/T309926) [18:24:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [18:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:10] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.39.0-wmf.15" [18:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [18:32:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [18:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:27] (03PS1) 10Dduvall: Revert "group1 wikis to 1.39.0-wmf.15" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803972 [18:32:29] (03CR) 10Dduvall: [C: 03+2] Revert "group1 wikis to 1.39.0-wmf.15" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803972 (owner: 10Dduvall) [18:33:47] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.39.0-wmf.15" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803972 (owner: 10Dduvall) [18:34:08] (03PS1) 10Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 [18:35:49] (03CR) 10CI reject: [V: 04-1] airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns) [18:38:11] !log upgrading aqs1010.eqiad.wmnet to Cassandra 3.11.13 (canary) -- T309896 [18:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:16] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [18:38:25] (03PS2) 10Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - 10https://gerrit.wikimedia.org/r/803973 [18:39:44] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:40:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:40:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:30] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Addshore) >>! In T238751#7988593, @ItamarWMDE wrote: > @Addshore Does it mean we need to then de-abandon that change... [18:41:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:58] (03CR) 10Joal: [C: 03+1] "Thanks mforns" [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns) [18:45:48] (03CR) 10Jeena Huneidi: "This looks good to me but before merging we need to update to the newest version of ruamel on the integration agents. 
It's installed via p" [deployment-charts] - 10https://gerrit.wikimedia.org/r/803886 (https://phabricator.wikimedia.org/T310133) (owner: 10Samtar) [18:48:47] (03CR) 10CI reject: [V: 04-1] Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803971 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse) [18:49:24] (03CR) 10CI reject: [V: 04-1] Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse) [18:57:56] is there anyone handy that understand our logstash configuration? [18:58:45] herron, in case you are around ^ [18:58:57] (asking you since it says you are on-call; apologies if not) [18:59:07] hey [18:59:22] heya [18:59:51] I just upgraded one Cassandra node, and now instead of the hostname in the logs it says %{HOSTNAME} [19:00:09] I'm sure something must have changed on the Cassandra end, but I noticed that this is the result of a filter: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/logstash/filters/20-filter_logback.conf [19:01:06] do you know how that filter works, or what it's (supposed to be) doing? [19:01:28] (looking at line #8) [19:02:20] https://logstash.wikimedia.org/goto/4ac95436ebea812c70a1b70cfd5338e4 is the upgraded node, and apparently the only one doing this... [19:03:56] I'm guessing that the hostname is no longer being parsed out successfully from the log, so that mutate is replacing host with nothing essentially [19:05:25] is there any easy way of seeing what is *actually* being sent? 
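An aside on the symptom discussed above: as far as Logstash's documented field-reference behaviour goes, a `%{field}` reference to a field that is absent from the event is left in place as literal text instead of being replaced with an empty value, which would be consistent with `%{HOSTNAME}` showing up in the logs. The Python below is only a rough sketch of that substitution rule; `expand_refs` and both sample events are hypothetical and are not taken from the linked filter or from the real log data.

```python
import re

# Matches Logstash-style field references such as %{HOSTNAME}.
FIELD_REF = re.compile(r"%\{([^}]+)\}")


def expand_refs(template, event):
    """Replace each %{name} with event[name]; keep the reference as-is if the field is missing."""
    def repl(match):
        name = match.group(1)
        if name in event:
            return str(event[name])
        # Missing field: the literal reference text survives unchanged.
        return match.group(0)

    return FIELD_REF.sub(repl, template)


# Hypothetical events: one that carries HOSTNAME and one that no longer does.
event_with_field = {"HOSTNAME": "aqs1010", "message": "sample"}
event_without_field = {"message": "sample"}

host_before = expand_refs("%{HOSTNAME}", event_with_field)
host_after = expand_refs("%{HOSTNAME}", event_without_field)

print(host_before)
print(host_after)
```

In this sketch the first event resolves to the hostname, while the second keeps the unexpanded reference, which is the pattern reported for the upgraded node.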
[19:05:54] emphasis on "easy" :) [19:07:07] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:08:27] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet [19:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:59] urandom: which host was upgraded? [19:09:10] aqs1010? [19:09:14] yes [19:11:50] alright, yeah I think we can do something there to get at the raw logs 1 min [19:14:09] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet [19:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:16] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet [19:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:26] (03CR) 10Dzahn: [C: 03+2] "safe enough since it just affects 'eqiad1.wikimedia.cloud]'" [puppet] - 10https://gerrit.wikimedia.org/r/803955 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy) [19:19:58] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet [19:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:03] urandom: I added a temporary shim that should output raw logs to /tmp/logback_debug.log [19:20:04] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet [19:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:24] herron: where is that? 
[19:20:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:20:59] herron: should I induce some output? [19:21:04] aqs1010:/tmp/logback_debug.log which is output by rsyslog on the host [19:21:07] yes please [19:22:04] herron: there are a few [19:22:18] herron: and now it's likely to get really chatty [19:23:22] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet [19:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:28] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet [19:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:15] herron: OK, so 'host' (I guess that was the old property?) 
wasn't renamed to something, it's just gone altogether [19:25:58] yeah looks like HOSTNAME is missing from the source logs [19:27:54] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet [19:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:01] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2055.codfw.wmnet [19:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:33] PROBLEM - Apache HTTP on mw1415 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 974 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:31:44] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse) [19:32:16] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803971 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse) [19:32:46] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2055.codfw.wmnet [19:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:24] (03CR) 10Dzahn: [C: 03+1] "regarding that single commit from the commit message: I am not sure who the unknown author was but PS1 and PS2 seem to be identical. but w" [puppet] - 10https://gerrit.wikimedia.org/r/803944 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [19:34:02] jouncebot: now [19:34:02] For the next 0 hour(s) and 25 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800) [19:34:31] deployers: that 500 Internal server error is from a canary. careful [19:35:20] hmm [19:38:37] urandom: logback_debug.log is looking better now. 
did the upgrade clobber logback custom fields config or something? [19:39:14] herron: nope, I live-hacked a test fix [19:39:18] (03CR) 10Hokwelum: [C: 03+1] "Ariel and I tested this and it looks good.." [puppet] - 10https://gerrit.wikimedia.org/r/802921 (https://phabricator.wikimedia.org/T301104) (owner: 10Mitar) [19:39:52] herron: I set HOSTNAME as a custom field (using logbacks ${HOSTNAME}) [19:40:06] ah gotcha, so that's new [19:40:49] I'm going to codify that in a changeset so that I don't have to rollback mid-upgrade, but it's probably a work-around [19:41:15] urandom: cool sounds good, yeah seems to be working well enough [19:41:46] upstream Cassandra upgraded logback from 1.1.3 to 1.2.9 (a huge jump), and I'm not even sure we're using the "right" appender anymore [19:42:08] herron: I'll probably prod you for a code review here in a bit :) [19:42:18] I'll leave that logback_debug.log config in place, the next puppet run will undo it [19:42:35] urandom: ok will keep an eye out for it [19:42:51] (03PS1) 10Ahmon Dancy: Revert "scap.cfg.erb: Define php_fpm restart settings for beta cluster" [puppet] - 10https://gerrit.wikimedia.org/r/803908 [19:45:07] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "scap.cfg.erb: Define php_fpm restart settings for beta cluster" [puppet] - 10https://gerrit.wikimedia.org/r/803908 (owner: 10Ahmon Dancy) [19:46:16] (03PS1) 10Eevans: Set HOSTNAME as a custom Cassandra logback field [puppet] - 10https://gerrit.wikimedia.org/r/803978 (https://phabricator.wikimedia.org/T309896) [19:46:57] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1037.eqiad.wmnet [19:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:05] PROBLEM - PHP7 rendering on mw1415 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 871 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:49:16] (03CR) 10Eevans: [C: 03+1] "PPC output: 
https://puppet-compiler.wmflabs.org/pcc-worker1002/35800/" [puppet] - 10https://gerrit.wikimedia.org/r/803978 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [19:49:25] (03CR) 10Herron: [C: 03+1] Set HOSTNAME as a custom Cassandra logback field [puppet] - 10https://gerrit.wikimedia.org/r/803978 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [19:50:09] herron: the old config was a symlink, so I had to break that and use a copy of the target (so the diff is large). The puppet compiler output shows the actual (tiny) change. [19:51:30] urandom: got it, lgtm! [19:51:33] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1037.eqiad.wmnet [19:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:14] urandom: want me to merge this? [19:52:21] yes please! [19:52:28] kk doing [19:52:39] (03CR) 10Herron: [C: 03+2] Set HOSTNAME as a custom Cassandra logback field [puppet] - 10https://gerrit.wikimedia.org/r/803978 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans) [19:52:49] herron: thanks for all your help! [19:53:46] urandom: any time! ready for puppet to run on aqs1010 now? [19:54:01] sure (although that I actually can do) :) [19:54:24] ah even better, I will leave you to it then! [19:54:43] for hysterical raisins I have root on those clusters [19:55:21] PROBLEM - mediawiki-installation DSH group on mw1415 is CRITICAL: Host mw1415 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:55:34] herron: I see you have it disable, and assume it's OK to reenable? 
[19:55:41] *disabled [19:56:22] yup, was just disabled to avoid clobbering the rsyslog logback_debug.log hack, ready to re-enable [19:58:51] !log restarting Cassandra, aqs1010-{a,b}, to apply logback work-around -- T309896 [19:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:55] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [20:00:05] RoanKattouw, Urbanecm, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T2000). [20:00:05] mewoph: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] hi mewoph! I can deploy today. [20:00:48] thanks! we have some unrelated failing tests :( [20:01:20] mewoph: was just going to mention that. do we know why they fail? [20:01:30] (03PS1) 10Krinkle: Profiler: Remove unused mongodb 'xhgui' option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803979 (https://phabricator.wikimedia.org/T180761) [20:04:36] I see that the alt text is missing in the string comparison in the failing test, but it's from ParserIntegrationTest so that's most likely not related to GrowthExperiments change [20:05:14] mutante: I filed https://phabricator.wikimedia.org/T310225 [20:06:02] mewoph: do you know whether this is also an issue on master, or just in wmf.XX? (didn't do anything in GE today, so not sure myself) [20:08:21] dancy: thanks! 
ack [20:10:19] (03CR) 10Cwhite: [C: 03+2] logstash: canary curator fork on codfw [puppet] - 10https://gerrit.wikimedia.org/r/803586 (https://phabricator.wikimedia.org/T301017) (owner: 10Cwhite) [20:13:26] mewoph: ping re my above message :) [20:13:32] sorry i don't know [20:13:45] okay [20:14:23] we might need to re-schedule this backport :( [20:14:35] yeah, I don't really want to overrule CI without knowing why it fails :/ [20:17:17] urbanecm: it fails because parser integration tests periodically go out of date [20:18:04] urbanecm: see e.g. T265024 [20:18:07] T265024: Parser tests are broken for GrowthExperiments - https://phabricator.wikimedia.org/T265024 [20:18:31] oops, T302964 is a better reference [20:18:31] T302964: ParserIntegrationTest::testParse with data set "parserTests.txt: Media link with nested wikilinks" ('legacy parser') - https://phabricator.wikimedia.org/T302964 [20:18:41] specifically https://phabricator.wikimedia.org/T302964#7750242 [20:20:04] (03CR) 10Herron: [C: 04-2] Add role::netmon to the netmon1003 instance. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [20:22:45] urbanecm: https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/803954 would need to be backported to wmf.14 and wmf.15, I think [20:23:06] but anyway, I think force merging is fine as this issue is unrelated. [20:29:40] 10SRE, 10SRE-Access-Requests, 10Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (10CDanis) @leila I saw on the Research project page you linked that the project lasts through August, so I set an `expiry_date` of Sept 1st 2022 in my patc... 
[20:31:09] (03PS1) 10CDanis: bscarone: shell/analytics/krb access [puppet] - 10https://gerrit.wikimedia.org/r/803982 (https://phabricator.wikimedia.org/T310021) [20:33:08] !log rolling back group0 as well due to T310214 [20:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:13] T310214: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'enwikinews.`categorylinks`' doesn't exist - https://phabricator.wikimedia.org/T310214 [20:33:52] (03CR) 10CDanis: [C: 03+2] bscarone: shell/analytics/krb access [puppet] - 10https://gerrit.wikimedia.org/r/803982 (https://phabricator.wikimedia.org/T310021) (owner: 10CDanis) [20:35:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:35:46] (03CR) 10Arlolra: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803971 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse) [20:35:59] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.39.0-wmf.15" [20:36:00] (03CR) 10Arlolra: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse) [20:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:53] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:36:54] (03PS1) 10Dduvall: Revert "group0 wikis to 1.39.0-wmf.15" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803983 [20:36:56] (03CR) 10Dduvall: [C: 03+2] Revert "group0 wikis to 
1.39.0-wmf.15" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803983 (owner: 10Dduvall) [20:37:43] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.39.0-wmf.15" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803983 (owner: 10Dduvall) [20:37:48] 10SRE, 10SRE-Access-Requests, 10Research, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (10CDanis) 05Open→03Resolved @bscarone you should now be able to use that SSH key to access production per the shell access instru... [20:41:52] (03PS1) 10CDanis: kstoller analytics access [puppet] - 10https://gerrit.wikimedia.org/r/803984 (https://phabricator.wikimedia.org/T310002) [20:42:27] !log krinkle@mw1415: Run `scap pull` manually ref T310225 [20:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:31] T310225: mw1415 fatals due to serving responses from 1.39.0-wmf.10 (was DBQueryError: Unknown column page_restrictions) - https://phabricator.wikimedia.org/T310225 [20:42:39] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1064 - https://phabricator.wikimedia.org/T310181 (10wiki_willy) a:03Cmjohnson [20:42:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:01] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1064 - https://phabricator.wikimedia.org/T310160 (10wiki_willy) a:03Cmjohnson [20:43:19] 10SRE, 10ops-eqiad: Failed PSU on ganeti1023 - https://phabricator.wikimedia.org/T310041 (10wiki_willy) a:03Jclark-ctr [20:43:36] (03CR) 10CDanis: [C: 03+2] kstoller analytics access [puppet] - 10https://gerrit.wikimedia.org/r/803984 (https://phabricator.wikimedia.org/T310002) (owner: 10CDanis) [20:43:47] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10wiki_willy) 
a:03Cmjohnson [20:43:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:43:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:53] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (10CDanis) 05Open→03Resolved Access should be live within 30 minutes! Please re-open if you have any trouble. [20:47:54] 10SRE, 10LDAP-Access-Requests, 10Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (10leila) approved. (and the expiration date for access can be set to September 1, 2022.) Thanks! [20:52:29] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet [20:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:58:27] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet [20:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:08] urbanecm: the wmf14 patch tests are passing again, right at the end of the backport window :/ should we re-schedule or is it still ok to backport now? 
[21:01:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:03:07] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [21:04:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:07:45] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.)
timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [21:09:45] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:10:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:19] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:14:23] RECOVERY - PHP7 rendering on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:14:23] RECOVERY - Apache HTTP on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:14:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:15:43] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl.
mgmt - https://phabricator.wikimedia.org/T307755 (10Dzahn) 21:13 < mutante> !log mw1415 - scap pull, restart apache, /usr/local/sbin/restart-php7.2-fpm (INFO: The server is depooled from all services. Restarting the service directly) [21:16:12] (03PS1) 10Jdlrobson: [beta cluster] Enable VectorTitleAboveTabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803988 (https://phabricator.wikimedia.org/T309398) [21:17:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:17:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:13] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1415.eqiad.wmnet [21:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:20:37] RECOVERY - mediawiki-installation DSH group on mw1415 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:23:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:39] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:28:26] ok if i do a quick labs deploy? 
[21:32:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:33:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:35] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48250 bytes in 4.137 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:35:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:40:12] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw1415.eqiad.wmnet [21:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:10] !log repooled mw1415 after restarting apache and php-fpm, seeing all Icinga alerts recover etc T307755 T310225 [21:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:16] T307755: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 [21:41:16] T310225: mw1415 fatals due to serving responses from 1.39.0-wmf.10 (was DBQueryError: Unknown column page_restrictions) - https://phabricator.wikimedia.org/T310225 [21:45:29] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (10Dzahn) This caused T310225 because setting it to pooled=inactive does not mean monitoring will stop checking it and when this came back unexpectedly it caused new alerts for 500s on... [21:46:19] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl. 
mgmt - https://phabricator.wikimedia.org/T307755 (10Dzahn) 05In progress→03Resolved a:03Dzahn [21:47:10] (03CR) 10Clare Ming: [C: 03+2] [beta cluster] Enable VectorTitleAboveTabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803988 (https://phabricator.wikimedia.org/T309398) (owner: 10Jdlrobson) [21:48:07] (03Merged) 10jenkins-bot: [beta cluster] Enable VectorTitleAboveTabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803988 (https://phabricator.wikimedia.org/T309398) (owner: 10Jdlrobson) [21:52:28] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:803988|[beta cluster] Enable VectorTitleAboveTabs (T309398)]] (duration: 03m 32s) [21:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:32] T309398: Toolbar styling - https://phabricator.wikimedia.org/T309398 [21:53:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:57] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Jclark-ctr) [22:00:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:00:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:23] (03PS1) 10Bking: elastic: increment BUILD_VERSION [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/803996 (https://phabricator.wikimedia.org/T309648) [22:10:59] (03CR) 10Ebernhardson: [C: 03+1] elastic: 
increment BUILD_VERSION [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/803996 (https://phabricator.wikimedia.org/T309648) (owner: 10Bking) [22:11:57] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:15:25] (03CR) 10Bking: [C: 03+2] elastic: increment BUILD_VERSION [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/803996 (https://phabricator.wikimedia.org/T309648) (owner: 10Bking) [22:24:56] mewoph: kostajh: sorry, i was afk as i thought we don't know why it is not failing. let's do it tomorrow please :) [22:25:17] *why it _is_ failing [22:25:32] (03PS1) 10Krinkle: rdbms: move mysql isQuotedIdentifier() override to SQLPlatform [core] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803909 (https://phabricator.wikimedia.org/T310214) [22:29:59] urbanecm: thanks! i added it to tomorrow's window [22:30:04] thanks! 
[22:30:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:32:43] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:39:23] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:42:07] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:42:07] (03CR) 10Tim Starling: "> maybe a broader solution is needed" [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [22:44:21] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:46:17] (03PS1) 10Ryan Kemper: Bump changelog for custom repository-s3 snapshot [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/804003 (https://phabricator.wikimedia.org/T309648) [22:46:28] (03PS1) 10Ebernhardson: Add a check that deb is unreleased in prepare_commit [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/804004 [22:47:23] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Bump changelog for custom repository-s3 snapshot [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/804003 (https://phabricator.wikimedia.org/T309648) (owner: 10Ryan Kemper) [22:54:49] PROBLEM - Varnish traffic drop between 30min ago and now 
at esams on alert1001 is CRITICAL: 43.22 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:54:51] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 27.25 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:57:07] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 79.27 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:57:11] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:57:12] ^ I would expect these to heal soon given that the traffic looks normal again [22:57:16] Oh, just beat me to it [22:58:13] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) a:05Cmjohnson→03Andrew I think this should be assigned to me, to put the new hosts into service. That's currently blocked by a... [23:01:08] (03CR) 10Ryan Kemper: "Great idea. See inline for one small question" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/804004 (owner: 10Ebernhardson) [23:02:49] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:06:44] anyone getting dns issues? or just me? 
[23:07:26] (03CR) 10Ryan Kemper: [C: 03+1] Add a check that deb is unreleased in prepare_commit (031 comment) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/804004 (owner: 10Ebernhardson) [23:07:38] (03PS2) 10Ryan Kemper: Add a check that deb is unreleased in prepare_commit [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/804004 (owner: 10Ebernhardson) [23:08:40] !log T309648 Built `wmf-elasticsearch-search-plugins_6.8.23-3` (https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/804003) following steps in https://phabricator.wikimedia.org/P19522. Result: https://apt.wikimedia.org/wikimedia/pool/component/elastic68/w/wmf-elasticsearch-search-plugins/ [23:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:45] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [23:08:59] Seddon: it wouldn't be Virgin Media would it? not just you but it looks like your ISP [23:09:24] rzl, possibly yes [23:11:08] rzl: I switch my mac to google dns and it all resolved [23:11:21] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648 [23:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:10] Seddon: good to hear -- browser reports from country=GB are trending down too, so it might have just been good timing [23:12:24] either way, appreciate the report! 
nothing for us to do as it turned out, but there might be next time [23:12:43] (03Abandoned) 10Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - 10https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) (owner: 10Ryan Kemper) [23:15:30] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648 [23:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:35] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [23:15:49] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:25:28] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:25:53] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX