[00:01:24] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[00:01:44] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[00:05:22] <icinga-wm>	 PROBLEM - Check systemd state on miscweb2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:42] <icinga-wm>	 PROBLEM - Check systemd state on miscweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:59] <wikibugs>	 (PS1) Brennen Bearnes: tag-release.sh: remove submodule force-push [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/803621 (https://phabricator.wikimedia.org/T309910)
[00:29:48] <icinga-wm>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-05-31 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:30:47] <wikibugs>	 (PS7) Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809)
[00:30:49] <wikibugs>	 (PS6) Tim Starling: Clean up scap sequencing workaround [mediawiki-config] - https://gerrit.wikimedia.org/r/801836
[00:30:52] <icinga-wm>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-05-31 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:31:39] <wikibugs>	 (CR) CI reject: [V: -1] Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: Tim Starling)
[00:32:30] <wikibugs>	 (CR) Tim Starling: "In PS7 I made it so that in the secondary data center, connecting to the x2 local master will not use SSL. Configuration overhead increase" [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: Tim Starling)
[00:33:59] <wikibugs>	 (PS8) Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809)
[00:34:01] <wikibugs>	 (PS7) Tim Starling: Clean up scap sequencing workaround [mediawiki-config] - https://gerrit.wikimedia.org/r/801836
[00:36:33] <wikibugs>	 (CR) Tim Starling: "PS8: phpcs" [mediawiki-config] - https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: Tim Starling)
[00:56:43] <wikibugs>	 (CR) Brennen Bearnes: "Pushed as part of cleanup of tagging." [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/803621 (https://phabricator.wikimedia.org/T309910) (owner: Brennen Bearnes)
[01:01:06] <icinga-wm>	 RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-06-07 00:00:01 (3105 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:06:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:15:34] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:15:48] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:20:02] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:20:16] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48248 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:33:56] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host clouddumps1001.wikimedia.org
[01:33:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:06] <wikibugs>	 SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (leila)
[01:55:32] <wikibugs>	 SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (leila) This request is approved on my end.  (Please note that I'm not sure if other than `analytics-privatedata-users` whether Bruno needs access to another group...
[01:57:54] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:04:48] <icinga-wm>	 RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-06-07 00:00:01 (3105 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:09:28] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:12:38] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:41:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:16:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:22:32] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:27:02] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.069 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:28:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (wiki_willy) Looks like rising gas prices contributed to the higher freight charges.  The 3x freight options would be with OSF: $3716.76, Pegasus: $5.777.78, and Hollander: $4235.89, so we'll just go ahead...
[03:31:26] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:52:14] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:36:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:40:59] <wikibugs>	 (PS1) Ebernhardson: Use a custom repository-s3 snapshot [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/803628 (https://phabricator.wikimedia.org/T309648)
[05:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:16:22] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:20:38] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:21:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:27:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[05:27:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[05:27:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T310011)', diff saved to https://phabricator.wikimedia.org/P29490 and previous config saved to /var/cache/conftool/dbconfig/20220608-052745-marostegui.json
[05:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:49] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[05:32:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T310011)', diff saved to https://phabricator.wikimedia.org/P29491 and previous config saved to /var/cache/conftool/dbconfig/20220608-053201-marostegui.json
[05:32:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P29492 and previous config saved to /var/cache/conftool/dbconfig/20220608-054706-marostegui.json
[05:47:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:13] <wikibugs>	 (PS1) Marostegui: db1143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/803742 (https://phabricator.wikimedia.org/T310114)
[05:47:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143 for migration to 10.6 T310114', diff saved to https://phabricator.wikimedia.org/P29493 and previous config saved to /var/cache/conftool/dbconfig/20220608-054718-root.json
[05:47:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:22] <stashbot>	 T310114: Migrate a s4 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T310114
[05:48:46] <wikibugs>	 (CR) Marostegui: [C: +2] db1143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/803742 (https://phabricator.wikimedia.org/T310114) (owner: Marostegui)
[06:00:08] <wikibugs>	 (PS1) Marostegui: db1143: Install mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/803743 (https://phabricator.wikimedia.org/T310114)
[06:01:11] <wikibugs>	 (CR) Marostegui: [C: +2] db1143: Install mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/803743 (https://phabricator.wikimedia.org/T310114) (owner: Marostegui)
[06:02:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P29494 and previous config saved to /var/cache/conftool/dbconfig/20220608-060211-marostegui.json
[06:02:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:58] <wikibugs>	 (PS1) KartikMistry: Add explicit dependency to oojs RL module [extensions/UniversalLanguageSelector] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803536 (https://phabricator.wikimedia.org/T309793)
[06:17:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T310011)', diff saved to https://phabricator.wikimedia.org/P29495 and previous config saved to /var/cache/conftool/dbconfig/20220608-061717-marostegui.json
[06:17:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[06:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[06:17:21] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:17:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29496 and previous config saved to /var/cache/conftool/dbconfig/20220608-061724-marostegui.json
[06:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:18] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:20:22] * kart_ updating cxserver.
[06:21:06] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to  2022-05-31-123738-production [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: KartikMistry)
[06:22:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29497 and previous config saved to /var/cache/conftool/dbconfig/20220608-062245-marostegui.json
[06:22:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:49] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:25:13] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to  2022-05-31-123738-production [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: KartikMistry)
[06:27:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:27:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:59] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:28:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:23] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:34:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:33] <kart_>	 OK. Looks like that fails..
[06:36:10] <kart_>	 marostegui: What can be reason for `{"status":500,"type":"internal_error","title":"Error","detail":"connect ECONNREFUSED 127.0.0.1:3306","method":"GET","uri":"/v2/suggest/sections/Gujarat/en/gu"}` Ref: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663
[06:37:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P29498 and previous config saved to /var/cache/conftool/dbconfig/20220608-063751-marostegui.json
[06:37:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:59] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for
[06:37:59] <icinga-wm>	 given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[06:39:11] <kart_>	 ah. I'll revert.
[06:39:19] <marostegui>	 kart_: yeah not sure about that 
[06:39:48] <wikibugs>	 (PS1) KartikMistry: Revert "Update cxserver to  2022-05-31-123738-production" [deployment-charts] - https://gerrit.wikimedia.org/r/803537
[06:40:52] <kart_>	 marostegui: m5 not accessible by cxserver ie something with network policy in my patch.
[06:42:10] <wikibugs>	 (PS1) Ayounsi: Homer: add REQUESTS_CA_BUNDLE for new Netbox endpoint [puppet] - https://gerrit.wikimedia.org/r/803858 (https://phabricator.wikimedia.org/T296452)
[06:43:36] <wikibugs>	 (CR) Ayounsi: [C: +2] Homer: add REQUESTS_CA_BUNDLE for new Netbox endpoint [puppet] - https://gerrit.wikimedia.org/r/803858 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[06:44:35] <wikibugs>	 (CR) KartikMistry: [C: +2] Revert "Update cxserver to  2022-05-31-123738-production" [deployment-charts] - https://gerrit.wikimedia.org/r/803537 (owner: KartikMistry)
[06:44:38] <marostegui>	 kart_: I guess firewalls?
[06:46:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:46:46] <kart_>	 I'm not sure how to handle that. akosiaris can you look when around?
[06:46:59] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for
[06:46:59] <icinga-wm>	 given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[06:47:37] <wikibugs>	 (Merged) jenkins-bot: Revert "Update cxserver to  2022-05-31-123738-production" [deployment-charts] - https://gerrit.wikimedia.org/r/803537 (owner: KartikMistry)
[06:47:44] <wikibugs>	 (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/803253 (owner: L10n-bot)
[06:48:26] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:48:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:57] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:39] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:49:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:54] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:49:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P29499 and previous config saved to /var/cache/conftool/dbconfig/20220608-065256-marostegui.json
[06:52:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:07] <kart_>	 Reverted patch and deployed, but seems now sqlite DB can't be open by cxserver. That's strange! 
[06:55:32] <kart_>	 `SQLITE_CANTOPEN: unable to open database file` at: https://cxserver.wikimedia.org/v2/suggest/sections/Zakir_Hussain_(musician)/en/ml
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T0700).
[07:00:04] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:03] <kart_>	 I'm here and I'll need sticker from previous deployment of cxserver :D
[07:05:13] <wikibugs>	 (CR) KartikMistry: [C: +2] Add explicit dependency to oojs RL module [extensions/UniversalLanguageSelector] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803536 (https://phabricator.wikimedia.org/T309793) (owner: KartikMistry)
[07:05:22] <kart_>	 ^ will deploy this.
[07:07:12] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[07:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:15] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[07:07:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:37] <kart_>	 ^ was testing if I've deployed properly or not.
[07:08:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29500 and previous config saved to /var/cache/conftool/dbconfig/20220608-070801-marostegui.json
[07:08:02] <kart_>	 akosiaris: I've reverted patch, but seems few config is not updated. What can be reason(s)?
[07:08:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[07:08:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[07:08:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:07] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[07:08:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T310011)', diff saved to https://phabricator.wikimedia.org/P29501 and previous config saved to /var/cache/conftool/dbconfig/20220608-070809-marostegui.json
[07:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:49] <marostegui>	 kart_: From which server would you connect from?
[07:14:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T310011)', diff saved to https://phabricator.wikimedia.org/P29502 and previous config saved to /var/cache/conftool/dbconfig/20220608-071430-marostegui.json
[07:14:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:35] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[07:16:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:16:18] <kart_>	 marostegui: cxserver from Production: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663/8/helmfile.d/services/cxserver/values.yaml#85 was updated there.
[07:17:09] <marostegui>	 kart_: Yeah I mean if you have a hostname for me to test the connection manually
[07:18:31] <moritzm>	 !log imported openjdk 8u332-ga-1~deb11u1 to apt.wikimedia.org/bullseye-wikimedia (rebuild of latest Java security fixes for Bullseye)
[07:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:59] <kart_>	 marostegui: no idea if cxserver pods can be tested.
[07:19:18] <kart_>	 marostegui: as service runs on deployment-charts with docker
[07:19:57] <marostegui>	 kart_: Ah ok, let's see if akosiaris can help here then :)
[07:20:23] <kart_>	 My another issue is - why config revert is not reflected after deployment :/
[07:20:33] <kart_>	 marostegui: yeah, will wait for him.
[07:20:44] <marostegui>	 kart_: I am reviewing the DB and the grants just in case
[07:20:58] <kart_>	 marostegui: OK. Thanks!
[07:21:04] <wikibugs>	 (Merged) jenkins-bot: Add explicit dependency to oojs RL module [extensions/UniversalLanguageSelector] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803536 (https://phabricator.wikimedia.org/T309793) (owner: KartikMistry)
[07:21:16] <kart_>	 OK. Time to deploy another fix!
[07:21:30] <marostegui>	 kart_: The grants are pretty wide in terms of allowed networks, so it is probably as you said, some firewall rules missing I guess
[07:21:57] <moritzm>	 !log imported cassandra 3.11.13 to component/cassandradev T309878
[07:22:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:01] <stashbot>	 T309878: Import Debian package of Cassandra 3.11.13 as 'dev' version - https://phabricator.wikimedia.org/T309878
[07:27:46] <kart_>	 marostegui: Thanks for checking! I'll need help for firewalls then.. Looking at other examples with m5 access.
[07:29:16] <marostegui>	 kart_: Maybe moritzm can help, I recall he helped someone else with firewall accesses to misc clusters :)
[07:29:23] <marostegui>	 Morning moritzm :p
[07:29:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P29503 and previous config saved to /var/cache/conftool/dbconfig/20220608-072935-marostegui.json
[07:29:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:29:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:20] <logmsgbot>	 !log kartik@deploy1002 Synchronized php-1.39.0-wmf.15/extensions/UniversalLanguageSelector/extension.json: Backport: [[gerrit:803536|Add explicit dependency to oojs RL module (T309793)]] (duration: 03m 31s)
[07:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:23] <stashbot>	 T309793: Unexpected OOUI payload on page views (+70KB JS transfer size since 2022-04-14) - https://phabricator.wikimedia.org/T309793
[07:31:02] <wikibugs>	 (PS6) Slyngshede: logster::job migrate cron to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673)
[07:31:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1143 on s4 with small weight after installing 10.6 T310114', diff saved to https://phabricator.wikimedia.org/P29504 and previous config saved to /var/cache/conftool/dbconfig/20220608-073132-root.json
[07:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:37] <stashbot>	 T310114: Migrate a s4 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T310114
[07:32:54] <wikibugs>	 (PS1) Marostegui: db1143: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/803862 (https://phabricator.wikimedia.org/T310114)
[07:33:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:33:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:33:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:59] <wikibugs>	 (CR) Marostegui: [C: +2] db1143: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/803862 (https://phabricator.wikimedia.org/T310114) (owner: Marostegui)
[07:35:22] <moritzm>	 kart_: sure, what needs access to where?
[07:36:54] <kart_>	 moritzm: cxserver access to m5 hosted cxserverdb.
[07:37:09] <kart_>	 moritzm: see: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663
[07:37:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:37:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:39] <kart_>	 moritzm: specifically: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663/8/helmfile.d/services/cxserver/values.yaml sets the network policy.
[07:39:14] <moritzm>	 that's a k8s-specific configuration knob that I'm not familiar with; this will need someone from service SRE to have a look
[07:40:28] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1055.eqiad.wmnet with OS bullseye
[07:40:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:33] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1055.eqiad.wmnet with OS bullseye
[07:40:35] <kart_>	 moritzm: OK! 
[07:41:34] <kart_>	 moritzm: Also, any idea why my revert of the deployment-charts patch is not reflected yet? The config still can't find the sqlite DB, even though the configuration is reverted and deployed.
[07:42:39] <icinga-wm>	 RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2004.codfw.wmnet
[07:44:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P29505 and previous config saved to /var/cache/conftool/dbconfig/20220608-074440-marostegui.json
[07:44:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:00] <moritzm>	 likewise, this will need some help from service SRE
[07:46:06] <kart_>	 OK!
[07:46:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2004.codfw.wmnet
[07:46:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:43] <moritzm>	 !log adding additional disk for /srv to webperf1004 T305460
[07:50:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:46] <stashbot>	 T305460: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460
[07:52:55] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1055.eqiad.wmnet with reason: host reimage
[07:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Agreed re: running in codfw, LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/803586 (https://phabricator.wikimedia.org/T301017) (owner: 10Cwhite)
[07:56:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1055.eqiad.wmnet with reason: host reimage
[07:56:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35790/console" [puppet] - 10https://gerrit.wikimedia.org/r/803553 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[07:59:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T310011)', diff saved to https://phabricator.wikimedia.org/P29506 and previous config saved to /var/cache/conftool/dbconfig/20220608-075947-marostegui.json
[07:59:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[07:59:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[07:59:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:53] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[07:59:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: generate per-service TCP blackbox module [puppet] - 10https://gerrit.wikimedia.org/r/803553 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[08:01:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: set SNI for ldap-ro [puppet] - 10https://gerrit.wikimedia.org/r/803554 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[08:03:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[08:03:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[08:03:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29507 and previous config saved to /var/cache/conftool/dbconfig/20220608-080358-marostegui.json
[08:04:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:00] <wikibugs>	 (03PS1) 10Ayounsi: Monitoring: don't check for BGP on cloudsw2 [puppet] - 10https://gerrit.wikimedia.org/r/803866
[08:09:26] <wikibugs>	 (03PS2) 10Ayounsi: Monitoring: don't check for BGP on cloudsw2 [puppet] - 10https://gerrit.wikimedia.org/r/803866
[08:10:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29508 and previous config saved to /var/cache/conftool/dbconfig/20220608-081025-marostegui.json
[08:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:29] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[08:13:28] <wikibugs>	 (03CR) 10Ayounsi: "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloudsw2-c8-eqiad.mgmt&service=BGP+status" [puppet] - 10https://gerrit.wikimedia.org/r/803866 (owner: 10Ayounsi)
[08:13:58] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cloudsw2-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist ayounsi https://gerrit.wikimedia.org/r/c/operations/puppet/+/803866 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:13:58] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cloudsw2-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist ayounsi https://gerrit.wikimedia.org/r/c/operations/puppet/+/803866 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:14:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[08:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:31] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[08:14:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:01] <kart_>	 ^ That was me checking the status of the release on eqiad.
[08:21:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, thanks for tackling this! See inline" [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall)
[08:22:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] tox: add formattercheck [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801643 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi)
[08:22:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Export lag as a Gauge metric [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801645 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi)
[08:22:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Run isort/black on the codebase [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801642 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi)
[08:22:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Use etcdmirror namespace for metrics [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801644 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi)
[08:23:34] <wikibugs>	 (03Merged) 10jenkins-bot: Run isort/black on the codebase [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801642 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi)
[08:23:39] <wikibugs>	 (03Merged) 10jenkins-bot: tox: add formattercheck [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801643 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi)
[08:23:46] <wikibugs>	 (03Merged) 10jenkins-bot: Use etcdmirror namespace for metrics [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801644 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi)
[08:23:48] <wikibugs>	 (03Merged) 10jenkins-bot: Export lag as a Gauge metric [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801645 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi)
[08:25:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P29509 and previous config saved to /var/cache/conftool/dbconfig/20220608-082531-marostegui.json
[08:25:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P29510 and previous config saved to /var/cache/conftool/dbconfig/20220608-084036-marostegui.json
[08:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:41] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.9.0" for 540 hosts
[08:43:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:01] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.9.0" completed for 540 hosts
[08:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:45] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1055.eqiad.wmnet with OS bullseye
[08:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:49] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1055.eqiad.wmnet with OS bullseye completed: - ms-be1055 (**PASS**)   - Downtim...
[08:49:19] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2022-05-31-045829-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/803869 (https://phabricator.wikimedia.org/T273505)
[08:50:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Merge tag 'upstream/0.0.7' into debian [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803870
[08:50:12] <wikibugs>	 (03PS1) 10Filippo Giunchedi: New release 0.0.7-1 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803871 (https://phabricator.wikimedia.org/T309546)
[08:50:50] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1056.eqiad.wmnet with OS bullseye
[08:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:54] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1056.eqiad.wmnet with OS bullseye
[08:55:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29511 and previous config saved to /var/cache/conftool/dbconfig/20220608-085541-marostegui.json
[08:55:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[08:55:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[08:55:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:46] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[08:55:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T310011)', diff saved to https://phabricator.wikimedia.org/P29512 and previous config saved to /var/cache/conftool/dbconfig/20220608-085554-marostegui.json
[08:55:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:20] <wikibugs>	 (03PS3) 10Muehlenhoff: Failover idp.w.o to idp1002 (new Bullseye node) [dns] - 10https://gerrit.wikimedia.org/r/802541 (https://phabricator.wikimedia.org/T308214)
[08:59:40] * kart_ deploying cxserver to test the old config issue. Let's see how it goes now...
[09:00:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Failover active IDP nodes to idp1002/idp2002 [puppet] - 10https://gerrit.wikimedia.org/r/802542 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff)
[09:01:31] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-05-31-045829-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/803869 (https://phabricator.wikimedia.org/T273505) (owner: 10KartikMistry)
[09:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:02:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T310011)', diff saved to https://phabricator.wikimedia.org/P29513 and previous config saved to /var/cache/conftool/dbconfig/20220608-090201-marostegui.json
[09:02:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:06] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[09:03:27] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1056.eqiad.wmnet with reason: host reimage
[09:03:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:12] <wikibugs>	 (03CR) 10Elukey: "Ah snap sorry! Thanks for the follow up!" [homer/public] - 10https://gerrit.wikimedia.org/r/803549 (https://phabricator.wikimedia.org/T302198) (owner: 10Cathal Mooney)
[09:04:47] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2022-05-31-045829-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/803869 (https://phabricator.wikimedia.org/T273505) (owner: 10KartikMistry)
[09:06:05] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1056.eqiad.wmnet with reason: host reimage
[09:06:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:30] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[09:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:05] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[09:08:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:04] <icinga-wm>	 RECOVERY - DPKG on deneb is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:09:50] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[09:09:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:37] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[09:10:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:48] <wikibugs>	 (03PS1) 10Slyngshede: aptrepo::repo allow notification subject to be changed. [puppet] - 10https://gerrit.wikimedia.org/r/803872
[09:12:55] <kart_>	 ah eqiad diff still shows 2 patches behind!
[09:13:02] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35793/console" [puppet] - 10https://gerrit.wikimedia.org/r/803872 (owner: 10Slyngshede)
[09:13:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1143 on s4 with small weight after installing 10.6 T310114', diff saved to https://phabricator.wikimedia.org/P29514 and previous config saved to /var/cache/conftool/dbconfig/20220608-091331-root.json
[09:13:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:36] <stashbot>	 T310114: Migrate a s4 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T310114
[09:13:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "Blocked on I425d869085" [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[09:17:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P29515 and previous config saved to /var/cache/conftool/dbconfig/20220608-091706-marostegui.json
[09:17:07] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[09:17:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:57] <akosiaris>	 kart_: o/
[09:18:03] <akosiaris>	 I am around now, what is the issue? 
[09:18:05] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[09:18:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:10] <kart_>	 akosiaris: weird issues :)
[09:19:30] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1056.eqiad.wmnet with OS bullseye
[09:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:34] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1056.eqiad.wmnet with OS bullseye completed: - ms-be1056 (**PASS**)   - Downtim...
[09:19:45] <kart_>	 akosiaris: Can you run `helmfile -e eqiad status` and then `helmfile -e codfw status` and see why the two show differences for the release?
[09:20:20] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:20:33] <akosiaris>	 kart_: the revision you mean? 16 vs 18? 
[09:20:40] <kart_>	 akosiaris: yes.
[09:20:47] <kart_>	 akosiaris: both should be on 18.
[09:20:55] <akosiaris>	 every deployment is a revision, so there have just been more deployments in codfw 
[09:20:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Failover idp.w.o to idp1002 (new Bullseye node) [dns] - 10https://gerrit.wikimedia.org/r/802541 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff)
[09:21:05] <akosiaris>	 it's normal that they diverge
[09:21:33] <kart_>	 akosiaris: ok. That solves first doubt.
[09:22:19] <akosiaris>	 what's the next one?
[09:22:57] <kart_>	 akosiaris: I deployed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663 and it couldn't connect to the database. So, I reverted it with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/803537 and deployed it.
[09:23:17] <kart_>	 akosiaris: however, cxserver now can't seem to find the old config.
[09:23:31] <kart_>	 akosiaris: see: eg. https://cxserver.wikimedia.org/v2/suggest/sections/Gujarat/en/gu
[09:23:36] <akosiaris>	 ah, it's still using the new chart version, 0.1.2 
[09:24:12] <akosiaris>	 you can pin the chart version in helmfile to bypass that issue for now, but more importantly, why was it not able to connect to the database?
[09:24:17] <akosiaris>	 what was the error?
[09:25:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/803872 (owner: 10Slyngshede)
[09:25:32] <kart_>	 akosiaris: Earlier error: `{"status":500,"type":"internal_error","title":"Error","detail":"connect ECONNREFUSED 127.0.0.1:3306","method":"GET","uri":"/v2/suggest/sections/Gujarat/en/gu"}`
[09:26:16] <kart_>	 akosiaris: how do I pin the chart version for now? Section Translation is currently broken without the sqlite DB.
[09:26:27] <kart_>	 (Will note this down for future ref!)
[09:27:35] <kart_>	 akosiaris: marostegui checked the grants etc. and they were OK.
[09:29:49] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: cxserver: Pin chart version to stop the bleeding [deployment-charts] - 10https://gerrit.wikimedia.org/r/803873
[09:29:54] <akosiaris>	 kart_: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/803873
[09:30:01] <akosiaris>	 review, merge and deploy please :-)
[09:30:09] <kart_>	 Sure!
[09:30:10] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[09:31:04] <kart_>	 Else ^^ :/
[09:31:59] <kart_>	 akosiaris: did I miss anything in the network policy in the earlier patch? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663
[09:32:09] <wikibugs>	 (03PS1) 10Slyngshede: profile::aptrepo::wikimedia Wrapper script for reprepro. [puppet] - 10https://gerrit.wikimedia.org/r/803874
[09:32:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P29516 and previous config saved to /var/cache/conftool/dbconfig/20220608-093211-marostegui.json
[09:32:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:17] <akosiaris>	 kart_: the error is "connect ECONNREFUSED 127.0.0.1:3306"
[09:33:31] <akosiaris>	 so the config is probably wrong, it's trying to connect to localhost for some reason
[09:33:48] <akosiaris>	 the policy hadn't even begun to matter
[09:33:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update spec file to use new bullseye nodes [puppet] - 10https://gerrit.wikimedia.org/r/802543 (owner: 10Muehlenhoff)
[09:34:20] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] cxserver: Pin chart version to stop the bleeding [deployment-charts] - 10https://gerrit.wikimedia.org/r/803873 (owner: 10Alexandros Kosiaris)
[09:35:43] <akosiaris>	 kart_: deploy to codfw and eqiad to stop the bleeding and then let's use the staging environment to figure out what happened. We can debug with less pressure there
[09:36:06] <wikibugs>	 (03PS1) 10Jbond: C:apereo_cas: fix whitespace in config file [puppet] - 10https://gerrit.wikimedia.org/r/803875
[09:36:33] <kart_>	 akosiaris: sure!
[09:36:49] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35795/console" [puppet] - 10https://gerrit.wikimedia.org/r/803875 (owner: 10Jbond)
[09:37:08] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35796/console" [puppet] - 10https://gerrit.wikimedia.org/r/803874 (owner: 10Slyngshede)
[09:37:21] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons.
[09:37:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:38] <wikibugs>	 (03Merged) 10jenkins-bot: cxserver: Pin chart version to stop the bleeding [deployment-charts] - 10https://gerrit.wikimedia.org/r/803873 (owner: 10Alexandros Kosiaris)
[09:38:59] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[09:39:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:34] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[09:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:16] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[09:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:45] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[09:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:48] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[09:40:52] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[09:41:16] <akosiaris>	 nice
[09:41:29] <kart_>	 akosiaris: deployed. Thanks!!
[09:41:51] <kart_>	 akosiaris: Wondering why the chart wasn't reverted with the deployment-charts patch? Any specific reason?
[09:43:10] <akosiaris>	 kart_: helmfile will always pick the highest version that exists. So reverting a chart version requires the pinning that we did above
[09:43:32] <kart_>	 OK. Noting this down!
[09:43:34] <akosiaris>	 now more to the debugging aspect of it
[09:44:28] <akosiaris>	 in staging I see 
[09:44:30] <akosiaris>	       sectionmapping:
[09:44:30] <akosiaris>	         database: cxserverdb
[09:44:30] <akosiaris>	         type: mysql
[09:44:35] <akosiaris>	 staging still runs 0.1.2 btw
[09:44:48] <akosiaris>	 so the config for some reason hasn't picked up the databases needed
[09:46:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs: T309526 - btullis@cumin1001
[09:46:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T310011)', diff saved to https://phabricator.wikimedia.org/P29517 and previous config saved to /var/cache/conftool/dbconfig/20220608-094716-marostegui.json
[09:47:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:20] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[09:47:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance
[09:47:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance
[09:47:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: Maintenance
[09:47:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:30] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: Maintenance
[09:47:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:33] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] aptrepo::repo allow notification subject to be changed. [puppet] - 10https://gerrit.wikimedia.org/r/803872 (owner: 10Slyngshede)
[09:47:38] <wikibugs>	 (03CR) 10Samtar: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/803877 (https://phabricator.wikimedia.org/T274359) (owner: 10Samtar)
[09:47:50] <wikibugs>	 (03PS2) 10Samtar: changeprop: Modify page denylist [deployment-charts] - 10https://gerrit.wikimedia.org/r/803877 (https://phabricator.wikimedia.org/T274359)
[09:49:15] <kart_>	 akosiaris: Also, the host is set per environment, i.e. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/801663/8/helmfile.d/services/cxserver/values-codfw.yaml
[09:49:22] <kart_>	 Is this OK?
[09:49:33] <akosiaris>	 yeah, that was always the idea
[09:49:43] <akosiaris>	 but note that we haven't set a password, have we?
[09:49:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[09:49:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[09:49:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29518 and previous config saved to /var/cache/conftool/dbconfig/20220608-094952-marostegui.json
[09:49:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:00] <kart_>	 akosiaris: That's done by Amir1.
[09:51:45] <akosiaris>	 kart_: ah I see it on deploy1002, but it's the wrong section in the yaml file I think
[09:51:58] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:aqs: T309526 - btullis@cumin1001
[09:52:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:29] <kart_>	 akosiaris: ouch and probably not done for staging also? Is that OK?
[09:53:30] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:53:30] <akosiaris>	 yup, not done for staging and no, it's not ok. Let me fix that
[09:54:02] <kart_>	 ah. Yaml spacing :/
[09:55:38] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons.
[09:55:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:44] <wikibugs>	 (PS1) Hnowlan: service: configure image-suggestion probes [puppet] - https://gerrit.wikimedia.org/r/803878
[09:56:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29519 and previous config saved to /var/cache/conftool/dbconfig/20220608-095635-marostegui.json
[09:56:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:38] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[09:57:52] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:58:03] <akosiaris>	 kart_: gonna create a new chart version to accommodate all that, I'll post a patch in ~10m
[09:59:08] <kart_>	 akosiaris: cool. Thanks!
[10:02:53] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs100*: T309526 - btullis@cumin1001
[10:02:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:59] <wikibugs>	 (PS1) Alexandros Kosiaris: cxserver: Support sectionmapping config [deployment-charts] - https://gerrit.wikimedia.org/r/803882
[10:09:04] <wikibugs>	 (CR) JMeybohm: [C: +1] Merge tag 'upstream/0.0.7' into debian [software/etcd-mirror] - https://gerrit.wikimedia.org/r/803870 (owner: Filippo Giunchedi)
[10:09:15] <wikibugs>	 (CR) JMeybohm: [C: +1] New release 0.0.7-1 [software/etcd-mirror] - https://gerrit.wikimedia.org/r/803871 (https://phabricator.wikimedia.org/T309546) (owner: Filippo Giunchedi)
[10:11:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P29520 and previous config saved to /var/cache/conftool/dbconfig/20220608-101140-marostegui.json
[10:11:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:28] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching aqs100*: T309526 - btullis@cumin1001
[10:12:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:16] <wikibugs>	 (PS1) Muehlenhoff: profile::mariadb::ferm_misc: Remove old buster IDP nodes [puppet] - https://gerrit.wikimedia.org/r/803883 (https://phabricator.wikimedia.org/T308214)
[10:16:57] <wikibugs>	 (PS2) Jbond: C:apereo_cas: Disable u2f by default [puppet] - https://gerrit.wikimedia.org/r/803875 (https://phabricator.wikimedia.org/T296629)
[10:18:14] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35797/console" [puppet] - https://gerrit.wikimedia.org/r/803875 (https://phabricator.wikimedia.org/T296629) (owner: Jbond)
[10:18:51] <wikibugs>	 SRE, Generated Data Platform, Image-Suggestions, serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (hnowlan) This is pretty much done. We currently only have two main metrics for the service so there's a very ba...
[10:20:00] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:20:40] <wikibugs>	 SRE, Generated Data Platform, Image-Suggestions, serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (hnowlan) Open→Resolved a:hnowlan
[10:20:46] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/803875 (https://phabricator.wikimedia.org/T296629) (owner: Jbond)
[10:21:32] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:21:44] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:23:50] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:26:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P29521 and previous config saved to /var/cache/conftool/dbconfig/20220608-102645-marostegui.json
[10:26:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:04] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 59359 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[10:28:12] <kart_>	 akosiaris: Thanks. Looking at the patch..
[10:32:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM webperf1004.eqiad.wmnet
[10:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:07] <wikibugs>	 (PS2) Alexandros Kosiaris: cxserver: Support sectionmapping config [deployment-charts] - https://gerrit.wikimedia.org/r/803882
[10:37:24] <akosiaris>	 kart_: found a bug, fixed. patchset #2 looks ok to me though
[10:38:01] <wikibugs>	 (CR) Samtar: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/803886 (https://phabricator.wikimedia.org/T310133) (owner: Samtar)
[10:38:22] <wikibugs>	 (PS1) Alexandros Kosiaris: cxserver: Remove the chart version pinning [deployment-charts] - https://gerrit.wikimedia.org/r/803887
[10:38:42] <akosiaris>	 I've also uploaded the removal of the chart version pinning ^
[10:38:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM webperf1004.eqiad.wmnet
[10:39:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:24] <kart_>	 akosiaris: cool.
[10:40:57] <akosiaris>	 kart_: wanna give a +1 and try it out in staging ?
[10:41:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:41:09] <akosiaris>	 actually a +2, not a +1 
[10:41:24] <wikibugs>	 (CR) KartikMistry: [C: +1] cxserver: Support sectionmapping config [deployment-charts] - https://gerrit.wikimedia.org/r/803882 (owner: Alexandros Kosiaris)
[10:41:28] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:41:35] <kart_>	 ah. 
[10:41:37] <kart_>	 :)
[10:41:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T310011)', diff saved to https://phabricator.wikimedia.org/P29522 and previous config saved to /var/cache/conftool/dbconfig/20220608-104150-marostegui.json
[10:41:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[10:41:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[10:41:54] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[10:41:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:01] <kart_>	 Should I also +2 on chart pinning?
[10:42:20] <wikibugs>	 (CR) KartikMistry: [C: +2] cxserver: Support sectionmapping config [deployment-charts] - https://gerrit.wikimedia.org/r/803882 (owner: Alexandros Kosiaris)
[10:42:26] <akosiaris>	 kart_: leave that for after we've tested in staging and deem my change ok
[10:42:37] <kart_>	 OK!
[10:42:42] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:43:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1004.eqiad.wmnet
[10:43:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:44] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:44:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1004.eqiad.wmnet
[10:44:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:22] <wikibugs>	 (Merged) jenkins-bot: cxserver: Support sectionmapping config [deployment-charts] - https://gerrit.wikimedia.org/r/803882 (owner: Alexandros Kosiaris)
[10:46:16] <kart_>	 And, now I should deploy in staging, akosiaris?
[10:46:46] <akosiaris>	 kart_: 👍
[10:48:06] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[10:48:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:26] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[10:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:43] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] C:apereo_cas: Disable u2f by default [puppet] - https://gerrit.wikimedia.org/r/803875 (https://phabricator.wikimedia.org/T296629) (owner: Jbond)
[10:50:44] <kart_>	 akosiaris: done. give me a few minutes, brb.
[10:52:44] <wikibugs>	 (PS1) Volans: sre.swift.convert-ssds: fix logic to skip disks [cookbooks] - https://gerrit.wikimedia.org/r/803888
[10:53:45] <akosiaris>	 kart_: I see curl https://staging.svc.eqiad.wmnet:4002/v2/suggest/sections/Gujarat/en/gu from deploy1002 works fine
[10:54:15] <akosiaris>	 so, I'd say +2 the revert of the chart version pinning and proceed with eqiad/codfw
[10:54:31] * akosiaris off for ~1h
[10:55:33] <wikibugs>	 (CR) MVernon: [C: +1] "LGTM thanks :)" [cookbooks] - https://gerrit.wikimedia.org/r/803888 (owner: Volans)
[10:55:50] <wikibugs>	 (CR) Volans: [C: +2] sre.swift.convert-ssds: fix logic to skip disks [cookbooks] - https://gerrit.wikimedia.org/r/803888 (owner: Volans)
[10:57:42] <wikibugs>	 (CR) Jbond: "lgtm couple of minor nits" [puppet] - https://gerrit.wikimedia.org/r/803874 (owner: Slyngshede)
[10:58:58] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/803883 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff)
[10:59:28] <wikibugs>	 (Merged) jenkins-bot: sre.swift.convert-ssds: fix logic to skip disks [cookbooks] - https://gerrit.wikimedia.org/r/803888 (owner: Volans)
[11:01:45] <kart_>	 akosiaris: nice!!
[11:02:40] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1060 is CRITICAL: CRITICAL - degraded: The following units failed: rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:03:32] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.convert-ssds for host ms-be1060
[11:03:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:56] <wikibugs>	 (CR) KartikMistry: [C: +2] cxserver: Remove the chart version pinning [deployment-charts] - https://gerrit.wikimedia.org/r/803887 (owner: Alexandros Kosiaris)
[11:04:32] <wikibugs>	 (PS2) Slyngshede: profile::aptrepo::wikimedia Wrapper script for reprepro. [puppet] - https://gerrit.wikimedia.org/r/803874
[11:04:36] <kart_>	 akosiaris: and, I should deploy in staging also?
[11:04:45] <wikibugs>	 (CR) Slyngshede: profile::aptrepo::wikimedia Wrapper script for reprepro. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/803874 (owner: Slyngshede)
[11:06:50] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/803874 (owner: Slyngshede)
[11:07:07] <wikibugs>	 (Merged) jenkins-bot: cxserver: Remove the chart version pinning [deployment-charts] - https://gerrit.wikimedia.org/r/803887 (owner: Alexandros Kosiaris)
[11:07:51] <wikibugs>	 (CR) Slyngshede: [C: +2] profile::aptrepo::wikimedia Wrapper script for reprepro. [puppet] - https://gerrit.wikimedia.org/r/803874 (owner: Slyngshede)
[11:10:09] <kart_>	 oh that's facepalm :D
[11:11:14] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[11:11:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:43] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[11:11:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:30] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[11:13:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:23] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[11:14:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:17:11] <kart_>	 akosiaris: Thanks a lot. The main issue is solved; the API result now comes back with unrelated data, but that's a separate issue for the developers to solve I guess :)
[11:20:23] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-ssds (exit_code=99) for host ms-be1060
[11:20:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[11:22:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[11:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29524 and previous config saved to /var/cache/conftool/dbconfig/20220608-112233-marostegui.json
[11:22:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:37] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[11:23:44] <wikibugs>	 (PS7) Jbond: WIP: Early start on firmware cookbook [cookbooks] - https://gerrit.wikimedia.org/r/763215
[11:25:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "Idea LGTM, see inline tho" [puppet] - https://gerrit.wikimedia.org/r/803878 (owner: Hnowlan)
[11:26:34] <wikibugs>	 (CR) CI reject: [V: -1] WIP: Early start on firmware cookbook [cookbooks] - https://gerrit.wikimedia.org/r/763215 (owner: Jbond)
[11:26:59] <wikibugs>	 (PS2) Hnowlan: service: configure image-suggestion probes [puppet] - https://gerrit.wikimedia.org/r/803878
[11:27:32] <wikibugs>	 (CR) Hnowlan: service: configure image-suggestion probes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803878 (owner: Hnowlan)
[11:27:52] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/803878 (owner: Hnowlan)
[11:28:40] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[11:30:58] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[11:31:56] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM thanks :)" [puppet] - https://gerrit.wikimedia.org/r/803866 (owner: Ayounsi)
[11:33:39] <wikibugs>	 (CR) Hnowlan: [C: +2] service: configure image-suggestion probes [puppet] - https://gerrit.wikimedia.org/r/803878 (owner: Hnowlan)
[11:33:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (faidon) I see on the list a EX4300-48T-AFI. That's likely a mistake -- it should not be that old (= old, but not 8 years old) and we have dozens of these still in production, so keeping it in our spares ma...
[11:34:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29525 and previous config saved to /var/cache/conftool/dbconfig/20220608-113419-marostegui.json
[11:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:25] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[11:36:44] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:38:18] <wikibugs>	 (PS2) Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - https://gerrit.wikimedia.org/r/803506
[11:39:15] <wikibugs>	 (PS1) Muehlenhoff: Switch idp1001/idp2001 to role(insetup) [puppet] - https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214)
[11:43:40] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:45:20] <wikibugs>	 (PS1) MSantos: Re-enable OSM sync in codfw [puppet] - https://gerrit.wikimedia.org/r/803893
[11:46:34] <kart_>	 marostegui: How can I access the m5-master database? It doesn't seem to be accessible via mwmaint and sql.php.
[11:47:11] <kart_>	 marostegui: need to know datatypes of columns for cxserverdb
[11:49:14] <marostegui>	 kart_: mmm I don't think you can access it with those scripts
[11:49:21] <marostegui>	 As those are MW related as far as I know
[11:49:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P29526 and previous config saved to /var/cache/conftool/dbconfig/20220608-114924-marostegui.json
[11:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:32] <marostegui>	 kart_: I can provide those though
[11:50:12] <kart_>	 marostegui: i.e. the result of: `SHOW COLUMNS FROM titles;` on m5-master. You can DM me.
[11:50:57] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Monitoring: don't check for BGP on cloudsw2 [puppet] - https://gerrit.wikimedia.org/r/803866 (owner: Ayounsi)
[11:51:01] <moritzm>	  !log installing django security updates
[11:51:26] <marostegui>	 kart_: https://phabricator.wikimedia.org/P29527
[11:51:34] <marostegui>	 Let me see though if you can access the host yourself in some other way
[11:51:55] <logmsgbot>	 !log jnuche@deploy1002 install-world aborted:  (duration: 00m 02s)
[11:51:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:08] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.9.1" for 540 hosts
[11:52:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:27] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.9.1" completed for 540 hosts
[11:52:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:30] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (MoritzMuehlenhoff)
[11:52:48] <kart_>	 marostegui: Thanks!!
[11:52:54] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:54] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff All done.
[11:57:09] <wikibugs>	 (PS1) MSantos: add maps beta to dsh targets [puppet] - https://gerrit.wikimedia.org/r/803894
[11:59:33] <wikibugs>	 (CR) Muehlenhoff: class:apt Add new private repo. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803512 (owner: Slyngshede)
[12:00:28] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:01:36] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.convert-ssds for host ms-be1064
[12:01:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P29528 and previous config saved to /var/cache/conftool/dbconfig/20220608-120429-marostegui.json
[12:04:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:07] <logmsgbot>	 !log mvernon@cumin2002 END (ERROR) - Cookbook sre.swift.convert-ssds (exit_code=97) for host ms-be1064
[12:05:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:51] <wikibugs>	 (PS3) Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - https://gerrit.wikimedia.org/r/803506
[12:14:58] <wikibugs>	 (CR) CI reject: [V: -1] WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - https://gerrit.wikimedia.org/r/803506 (owner: Slyngshede)
[12:19:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29529 and previous config saved to /var/cache/conftool/dbconfig/20220608-121934-marostegui.json
[12:19:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[12:19:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[12:19:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:40] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[12:19:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29530 and previous config saved to /var/cache/conftool/dbconfig/20220608-121942-marostegui.json
[12:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:36] <wikibugs>	 (PS4) Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - https://gerrit.wikimedia.org/r/803506
[12:22:30] <wikibugs>	 (CR) CI reject: [V: -1] WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - https://gerrit.wikimedia.org/r/803506 (owner: Slyngshede)
[12:23:26] <wikibugs>	 (PS5) Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - https://gerrit.wikimedia.org/r/803506
[12:28:19] <moritzm>	 !log installing rsyslog security updates on Buster
[12:28:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:29] <wikibugs>	 (PS6) Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - https://gerrit.wikimedia.org/r/803506
[12:29:17] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35799/console" [puppet] - https://gerrit.wikimedia.org/r/803506 (owner: Slyngshede)
[12:33:19] <wikibugs>	 (PS7) Slyngshede: profile::aptrepo::wikimedia test public apt repo on Apache [puppet] - https://gerrit.wikimedia.org/r/803506
[12:33:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29531 and previous config saved to /var/cache/conftool/dbconfig/20220608-123320-marostegui.json
[12:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:25] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[12:36:12] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (Papaul)
[12:37:58] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:41:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:42:07] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Papaul)
[12:47:40] <wikibugs>	 (PS3) Slyngshede: class:apt Add new private repo. [puppet] - https://gerrit.wikimedia.org/r/803512
[12:48:12] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2022-06-08-124326-production [deployment-charts] - https://gerrit.wikimedia.org/r/803901 (https://phabricator.wikimedia.org/T306995)
[12:48:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P29532 and previous config saved to /var/cache/conftool/dbconfig/20220608-124825-marostegui.json
[12:48:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:59] <wikibugs>	 (CR) Slyngshede: class:apt Add new private repo. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803512 (owner: Slyngshede)
[12:56:32] <wikibugs>	 SRE-OnFire, Wikidata, wdwb-tech, Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (ItamarWMDE) @Addshore Does it mean we need to then de-abandon that change, or should we just create a new patch to r...
[12:59:31] <wikibugs>	 (CR) Jelto: "thanks for preparing the cleanup!" [puppet] - https://gerrit.wikimedia.org/r/802846 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1300).
[13:00:05] <jouncebot>	 Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:35] * urbanecm waves
[13:00:40] <urbanecm>	 Lucas_WMDE: i guess you'll self-serve?
[13:00:45] <Lucas_WMDE>	 yup
[13:00:54] * Lucas_WMDE looks up what I had scheduled ^^
[13:01:04] <Lucas_WMDE>	 ah yes
[13:01:12] <Lucas_WMDE>	 the big scary ’un
[13:01:37] <Lucas_WMDE>	 (socially scary, not technically scary – little risk of the site going down :D)
[13:01:38] <wikibugs>	 (03PS22) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond)
[13:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:02:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond)
[13:02:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I think the latest PS is good to merge, thank you John for your patience and assistance!" [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond)
[13:03:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond)
[13:03:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P29533 and previous config saved to /var/cache/conftool/dbconfig/20220608-130330-marostegui.json
[13:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:48] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): Refresh English Wikipedia logo file (enwiki.png) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544)
[13:06:17] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "one +1, no complaints here or on Phabricator, should be good to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) (owner: 10Lucas Werkmeister (WMDE))
[13:07:00] <wikibugs>	 (03Merged) 10jenkins-bot: Refresh English Wikipedia logo file (enwiki.png) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) (owner: 10Lucas Werkmeister (WMDE))
[13:08:09] <Lucas_WMDE>	 new enwiki.png looks good on mwdebug1001, syncing
[13:09:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond)
[13:10:43] <Lucas_WMDE>	 is there a scap option to skip php-fpm-restart?
[13:10:57] <Lucas_WMDE>	 I doubt these restarts are actually needed when I’m syncing a YAML or PNG file
[13:11:07] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: include tcp probes in alerts [alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946)
[13:11:08] * urbanecm doesn't know of any
[13:11:19] * urbanecm is also not happy that sync-file takes 3 times longer than it used to
[13:11:27] <Lucas_WMDE>	 ok
[13:11:41] <Lucas_WMDE>	 k8s will fix all of that ;)
[13:12:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:801405|Refresh English Wikipedia logo file (enwiki.png) (T309544)]] (1/3, no-op) (duration: 03m 32s)
[13:12:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:40] <stashbot>	 T309544: enwiki.png slightly inconsistent with dewiki.png, enwiki-2x.png, dewiki-2x.png - https://phabricator.wikimedia.org/T309544
[13:12:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre: include tcp probes in alerts [alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[13:12:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:12:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:13:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:00] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:21] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2029.codfw.wmnet
[13:15:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:16:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:801405|Refresh English Wikipedia logo file (enwiki.png) (T309544)]] (2/3, no-op) (duration: 03m 35s)
[13:16:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:26] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:18:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29534 and previous config saved to /var/cache/conftool/dbconfig/20220608-131836-marostegui.json
[13:18:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[13:18:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[13:18:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:42] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[13:18:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29535 and previous config saved to /var/cache/conftool/dbconfig/20220608-131844-marostegui.json
[13:18:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:09] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2029.codfw.wmnet
[13:21:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/project-logos/enwiki.png: Config: [[gerrit:801405|Refresh English Wikipedia logo file (enwiki.png) (T309544)]] (3/3, needs subsequent purge) (duration: 03m 44s)
[13:21:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:31] <stashbot>	 T309544: enwiki.png slightly inconsistent with dewiki.png, enwiki-2x.png, dewiki-2x.png - https://phabricator.wikimedia.org/T309544
[13:22:00] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ echo 'https://en.wikipedia.org/static/images/project-logos/enwiki.png' | mwscript purgeList.php # T309544
[13:22:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:40] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:24:55] <moritzm>	 !log installing rsyslog security updates on Bullseye
[13:24:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:44] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:25:44] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:25:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:11] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2030.codfw.wmnet
[13:26:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:36] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:56] <wikibugs>	 (03PS2) 10Filippo Giunchedi: sre: include tcp probes in alerts [alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946)
[13:29:30] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:31:24] <wikibugs>	 (03CR) 10Muehlenhoff: class:apt Add new private repo. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede)
[13:32:07] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2030.codfw.wmnet
[13:32:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29536 and previous config saved to /var/cache/conftool/dbconfig/20220608-133420-marostegui.json
[13:34:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:26] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[13:35:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (10Jclark-ctr) >>! In T307140#7988395, @faidon wrote: > I see on the list a EX4300-48T-AFI. That's likely a mistake -- it should not be that old (= old, but not 8 years old) and we have dozens of these still...
[13:37:09] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2031.codfw.wmnet
[13:37:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff)
[13:37:54] <moritzm>	 !log installing apache-log4j1.2 security updates
[13:37:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:50] <wikibugs>	 (03PS5) 10Eevans: WIP: Configure AQS Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)
[13:41:16] <wikibugs>	 (03PS4) 10Slyngshede: class:apt Add new private repo. [puppet] - 10https://gerrit.wikimedia.org/r/803512
[13:41:19] <wikibugs>	 (03PS1) 10Eevans: Pin Cassandra 3.11.13 as 'dev' [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896)
[13:41:21] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Use absolute namespace in Profiler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803904 (https://phabricator.wikimedia.org/T310155)
[13:41:50] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] Pin Cassandra 3.11.13 as 'dev' [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans)
[13:42:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] class:apt Add new private repo. [puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede)
[13:42:40] <wikibugs>	 (03PS5) 10Slyngshede: class:apt Add new private repo. [puppet] - 10https://gerrit.wikimedia.org/r/803512
[13:43:56] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Or it could just be ServiceConfig::class, I suppose, since it’s the same namespace." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803904 (https://phabricator.wikimedia.org/T310155) (owner: 10Lucas Werkmeister (WMDE))
[13:44:53] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2031.codfw.wmnet
[13:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:24] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/refinery@64ddb08]: Regular analytics weekly train [analytics/refinery@64ddb08]
[13:46:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Switch idp1001/idp2001 to role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff)
[13:49:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P29537 and previous config saved to /var/cache/conftool/dbconfig/20220608-134925-marostegui.json
[13:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:54] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2032.codfw.wmnet
[13:49:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:43] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.swift.convert-ssds for host ms-be1064
[13:51:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] backup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/801631 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[13:54:50] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.swift.convert-ssds (exit_code=99) for host ms-be1064
[13:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:48] <wikibugs>	 (03PS2) 10Muehlenhoff: exim4: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803548 (https://phabricator.wikimedia.org/T308013)
[13:56:07] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2032.codfw.wmnet
[13:56:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:45] <jan_drewniak>	 Hi all, is it ok if I make a last-minute addition to the backport window?
[13:57:25] <jan_drewniak>	 I'm just going to be deploying a portal update for some fundraising banners https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/803552
[13:57:27] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 (10ssingh)
[13:57:53] <jan_drewniak>	 looks like there's nothing happening deployment-wise right now
[13:58:37] <Lucas_WMDE>	 jan_drewniak: feel free to deploy
[13:58:52] <jan_drewniak>	 k thanks
[13:58:59] <wikibugs>	 (03PS2) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803552 (https://phabricator.wikimedia.org/T128546)
[13:59:31] <wikibugs>	 (03PS3) 10Ssingh: dnsdist: add support for retaining capabilites after startup [puppet] - 10https://gerrit.wikimedia.org/r/784270
[14:00:03] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "I'm happy with this. Would you like me to +2 and merge?" [puppet] - 10https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: 10Eevans)
[14:00:33] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-be1064 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T310160 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:00:37] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1064 - https://phabricator.wikimedia.org/T310160 (10ops-monitoring-bot)
[14:00:54] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803552 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[14:01:08] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2033.codfw.wmnet
[14:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:10] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803552 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[14:02:30] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:03:00] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is CRITICAL: connect to address 10.64.32.189 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:03:28] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is CRITICAL: connect to address 10.64.32.190 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:04:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P29538 and previous config saved to /var/cache/conftool/dbconfig/20220608-140430-marostegui.json
[14:04:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:06] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is OK: TCP OK - 0.000 second response time on 10.64.32.189 port 9042 https://phabricator.wikimedia.org/T93886
[14:05:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:06:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:17] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet
[14:06:18] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ms-be1064.eqiad.wmnet
[14:06:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:09] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:803552| Bumping portals to master (T128546)]] (duration: 03m 30s)
[14:07:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:12] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[14:07:12] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2033.codfw.wmnet
[14:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:40] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is OK: TCP OK - 0.000 second response time on 10.64.32.190 port 9042 https://phabricator.wikimedia.org/T93886
[14:09:23] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/refinery@64ddb08]: Regular analytics weekly train [analytics/refinery@64ddb08] (duration: 22m 59s)
[14:09:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:32] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:10:22] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:803552| Bumping portals to master (T128546)]] (duration: 03m 12s)
[14:10:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:12:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:14] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2034.codfw.wmnet
[14:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:04] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:13:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:13:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:13:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:13:50] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:16:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:10] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2034.codfw.wmnet
[14:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T310011)', diff saved to https://phabricator.wikimedia.org/P29539 and previous config saved to /var/cache/conftool/dbconfig/20220608-141936-marostegui.json
[14:19:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:19:40] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[14:19:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] gdnsd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799307 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[14:22:48] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/refinery@64ddb08] (thin): Regular analytics weekly train THIN [analytics/refinery@64ddb08]
[14:22:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:57] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/refinery@64ddb08] (thin): Regular analytics weekly train THIN [analytics/refinery@64ddb08] (duration: 00m 09s)
[14:22:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:12] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2035.codfw.wmnet
[14:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:47] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/refinery@64ddb08] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@64ddb08]
[14:24:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:50] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:28:56] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2035.codfw.wmnet
[14:28:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:01] <wikibugs>	 (03CR) 10MVernon: [V: 03+2 C: 03+2] Dummy keys and certificates for cassandra (aqs) [labs/private] - 10https://gerrit.wikimedia.org/r/802631 (https://phabricator.wikimedia.org/T307801) (owner: 10Eevans)
[14:30:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance
[14:30:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance
[14:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: Maintenance
[14:30:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: Maintenance
[14:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:00] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/refinery@64ddb08] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@64ddb08] (duration: 07m 12s)
[14:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:57] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2036.codfw.wmnet
[14:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[14:34:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[14:34:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T310011)', diff saved to https://phabricator.wikimedia.org/P29541 and previous config saved to /var/cache/conftool/dbconfig/20220608-143450-marostegui.json
[14:34:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:56] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[14:36:32] <wikibugs>	 (03PS4) 10Ori: New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698)
[14:37:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] exim4: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803548 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[14:39:52] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2036.codfw.wmnet
[14:39:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:41:09] <wikibugs>	 (03CR) 10Ahmon Dancy: mediawiki: disable revalidation everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto)
[14:42:03] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] mediawiki: disable revalidation everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto)
[14:42:55] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-be1064 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T310181 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:43:00] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1064 - https://phabricator.wikimedia.org/T310181 (10ops-monitoring-bot)
[14:44:54] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2037.codfw.wmnet
[14:44:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:12] <wikibugs>	 (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/803932 (https://phabricator.wikimedia.org/T310150) (owner: Awight)
[14:46:03] <awight>	 I'll do a beta cluster config deployment now.
[14:46:06] <wikibugs>	 (CR) Bking: [V: +1] Use a custom repository-s3 snapshot [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/803628 (https://phabricator.wikimedia.org/T309648) (owner: Ebernhardson)
[14:47:25] <wikibugs>	 (CR) Awight: [C: +2] "Deploying." [mediawiki-config] - https://gerrit.wikimedia.org/r/803932 (https://phabricator.wikimedia.org/T310150) (owner: Awight)
[14:47:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T310011)', diff saved to https://phabricator.wikimedia.org/P29542 and previous config saved to /var/cache/conftool/dbconfig/20220608-144725-marostegui.json
[14:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:30] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[14:47:59] <wikibugs>	 (CR) Bking: [V: +1 C: +2] Use a custom repository-s3 snapshot [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/803628 (https://phabricator.wikimedia.org/T309648) (owner: Ebernhardson)
[14:48:09] <wikibugs>	 (Merged) jenkins-bot: [beta] Switch maps rendering to the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/803932 (https://phabricator.wikimedia.org/T310150) (owner: Awight)
[14:49:24] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2037.codfw.wmnet
[14:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:49] <wikibugs>	 (CR) Muehlenhoff: [C: +2] gdnsd: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/799307 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:50:59] <wikibugs>	 (PS1) Filippo Giunchedi: Bring in pingthing [puppet] - https://gerrit.wikimedia.org/r/803935
[14:51:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:51:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:52:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:52:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:42] <wikibugs>	 (CR) Eevans: [C: +1] Pin Cassandra 3.11.13 as 'dev' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: Eevans)
[14:53:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:53:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:07] <wikibugs>	 (PS1) Filippo Giunchedi: Bring in pingthing alerts [alerts] - https://gerrit.wikimedia.org/r/803936
[14:54:25] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet
[14:54:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:14] <wikibugs>	 (CR) MSantos: [C: +1] [beta] Switch maps rendering to the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/803932 (https://phabricator.wikimedia.org/T310150) (owner: Awight)
[14:56:17] <wikibugs>	 (CR) Herron: [C: +1] Bring in pingthing [puppet] - https://gerrit.wikimedia.org/r/803935 (owner: Filippo Giunchedi)
[14:57:02] <wikibugs>	 (CR) Herron: [C: +1] Bring in pingthing alerts [alerts] - https://gerrit.wikimedia.org/r/803936 (owner: Filippo Giunchedi)
[14:58:14] <ori>	 I want to set up simple monitoring for the function-* services on the beta cluster (deployment-prep), to alert on #wikipedia-abstract-tech when the service is down. Is there an existing setup I can use as reference?
[14:58:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:58:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:59:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:28] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet
[15:00:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:00:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:06] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good. Ping me on IRC tomorrow and then we can deploy." [puppet] - https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: Jaime Nuche)
[15:01:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Bring in pingthing [puppet] - https://gerrit.wikimedia.org/r/803935 (owner: Filippo Giunchedi)
[15:02:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Bring in pingthing alerts [alerts] - https://gerrit.wikimedia.org/r/803936 (owner: Filippo Giunchedi)
[15:02:22] <wikibugs>	 (PS2) Filippo Giunchedi: Bring in pingthing alerts [alerts] - https://gerrit.wikimedia.org/r/803936
[15:02:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P29543 and previous config saved to /var/cache/conftool/dbconfig/20220608-150230-marostegui.json
[15:02:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:25] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:05:29] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet
[15:05:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:53] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:09:34] <wikibugs>	 (CR) Ahmon Dancy: scap: boostrap freshly provisioned scap targets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: Jaime Nuche)
[15:10:30] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet
[15:10:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:27] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (CDanis) a:KFrancis
[15:13:01] <wikibugs>	 SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (CDanis) a:MMiller_WMF
[15:13:42] <godog>	 !log trim swift logs older than 30d from centrallog2002 - T309171
[15:13:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:47] <stashbot>	 T309171: syslog / centrallog log volume growth - https://phabricator.wikimedia.org/T309171
[15:14:02] <godog>	 ori: I'm not aware of anything similar off the top of my head no
[15:14:30] <ori>	 ack, thanks (and hello)
[15:14:45] <wikibugs>	 (PS1) Muehlenhoff: noc: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/803943 (https://phabricator.wikimedia.org/T308013)
[15:14:47] <wikibugs>	 (PS1) Muehlenhoff: wikistats: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/803944 (https://phabricator.wikimedia.org/T308013)
[15:14:48] <godog>	 ori: hi! :D
[15:15:16] <godog>	 the good news is that for production, monitoring is already available via the "probes" option in service::catalog
[15:15:46] <godog>	 not sure if that's the case for function-* though
[15:16:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:17:00] <godog>	 !log trim swift logs older than 30d from centrallog1001 - T309171
[15:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:19] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet
[15:17:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P29544 and previous config saved to /var/cache/conftool/dbconfig/20220608-151735-marostegui.json
[15:17:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:34] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1060.eqiad.wmnet with OS bullseye
[15:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:39] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1060.eqiad.wmnet with OS bullseye
[15:23:41] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[15:24:31] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet
[15:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:30] <wikibugs>	 SRE, SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (CDanis) a:CDanis For now I'll grant `analytics-privatedata-users` and if later it turns out more access is needed, @EBernhardson or @bscarone can re-...
[15:28:35] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:29:33] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet
[15:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:53] <wikibugs>	 (CR) Btullis: [C: +2] Pin Cassandra 3.11.13 as 'dev' [puppet] - https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: Eevans)
[15:32:19] <wikibugs>	 (PS1) Cwhite: logstash: add php7.2-fpm to mediawiki error,exception processing [puppet] - https://gerrit.wikimedia.org/r/803947 (https://phabricator.wikimedia.org/T234565)
[15:32:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T310011)', diff saved to https://phabricator.wikimedia.org/P29545 and previous config saved to /var/cache/conftool/dbconfig/20220608-153240-marostegui.json
[15:32:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[15:32:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[15:32:44] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[15:32:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T310011)', diff saved to https://phabricator.wikimedia.org/P29546 and previous config saved to /var/cache/conftool/dbconfig/20220608-153248-marostegui.json
[15:32:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:15] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet
[15:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:11] <wikibugs>	 (CR) Btullis: [C: +2] "Oh, there is an error from puppet following merge." [puppet] - https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: Eevans)
[15:37:37] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1060.eqiad.wmnet with reason: host reimage
[15:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:21] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is CRITICAL: connect to address 10.64.48.148 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:38:41] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: add php7.2-fpm to mediawiki error,exception processing [puppet] - https://gerrit.wikimedia.org/r/803947 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[15:38:56] <wikibugs>	 SRE, Phabricator, serviceops-radar: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644 (Dzahn) Open→Declined something between resolved and declined. please feel free to reopen though if you feel differently about it.
[15:38:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T310011)', diff saved to https://phabricator.wikimedia.org/P29547 and previous config saved to /var/cache/conftool/dbconfig/20220608-153858-marostegui.json
[15:39:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:02] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[15:39:06] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Scap, serviceops, Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (dancy) Noting the following settings from the deployment-prep horizon project puppet config page: ` profile:...
[15:40:25] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1060.eqiad.wmnet with reason: host reimage
[15:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:12] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:42:13] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:42:28] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.0.213:9042 on aqs1007 is CRITICAL: connect to address 10.64.0.213 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:42:34] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:42:40] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:42:44] <icinga-wm>	 PROBLEM - Checks that the airflow database for airflow research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:43:38] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:43:46] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:44:16] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:44:16] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:44:42] <icinga-wm>	 PROBLEM - Checks that the airflow database for airflow analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:45:25] <dancy>	 jouncebot nowandnext
[15:45:25] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 14 minute(s)
[15:45:25] <jouncebot>	 In 2 hour(s) and 14 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800)
[15:45:26] <jouncebot>	 In 2 hour(s) and 14 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800)
[15:45:48] <icinga-wm>	 RECOVERY - Checks that the airflow database for airflow analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:45:53] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: T309526 btullis
[15:45:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:57] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: T309526 btullis
[15:45:58] <icinga-wm>	 RECOVERY - Checks that the airflow database for airflow research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:45:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:58] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.0.213:9042 on aqs1007 is OK: TCP OK - 0.000 second response time on 10.64.0.213 port 9042 https://phabricator.wikimedia.org/T93886
[15:48:17] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (calbon) Approved!
[15:49:47] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided)
[15:49:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:26] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:54:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P29548 and previous config saved to /var/cache/conftool/dbconfig/20220608-155403-marostegui.json
[15:54:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:52] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Scap, serviceops, Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (dancy) I'm going to change profile::mediawiki::php::restarts::ensure to true and see how things go.
[15:54:54] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is OK: TCP OK - 0.000 second response time on 10.64.48.148 port 9042 https://phabricator.wikimedia.org/T93886
[15:55:03] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1060.eqiad.wmnet with OS bullseye
[15:55:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:06] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1060.eqiad.wmnet with OS bullseye completed: - ms-be1060 (**PASS**)   - Downtim...
[16:07:06] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:09:08] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet
[16:09:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P29549 and previous config saved to /var/cache/conftool/dbconfig/20220608-160908-marostegui.json
[16:09:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:03] <wikibugs>	 (PS6) Slyngshede: class:apt Add new private repo. [puppet] - https://gerrit.wikimedia.org/r/803512
[16:12:50] <wikibugs>	 (PS7) Slyngshede: class:apt Add new private repo. [puppet] - https://gerrit.wikimedia.org/r/803512
[16:13:06] <wikibugs>	 (CR) Slyngshede: class:apt Add new private repo. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803512 (owner: Slyngshede)
[16:13:10] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet
[16:13:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:16] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet
[16:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:25] <wikibugs>	 (CR) Eevans: [C: +1] Pin Cassandra 3.11.13 as 'dev' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: Eevans)
[16:13:46] <wikibugs>	 (CR) CI reject: [V: -1] class:apt Add new private repo. [puppet] - https://gerrit.wikimedia.org/r/803512 (owner: Slyngshede)
[16:18:48] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet
[16:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:54] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet
[16:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:14] <wikibugs>	 (CR) Hashar: "I am dropping myself from the reviewers in favor of Jeena. She wrote that script as part of T255835 and knows about ruamel.yaml :)" [deployment-charts] - https://gerrit.wikimedia.org/r/803886 (https://phabricator.wikimedia.org/T310133) (owner: Samtar)
[16:23:18] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet
[16:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:24] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet
[16:23:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T310011)', diff saved to https://phabricator.wikimedia.org/P29550 and previous config saved to /var/cache/conftool/dbconfig/20220608-162413-marostegui.json
[16:24:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[16:24:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[16:24:17] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[16:24:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T310011)', diff saved to https://phabricator.wikimedia.org/P29551 and previous config saved to /var/cache/conftool/dbconfig/20220608-162422-marostegui.json
[16:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:32] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet
[16:26:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:38] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet
[16:26:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:53] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet
[16:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:59] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet
[16:32:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:25] <icinga-wm>	 PROBLEM - Host doh5002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:27] <icinga-wm>	 PROBLEM - Host ganeti5003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:33] <icinga-wm>	 PROBLEM - Host netflow5002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:32:41] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:32:53] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:33:17] <icinga-wm>	 PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:33:39] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:33:43] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:34:01] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:35:37] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48250 bytes in 1.782 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:35:43] <icinga-wm>	 RECOVERY - Host ganeti5003 is UP: PING OK - Packet loss = 0%, RTA = 223.20 ms
[16:35:53] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.373 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:35:59] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet
[16:36:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:06] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2046.codfw.wmnet
[16:36:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T310011)', diff saved to https://phabricator.wikimedia.org/P29552 and previous config saved to /var/cache/conftool/dbconfig/20220608-163737-marostegui.json
[16:37:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:42] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[16:40:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:40:31] <icinga-wm>	 RECOVERY - Host netflow5002 is UP: PING OK - Packet loss = 0%, RTA = 224.85 ms
[16:40:33] <icinga-wm>	 RECOVERY - Host doh5002 is UP: PING OK - Packet loss = 0%, RTA = 223.35 ms
[16:40:33] <wikibugs>	 SRE, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: Allow idrac ftp fetching of firmware updates (either to existing ftp or new solution) - https://phabricator.wikimedia.org/T283771 (RobH)
[16:40:55] <icinga-wm>	 RECOVERY - Host prometheus5001 is UP: PING OK - Packet loss = 0%, RTA = 246.01 ms
[16:41:23] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet
[16:41:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:29] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet
[16:41:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:49] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:43:41] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:43:47] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:43:49] <icinga-wm>	 RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 98, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:44:11] <icinga-wm>	 RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 347, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:45:41] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (nskaggs) @cmooney , for the manual override, https://wikitech.wikimedia.org/wiki/Network_design_-_Eqiad_WMCS_Network_...
[16:46:50] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2047.codfw.wmnet
[16:46:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:57] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet
[16:46:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:19] <jinxer-wm>	 (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:48:18] <jinxer-wm>	 (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:50:50] <XioNoX>	 let me know if you need any help
[16:51:38] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet
[16:51:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:45] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet
[16:51:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:14] <wikibugs>	 (CR) Dzahn: utils: Add small script to set up bundler (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803341 (owner: Jbond)
[16:52:18] <jinxer-wm>	 (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:52:39] <legoktm>	 Huh?
[16:52:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P29553 and previous config saved to /var/cache/conftool/dbconfig/20220608-165242-marostegui.json
[16:52:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:47] <jayme>	 I'm around
[16:53:18] <jinxer-wm>	 (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:57:29] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet
[16:57:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:36] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet
[16:57:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:47] <icinga-wm>	 PROBLEM - Check systemd state on netflow5002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:59:10] <cdanis>	 legoktm: jayme: see #-sre
[17:01:25] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (KFrancis) @MoritzMuehlenhoff Thanks for checking in.  Because Goran is no longer an employee of WMDE, I should process a new NDA.  Would you please provide Goran's pe...
[17:02:03] <wikibugs>	 (PS1) RLazarus: shellbox: Double the replicas due to overload [deployment-charts] - https://gerrit.wikimedia.org/r/803953
[17:02:17] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:02:20] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet
[17:02:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:27] <wikibugs>	 (CR) Herron: [C: +1] shellbox: Double the replicas due to overload [deployment-charts] - https://gerrit.wikimedia.org/r/803953 (owner: RLazarus)
[17:02:49] <wikibugs>	 (CR) JMeybohm: [C: +1] shellbox: Double the replicas due to overload [deployment-charts] - https://gerrit.wikimedia.org/r/803953 (owner: RLazarus)
[17:03:15] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good" [deployment-charts] - https://gerrit.wikimedia.org/r/803953 (owner: RLazarus)
[17:03:37] <wikibugs>	 (CR) Krinkle: mediawiki: disable revalidation everywhere (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[17:04:45] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:05:28] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:06:07] <logmsgbot>	 !log dancy@deploy1002 prep aborted:  (duration: 24m 59s)
[17:06:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:25] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:06:51] <wikibugs>	 (CR) RLazarus: [C: +2] shellbox: Double the replicas due to overload [deployment-charts] - https://gerrit.wikimedia.org/r/803953 (owner: RLazarus)
[17:07:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P29554 and previous config saved to /var/cache/conftool/dbconfig/20220608-170747-marostegui.json
[17:07:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:11] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:08:21] <wikibugs>	 (PS2) Krinkle: Profiler: Use absolute namespace in Excimer flush error handler [mediawiki-config] - https://gerrit.wikimedia.org/r/803904 (https://phabricator.wikimedia.org/T310155) (owner: Lucas Werkmeister (WMDE))
[17:08:52] <wikibugs>	 (CR) Krinkle: [C: +2] Profiler: Use absolute namespace in Excimer flush error handler [mediawiki-config] - https://gerrit.wikimedia.org/r/803904 (https://phabricator.wikimedia.org/T310155) (owner: Lucas Werkmeister (WMDE))
[17:09:09] <icinga-wm>	 PROBLEM - Host ganeti3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:10:08] <wikibugs>	 (Merged) jenkins-bot: Profiler: Use absolute namespace in Excimer flush error handler [mediawiki-config] - https://gerrit.wikimedia.org/r/803904 (https://phabricator.wikimedia.org/T310155) (owner: Lucas Werkmeister (WMDE))
[17:10:12] <wikibugs>	 (Merged) jenkins-bot: shellbox: Double the replicas due to overload [deployment-charts] - https://gerrit.wikimedia.org/r/803953 (owner: RLazarus)
[17:10:49] <icinga-wm>	 RECOVERY - Host ganeti3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.73 ms
[17:11:26] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[17:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:41] <wikibugs>	 (PS1) Ahmon Dancy: scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - https://gerrit.wikimedia.org/r/803955 (https://phabricator.wikimedia.org/T237033)
[17:12:41] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[17:12:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:11] <rzl>	 !log the above "helmfile -i apply" was canceled
[17:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:20] <wikibugs>	 (PS2) Ahmon Dancy: scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - https://gerrit.wikimedia.org/r/803955 (https://phabricator.wikimedia.org/T237033)
[17:13:20] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply
[17:13:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:56] <logmsgbot>	 !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[17:13:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:39] <logmsgbot>	 !log krinkle@deploy1002 Synchronized src/Profiler.php: I534fb954c359c29a3f018eec75f62b4c4bfcd23f (duration: 03m 35s)
[17:14:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:45] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (RobH) ganeti3003 firmware updates  bios  2.2.11 to  2.14.2  nic  21.40.22.20 to  21.85.21.92  idrac  3.34.34.34 to  5.10.10.00
[17:15:23] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti3003.esams.wmnet with OS bullseye
[17:15:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:15:28] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti3003.esams.wmnet with OS bullseye
[17:17:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:17:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:34] <wikibugs>	 (PS1) Krinkle: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803956
[17:18:36] <wikibugs>	 (PS1) Krinkle: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803957
[17:19:32] <wikibugs>	 (CR) CI reject: [V: -1] Profiler: Inject 'statsd' option from PhpAutoPrepend.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803956 (owner: Krinkle)
[17:21:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:21:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:21:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:35] <wikibugs>	 (CR) Ori: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/803958 (https://phabricator.wikimedia.org/T300911) (owner: Ori)
[17:21:51] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[17:21:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:39] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[17:22:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T310011)', diff saved to https://phabricator.wikimedia.org/P29555 and previous config saved to /var/cache/conftool/dbconfig/20220608-172252-marostegui.json
[17:22:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[17:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[17:22:57] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[17:22:58] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[17:22:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[17:23:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T310011)', diff saved to https://phabricator.wikimedia.org/P29556 and previous config saved to /var/cache/conftool/dbconfig/20220608-172305-marostegui.json
[17:23:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:18] <jinxer-wm>	 (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:24:11] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@e810fc7]: Update Wikibase section
[17:24:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:20] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@e810fc7]: Update Wikibase section (duration: 00m 08s)
[17:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:44] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is CRITICAL: 12 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[17:25:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:25:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:24] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is CRITICAL: 12 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[17:25:49] <icinga-wm>	 PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 12 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[17:26:02] <James_F>	 jouncebot: nowandnext
[17:26:02] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 33 minute(s)
[17:26:03] <jouncebot>	 In 0 hour(s) and 33 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800)
[17:26:03] <jouncebot>	 In 0 hour(s) and 33 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800)
[17:26:14] <James_F>	 Cool, I'll sling out a Beta-Cluster-only one.
[17:26:17] <wikibugs>	 (CR) Jforrester: [C: +2] "Oh, oops, yes." [mediawiki-config] - https://gerrit.wikimedia.org/r/803958 (https://phabricator.wikimedia.org/T300911) (owner: Ori)
[17:26:45] <rzl>	 James_F: we're still working an issue with shellbox but I think you can proceed, as long as you're not doing anything Score-related
[17:26:55] <James_F>	 rzl: Yeah, just a `git pull` in /srv
[17:27:01] <James_F>	 Not even a scap.
[17:27:01] <wikibugs>	 (Merged) jenkins-bot: [BETA CLUSTER] Add wikifunctions to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - https://gerrit.wikimedia.org/r/803958 (https://phabricator.wikimedia.org/T300911) (owner: Ori)
[17:27:07] <James_F>	 (Done.)
[17:27:18] <icinga-wm>	 ACKNOWLEDGEMENT - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is CRITICAL: 12 snaps in the admin project Andrew Bogott nicholas is investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[17:27:18] <icinga-wm>	 ACKNOWLEDGEMENT - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is CRITICAL: 12 snaps in the admin project Andrew Bogott nicholas is investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[17:27:19] <icinga-wm>	 ACKNOWLEDGEMENT - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 12 snaps in the admin project Andrew Bogott nicholas is investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[17:27:35] <ori>	 thanks
[17:28:00] <James_F>	 Now we just have to wait for Beta Cluster's update to verify.
[17:28:18] <jinxer-wm>	 (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:28:20] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:29:49] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[17:29:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:56] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[17:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:30:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:31:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:31:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:09] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[17:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:14] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[17:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:33] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3003.esams.wmnet with reason: host reimage
[17:33:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T310011)', diff saved to https://phabricator.wikimedia.org/P29558 and previous config saved to /var/cache/conftool/dbconfig/20220608-173536-marostegui.json
[17:35:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:39] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[17:36:38] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3003.esams.wmnet with reason: host reimage
[17:36:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:37:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:48] <wikibugs>	 (CR) Dzahn: utils: Add small script to set up bundler (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803341 (owner: Jbond)
[17:37:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:38:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:38:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:51] <wikibugs>	 (PS2) Krinkle: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803956
[17:39:53] <wikibugs>	 (PS2) Krinkle: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803957
[17:43:56] <rzl>	 !log rolled back shellbox main to revision 2 on eqiad, to unstick a stuck upgrade
[17:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:52] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[17:44:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:33] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[17:45:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P29559 and previous config saved to /var/cache/conftool/dbconfig/20220608-175041-marostegui.json
[17:50:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:48] <wikibugs>	 SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (MMiller_WMF) I approve -- @KStoller-WMF needs access to these tools to analyze data as part of her product management role.
[17:52:04] <wikibugs>	 SRE, SRE-Access-Requests, Product-Analytics: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (MMiller_WMF) a:MMiller_WMF→CDanis
[17:54:12] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3003.esams.wmnet with OS bullseye
[17:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:16] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti3003.esams.wmnet with OS bullseye completed: - ganeti3003 (**PASS**)   - Downtimed on Icinga/Ale...
[17:57:46] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (RobH) a:RobH→MoritzMuehlenhoff ganeti3003 firmware updated and reimaged to bullseye (easy enough to fire the cookbook to reimage post firmware update to ensure the firmware update fixes...
[18:00:05] <jouncebot>	 dduvall and jeena: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800).
[18:00:05] <jouncebot>	 dduvall and jeena: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800).
[18:02:23] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@6b368f4]: Update more jobs to spark3
[18:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:37] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@6b368f4]: Update more jobs to spark3 (duration: 00m 13s)
[18:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:23] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:05:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P29560 and previous config saved to /var/cache/conftool/dbconfig/20220608-180546-marostegui.json
[18:05:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:31] <wikibugs>	 (CR) Krinkle: [C: -1] Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: Stang)
[18:12:33] <wikibugs>	 (PS1) Dduvall: group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/803967 (https://phabricator.wikimedia.org/T308068)
[18:12:35] <wikibugs>	 (CR) Dduvall: [C: +2] group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/803967 (https://phabricator.wikimedia.org/T308068) (owner: Dduvall)
[18:13:35] <wikibugs>	 (CR) Dzahn: [C: +2] sre: update renamed otrs role to vrts [puppet] - https://gerrit.wikimedia.org/r/802579 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[18:13:40] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/803967 (https://phabricator.wikimedia.org/T308068) (owner: Dduvall)
[18:13:43] <wikibugs>	 (CR) Dzahn: [C: +2] "ACK, thanks!" [puppet] - https://gerrit.wikimedia.org/r/802579 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[18:15:18] <wikibugs>	 (CR) Dzahn: [C: +2] vrts: delete idle_agent_report [puppet] - https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[18:15:25] <wikibugs>	 (PS3) Dzahn: vrts: delete idle_agent_report [puppet] - https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942)
[18:17:01] <wikibugs>	 (CR) Dzahn: "ACK, thanks!" [puppet] - https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[18:17:27] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.15  refs T308068
[18:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:31] <stashbot>	 T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068
[18:19:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:19:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:26] <wikibugs>	 (CR) RLazarus: "Two questions:" [puppet] - https://gerrit.wikimedia.org/r/803560 (owner: Jbond)
[18:20:51] <wikibugs>	 (PS1) MewOphaswongse: Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926)
[18:20:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T310011)', diff saved to https://phabricator.wikimedia.org/P29561 and previous config saved to /var/cache/conftool/dbconfig/20220608-182051-marostegui.json
[18:20:52] <logmsgbot>	 !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.15  refs T308068 (duration: 03m 25s)
[18:20:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[18:20:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[18:20:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:56] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[18:20:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:29] <wikibugs>	 SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (RobH) Entered https://wikimedia.coupahost.com/easy_form_responses/3234 into coupa for this work, Jin will coordinate with Arzhel via email and hangout for the actual work window.
[18:21:59] <wikibugs>	 (CR) Eevans: [C: +1] "For posterity sake: This has now been fixed." [puppet] - https://gerrit.wikimedia.org/r/803903 (https://phabricator.wikimedia.org/T309896) (owner: Eevans)
[18:22:40] <dduvall>	 see a very large spike of errors on jsonTruncated channel
[18:22:44] <dduvall>	 rolling back
[18:23:21] <dduvall>	 also a handful of db errors for wikinews sites, "Error 1146: Table 'enwikinews.`categorylinks`' doesn't exist"
[18:23:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:23:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:23:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:36] <wikibugs>	 (PS1) MewOphaswongse: Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.14) - https://gerrit.wikimedia.org/r/803971 (https://phabricator.wikimedia.org/T309926)
[18:24:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:24:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:10] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.39.0-wmf.15"
[18:28:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[18:32:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[18:32:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:27] <wikibugs>	 (PS1) Dduvall: Revert "group1 wikis to 1.39.0-wmf.15" [mediawiki-config] - https://gerrit.wikimedia.org/r/803972
[18:32:29] <wikibugs>	 (CR) Dduvall: [C: +2] Revert "group1 wikis to 1.39.0-wmf.15" [mediawiki-config] - https://gerrit.wikimedia.org/r/803972 (owner: Dduvall)
[18:33:47] <wikibugs>	 (Merged) jenkins-bot: Revert "group1 wikis to 1.39.0-wmf.15" [mediawiki-config] - https://gerrit.wikimedia.org/r/803972 (owner: Dduvall)
[18:34:08] <wikibugs>	 (PS1) Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - https://gerrit.wikimedia.org/r/803973
[18:35:49] <wikibugs>	 (CR) CI reject: [V: -1] airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - https://gerrit.wikimedia.org/r/803973 (owner: Mforns)
[18:38:11] <urandom>	 !log upgrading aqs1010.eqiad.wmnet to Cassandra 3.11.13 (canary) -- T309896
[18:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:16] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[18:38:25] <wikibugs>	 (PS2) Mforns: airflow:manifests:instance.pp: Bump up number of DAG processors [puppet] - https://gerrit.wikimedia.org/r/803973
[18:39:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:39:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:40:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:40:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:40:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:30] <wikibugs>	 SRE-OnFire, Wikidata, wdwb-tech, Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (Addshore) >>! In T238751#7988593, @ItamarWMDE wrote: > @Addshore Does it mean we need to then de-abandon that change...
[18:41:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:41:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:58] <wikibugs>	 (CR) Joal: [C: +1] "Thanks mforns" [puppet] - https://gerrit.wikimedia.org/r/803973 (owner: Mforns)
[18:45:48] <wikibugs>	 (CR) Jeena Huneidi: "This looks good to me but before merging we need to update to the newest version of ruamel on the integration agents. It's installed via p" [deployment-charts] - https://gerrit.wikimedia.org/r/803886 (https://phabricator.wikimedia.org/T310133) (owner: Samtar)
[18:48:47] <wikibugs>	 (CR) CI reject: [V: -1] Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.14) - https://gerrit.wikimedia.org/r/803971 (https://phabricator.wikimedia.org/T309926) (owner: MewOphaswongse)
[18:49:24] <wikibugs>	 (CR) CI reject: [V: -1] Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) (owner: MewOphaswongse)
[18:57:56] <urandom>	 is there anyone handy that understands our logstash configuration?
[18:58:45] <sukhe>	 herron, in case you are around ^
[18:58:57] <sukhe>	 (asking you since it says you are on-call; apologies if not)
[18:59:07] <herron>	 hey
[18:59:22] <urandom>	 heya
[18:59:51] <urandom>	 I just upgraded one Cassandra node, and now instead of the hostname in the logs it says %{HOSTNAME}
[19:00:09] <urandom>	 I'm sure something must have changed on the Cassandra end, but I noticed that this is the result of a filter: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/logstash/filters/20-filter_logback.conf
[19:01:06] <urandom>	 do you know how that filter works, or what it's (supposed to be) doing?
[19:01:28] <urandom>	 (looking at line #8)
[19:02:20] <urandom>	 https://logstash.wikimedia.org/goto/4ac95436ebea812c70a1b70cfd5338e4 is the upgraded node, and apparently the only one doing this...
[19:03:56] <herron>	 I'm guessing that the hostname is no longer being parsed out successfully from the log, so that mutate is replacing host with nothing essentially
[19:05:25] <urandom>	 is there any easy way of seeing what is *actually* being sent?
[19:05:54] <urandom>	 emphasis on "easy" :)
[19:07:07] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:08:27] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet
[19:08:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:59] <herron>	 urandom: which host was upgraded?
[19:09:10] <herron>	 aqs1010?
[19:09:14] <urandom>	 yes
[19:11:50] <herron>	 alright, yeah I think we can do something there to get at the raw logs 1 min
[19:14:09] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet
[19:14:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:16] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet
[19:14:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:26] <wikibugs>	 (CR) Dzahn: [C: +2] "safe enough since it just affects 'eqiad1.wikimedia.cloud]'" [puppet] - https://gerrit.wikimedia.org/r/803955 (https://phabricator.wikimedia.org/T237033) (owner: Ahmon Dancy)
[19:19:58] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet
[19:20:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:03] <herron>	 urandom: I added a temporary shim that should output raw logs to /tmp/logback_debug.log
[19:20:04] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet
[19:20:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:24] <urandom>	 herron: where is that?
[19:20:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[19:20:59] <urandom>	 herron: should I induce some output?
[19:21:04] <herron>	 aqs1010:/tmp/logback_debug.log  which is output by rsyslog on the host
[19:21:07] <herron>	 yes please
[19:22:04] <urandom>	 herron: there are a few
[19:22:18] <urandom>	 herron: and now it's likely to get really chatty
[19:23:22] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet
[19:23:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:28] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet
[19:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:15] <urandom>	 herron: OK, so 'host' (I guess that was the old property?) wasn't renamed to something, it's just gone altogether
[19:25:58] <herron>	 yeah looks like HOSTNAME is missing from the source logs
[19:27:54] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet
[19:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:01] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2055.codfw.wmnet
[19:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:33] <icinga-wm>	 PROBLEM - Apache HTTP on mw1415 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 974 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:44] <wikibugs>	 (CR) Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) (owner: MewOphaswongse)
[19:32:16] <wikibugs>	 (CR) Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.14) - https://gerrit.wikimedia.org/r/803971 (https://phabricator.wikimedia.org/T309926) (owner: MewOphaswongse)
[19:32:46] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2055.codfw.wmnet
[19:32:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:24] <wikibugs>	 (CR) Dzahn: [C: +1] "regarding that single commit from the commit message: I am not sure who the unknown author was but PS1 and PS2 seem to be identical. but w" [puppet] - https://gerrit.wikimedia.org/r/803944 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[19:34:02] <mutante>	 jouncebot: now
[19:34:02] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T1800)
[19:34:31] <mutante>	 deployers: that 500 Internal server error is from a canary. careful
[19:35:20] <dancy>	 hmm
[19:38:37] <herron>	 urandom: logback_debug.log is looking better now.  did the upgrade clobber logback custom fields config or something?
[19:39:14] <urandom>	 herron: nope, I live-hacked a test fix
[19:39:18] <wikibugs>	 (CR) Hokwelum: [C: +1] "Ariel and I tested this and it looks good.." [puppet] - https://gerrit.wikimedia.org/r/802921 (https://phabricator.wikimedia.org/T301104) (owner: Mitar)
[19:39:52] <urandom>	 herron: I set HOSTNAME as a custom field (using logback's ${HOSTNAME})
[19:40:06] <herron>	 ah gotcha, so that's new
[19:40:49] <urandom>	 I'm going to codify that in a changeset so that I don't have to rollback mid-upgrade, but it's probably a work-around
[19:41:15] <herron>	 urandom: cool sounds good, yeah seems to be working well enough
[19:41:46] <urandom>	 upstream Cassandra upgraded logback from 1.1.3 to 1.2.9 (a huge jump), and I'm not even sure we're using the "right" appender anymore
[19:42:08] <urandom>	 herron: I'll probably prod you for a code review here in a bit :)
[19:42:18] <herron>	 I'll leave that logback_debug.log config in place, the next puppet run will undo it
[19:42:35] <herron>	 urandom: ok will keep an eye out for it
[19:42:51] <wikibugs>	 (PS1) Ahmon Dancy: Revert "scap.cfg.erb: Define php_fpm restart settings for beta cluster" [puppet] - https://gerrit.wikimedia.org/r/803908
[19:45:07] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] Revert "scap.cfg.erb: Define php_fpm restart settings for beta cluster" [puppet] - https://gerrit.wikimedia.org/r/803908 (owner: Ahmon Dancy)
[19:46:16] <wikibugs>	 (PS1) Eevans: Set HOSTNAME as a custom Cassandra logback field [puppet] - https://gerrit.wikimedia.org/r/803978 (https://phabricator.wikimedia.org/T309896)
[19:46:57] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1037.eqiad.wmnet
[19:46:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:05] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1415 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 871 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[19:49:16] <wikibugs>	 (CR) Eevans: [C: +1] "PPC output: https://puppet-compiler.wmflabs.org/pcc-worker1002/35800/" [puppet] - https://gerrit.wikimedia.org/r/803978 (https://phabricator.wikimedia.org/T309896) (owner: Eevans)
[19:49:25] <wikibugs>	 (CR) Herron: [C: +1] Set HOSTNAME as a custom Cassandra logback field [puppet] - https://gerrit.wikimedia.org/r/803978 (https://phabricator.wikimedia.org/T309896) (owner: Eevans)
[19:50:09] <urandom>	 herron: the old config was a symlink, so I had to break that and use a copy of the target (so the diff is large).  The puppet compiler output shows the actual (tiny) change.
[19:51:30] <herron>	 urandom: got it, lgtm!
[19:51:33] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1037.eqiad.wmnet
[19:51:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:14] <herron>	 urandom: want me to merge this?
[19:52:21] <urandom>	 yes please!
[19:52:28] <herron>	 kk doing
[19:52:39] <wikibugs>	 (CR) Herron: [C: +2] Set HOSTNAME as a custom Cassandra logback field [puppet] - https://gerrit.wikimedia.org/r/803978 (https://phabricator.wikimedia.org/T309896) (owner: Eevans)
[19:52:49] <urandom>	 herron: thanks for all your help!
[19:53:46] <herron>	 urandom: any time!  ready for puppet to run on aqs1010 now?
[19:54:01] <urandom>	 sure (although that I actually can do) :)
[19:54:24] <herron>	 ah even better, I will leave you to it then!
[19:54:43] <urandom>	 for hysterical raisins I have root on those clusters
[19:55:21] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1415 is CRITICAL: Host mw1415 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:55:34] <urandom>	 herron: I see you have it disable, and assume it's OK to reenable?
[19:55:41] <urandom>	 *disabled
[19:56:22] <herron>	 yup, was just disabled to avoid clobbering the rsyslog logback_debug.log hack, ready to re-enable
[19:58:51] <urandom>	 !log restarting Cassandra, aqs1010-{a,b}, to apply logback work-around --  T309896
[19:58:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:55] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220608T2000).
[20:00:05] <jouncebot>	 mewoph: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:19] <urbanecm>	 hi mewoph! I can deploy today.
[20:00:48] <mewoph>	 thanks! we have some unrelated failing tests  :(
[20:01:20] <urbanecm>	 mewoph: was just going to mention that. do we know why they fail?
[20:01:30] <wikibugs>	 (PS1) Krinkle: Profiler: Remove unused mongodb 'xhgui' option [mediawiki-config] - https://gerrit.wikimedia.org/r/803979 (https://phabricator.wikimedia.org/T180761)
[20:04:36] <mewoph>	 I see that the alt text is missing in the string comparison in the failing test, but it's from ParserIntegrationTest so that's most likely not related to GrowthExperiments change
[20:05:14] <dancy>	 mutante: I filed https://phabricator.wikimedia.org/T310225
[20:06:02] <urbanecm>	 mewoph: do you know whether this is also an issue on master, or just in wmf.XX? (didn't do anything in GE today, so not sure myself)
[20:08:21] <mutante>	 dancy: thanks! ack
[20:10:19] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: canary curator fork on codfw [puppet] - https://gerrit.wikimedia.org/r/803586 (https://phabricator.wikimedia.org/T301017) (owner: Cwhite)
[20:13:26] <urbanecm>	 mewoph: ping re my above message :)
[20:13:32] <mewoph>	 sorry i don't know
[20:13:45] <urbanecm>	 okay
[20:14:23] <mewoph>	 we might need to re-schedule this backport :(
[20:14:35] <urbanecm>	 yeah, I don't really want to overrule CI without knowing why it fails :/
[20:17:17] <kostajh>	 urbanecm: it fails because parser integration tests periodically go out of date
[20:18:04] <kostajh>	 urbanecm: see e.g. T265024
[20:18:07] <stashbot>	 T265024: Parser tests are broken for GrowthExperiments - https://phabricator.wikimedia.org/T265024
[20:18:31] <kostajh>	 oops, T302964 is a better reference
[20:18:31] <stashbot>	 T302964: ParserIntegrationTest::testParse with data set "parserTests.txt: Media link with nested wikilinks" ('legacy parser') - https://phabricator.wikimedia.org/T302964
[20:18:41] <kostajh>	 specifically https://phabricator.wikimedia.org/T302964#7750242
[20:20:04] <wikibugs>	 (CR) Herron: [C: -2] Add role::netmon to the netmon1003 instance. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[20:22:45] <kostajh>	 urbanecm: https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/803954 would need to be backported to wmf.14 and wmf.15, I think
[20:23:06] <kostajh>	 but anyway, I think force merging is fine as this issue is unrelated.
[20:29:40] <wikibugs>	 SRE, SRE-Access-Requests, Research: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (CDanis) @leila I saw on the Research project page you linked that the project lasts through August, so I set an `expiry_date` of Sept 1st 2022 in my patc...
[20:31:09] <wikibugs>	 (PS1) CDanis: bscarone: shell/analytics/krb access [puppet] - https://gerrit.wikimedia.org/r/803982 (https://phabricator.wikimedia.org/T310021)
[20:33:08] <dduvall>	 !log rolling back group0 as well due to T310214
[20:33:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:13] <stashbot>	 T310214: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'enwikinews.`categorylinks`' doesn't exist - https://phabricator.wikimedia.org/T310214
[20:33:52] <wikibugs>	 (CR) CDanis: [C: +2] bscarone: shell/analytics/krb access [puppet] - https://gerrit.wikimedia.org/r/803982 (https://phabricator.wikimedia.org/T310021) (owner: CDanis)
[20:35:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:35:46] <wikibugs>	 (CR) Arlolra: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.14) - https://gerrit.wikimedia.org/r/803971 (https://phabricator.wikimedia.org/T309926) (owner: MewOphaswongse)
[20:35:59] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.39.0-wmf.15"
[20:36:00] <wikibugs>	 (CR) Arlolra: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) (owner: MewOphaswongse)
[20:36:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:53] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:36:54] <wikibugs>	 (PS1) Dduvall: Revert "group0 wikis to 1.39.0-wmf.15" [mediawiki-config] - https://gerrit.wikimedia.org/r/803983
[20:36:56] <wikibugs>	 (CR) Dduvall: [C: +2] Revert "group0 wikis to 1.39.0-wmf.15" [mediawiki-config] - https://gerrit.wikimedia.org/r/803983 (owner: Dduvall)
[20:37:43] <wikibugs>	 (Merged) jenkins-bot: Revert "group0 wikis to 1.39.0-wmf.15" [mediawiki-config] - https://gerrit.wikimedia.org/r/803983 (owner: Dduvall)
[20:37:48] <wikibugs>	 SRE, SRE-Access-Requests, Research, Patch-For-Review: Requesting access to analytics-privatedata-users for Bruno Scarone - https://phabricator.wikimedia.org/T310021 (CDanis) Open→Resolved @bscarone you should now be able to use that SSH key to access production per the shell access instru...
[20:41:52] <wikibugs>	 (PS1) CDanis: kstoller analytics access [puppet] - https://gerrit.wikimedia.org/r/803984 (https://phabricator.wikimedia.org/T310002)
[20:42:27] <Krinkle>	 !log krinkle@mw1415: Run `scap pull` manually ref T310225
[20:42:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:31] <stashbot>	 T310225: mw1415 fatals due to serving responses from 1.39.0-wmf.10 (was DBQueryError: Unknown column page_restrictions) - https://phabricator.wikimedia.org/T310225
[20:42:39] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1064 - https://phabricator.wikimedia.org/T310181 (wiki_willy) a:Cmjohnson
[20:42:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:43:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:01] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1064 - https://phabricator.wikimedia.org/T310160 (wiki_willy) a:Cmjohnson
[20:43:19] <wikibugs>	 SRE, ops-eqiad: Failed PSU on ganeti1023 - https://phabricator.wikimedia.org/T310041 (wiki_willy) a:Jclark-ctr
[20:43:36] <wikibugs>	 (CR) CDanis: [C: +2] kstoller analytics access [puppet] - https://gerrit.wikimedia.org/r/803984 (https://phabricator.wikimedia.org/T310002) (owner: CDanis)
[20:43:47] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (wiki_willy) a:Cmjohnson
[20:43:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:43:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:43:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:44:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:53] <wikibugs>	 SRE, SRE-Access-Requests, Product-Analytics, Patch-For-Review: Requesting access to Superset & Turnilo for kstoller - https://phabricator.wikimedia.org/T310002 (CDanis) Open→Resolved Access should be live within 30 minutes!  Please re-open if you have any trouble.
[20:47:54] <wikibugs>	 SRE, LDAP-Access-Requests, Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (leila) approved. (and the expiration date for access can be set to September 1, 2022.) Thanks!
[20:52:29] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet
[20:52:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:58:27] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet
[20:58:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:08] <mewoph>	 urbanecm: the wmf14 patch tests are passing again, right at the end of the backport window :/ should we re-schedule or is it still ok to backport now?
[21:01:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:03:07] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [8
[21:03:07] <icinga-wm>	 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[21:04:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:07:45] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given sourc
[21:07:45] <icinga-wm>	 ns) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[21:09:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:10:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:19] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:14:23] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:14:23] <icinga-wm>	 RECOVERY - Apache HTTP on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:14:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:15:43] <wikibugs>	 SRE, ops-eqiad, serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (Dzahn) 21:13 < mutante> !log mw1415 - scap pull, restart apache, /usr/local/sbin/restart-php7.2-fpm (INFO: The server is depooled from all services. Restarting the service directly)
[21:16:12] <wikibugs>	 (PS1) Jdlrobson: [beta cluster] Enable VectorTitleAboveTabs [mediawiki-config] - https://gerrit.wikimedia.org/r/803988 (https://phabricator.wikimedia.org/T309398)
[21:17:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:17:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:13] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1415.eqiad.wmnet
[21:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:20:37] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1415 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[21:23:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:23:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:39] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:28:26] <cjming>	 ok if i do a quick labs deploy?
[21:32:21] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:33:03] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:34:35] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48250 bytes in 4.137 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:35:15] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:40:12] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw1415.eqiad.wmnet
[21:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:10] <mutante>	 !log repooled mw1415 after restarting apache and php-fpm, seeing all Icinga alerts recover etc  T307755 T310225
[21:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:16] <stashbot>	 T307755: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755
[21:41:16] <stashbot>	 T310225: mw1415 fatals due to serving responses from 1.39.0-wmf.10 (was DBQueryError: Unknown column page_restrictions) - https://phabricator.wikimedia.org/T310225
[21:45:29] <wikibugs>	 SRE, ops-eqiad, serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (Dzahn) This caused T310225 because setting it to pooled=inactive does not mean monitoring will stop checking it and when this came back unexpectedly it caused new alerts for 500s on...
[21:46:19] <wikibugs>	 SRE, ops-eqiad, serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (Dzahn) In progress→Resolved a:Dzahn
[21:47:10] <wikibugs>	 (CR) Clare Ming: [C: +2] [beta cluster] Enable VectorTitleAboveTabs [mediawiki-config] - https://gerrit.wikimedia.org/r/803988 (https://phabricator.wikimedia.org/T309398) (owner: Jdlrobson)
[21:48:07] <wikibugs>	 (Merged) jenkins-bot: [beta cluster] Enable VectorTitleAboveTabs [mediawiki-config] - https://gerrit.wikimedia.org/r/803988 (https://phabricator.wikimedia.org/T309398) (owner: Jdlrobson)
[21:52:28] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:803988|[beta cluster] Enable VectorTitleAboveTabs (T309398)]] (duration: 03m 32s)
[21:52:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:32] <stashbot>	 T309398: Toolbar styling - https://phabricator.wikimedia.org/T309398
[21:53:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:53:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (Jclark-ctr)
[22:00:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:00:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:00:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:23] <wikibugs>	 (PS1) Bking: elastic: increment BUILD_VERSION [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/803996 (https://phabricator.wikimedia.org/T309648)
[22:10:59] <wikibugs>	 (CR) Ebernhardson: [C: +1] elastic: increment BUILD_VERSION [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/803996 (https://phabricator.wikimedia.org/T309648) (owner: Bking)
[22:11:57] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:15:25] <wikibugs>	 (CR) Bking: [C: +2] elastic: increment BUILD_VERSION [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/803996 (https://phabricator.wikimedia.org/T309648) (owner: Bking)
[22:24:56] <urbanecm>	 mewoph: kostajh: sorry, i was afk as i thought we don't know why it is not failing. let's do it tomorrow please :)
[22:25:17] <urbanecm>	 *why it _is_ failing
[22:25:32] <wikibugs>	 (PS1) Krinkle: rdbms: move mysql isQuotedIdentifier() override to SQLPlatform [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803909 (https://phabricator.wikimedia.org/T310214)
[22:29:59] <mewoph>	 urbanecm: thanks! i added it to tomorrow's window
[22:30:04] <urbanecm>	 thanks!
[22:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:32:43] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:39:23] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:42:07] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:42:07] <wikibugs>	 (CR) Tim Starling: "> maybe a broader solution is needed" [puppet] - https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: Tim Starling)
[22:44:21] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:46:17] <wikibugs>	 (PS1) Ryan Kemper: Bump changelog for custom repository-s3 snapshot [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/804003 (https://phabricator.wikimedia.org/T309648)
[22:46:28] <wikibugs>	 (PS1) Ebernhardson: Add a check that deb is unreleased in prepare_commit [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/804004
[22:47:23] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] Bump changelog for custom repository-s3 snapshot [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/804003 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper)
[22:54:49] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 43.22 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:54:51] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 27.25 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:57:07] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 79.27 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:57:11] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[22:57:12] <ryankemper>	 ^ I would expect these to heal soon given that the traffic looks normal again
[22:57:16] <ryankemper>	 Oh, just beat me to it
[22:58:13] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (Andrew) a:Cmjohnson→Andrew I think this should be assigned to me, to put the new hosts into service. That's currently blocked by a...
[23:01:08] <wikibugs>	 (CR) Ryan Kemper: "Great idea. See inline for one small question" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/804004 (owner: Ebernhardson)
[23:02:49] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:06:44] <Seddon>	 anyone getting dns issues? or just me?
[23:07:26] <wikibugs>	 (CR) Ryan Kemper: [C: +1] Add a check that deb is unreleased in prepare_commit (1 comment) [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/804004 (owner: Ebernhardson)
[23:07:38] <wikibugs>	 (PS2) Ryan Kemper: Add a check that deb is unreleased in prepare_commit [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/804004 (owner: Ebernhardson)
[23:08:40] <ryankemper>	 !log T309648 Built `wmf-elasticsearch-search-plugins_6.8.23-3` (https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/804003) following steps in https://phabricator.wikimedia.org/P19522. Result: https://apt.wikimedia.org/wikimedia/pool/component/elastic68/w/wmf-elasticsearch-search-plugins/
[23:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:45] <stashbot>	 T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
[23:08:59] <rzl>	 Seddon: it wouldn't be Virgin Media would it? not just you but it looks like your ISP
[23:09:24] <Seddon>	 rzl, possibly yes
[23:11:08] <Seddon>	 rzl: I switched my mac to google dns and it all resolved
[23:11:21] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648
[23:11:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:10] <rzl>	 Seddon: good to hear -- browser reports from country=GB are trending down too, so it might have just been good timing
[23:12:24] <rzl>	 either way, appreciate the report! nothing for us to do as it turned out, but there might be next time
[23:12:43] <wikibugs>	 (Abandoned) Ryan Kemper: elasticsearch: add ANSI color codes [cookbooks] - https://gerrit.wikimedia.org/r/800244 (https://phabricator.wikimedia.org/T308606) (owner: Ryan Kemper)
[23:15:30] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648
[23:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:35] <stashbot>	 T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
[23:15:49] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:25:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:25:53] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat
[23:25:53] <icinga-wm>	 ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX