[00:00:05] <jouncebot>	 brennen and thcipriani: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T0000).
[00:01:33] <icinga-wm>	 RECOVERY - Check systemd state on miscweb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:49] <icinga-wm>	 RECOVERY - Check systemd state on miscweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:08] <wikibugs>	 (CR) Krinkle: [C: +2] rdbms: move mysql isQuotedIdentifier() override to SQLPlatform [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803909 (https://phabricator.wikimedia.org/T310214) (owner: Krinkle)
[00:09:15] * Krinkle staging on  mwdebug1002
[00:10:37] <wikibugs>	 (CR) Krinkle: [C: +2] Profiler: Inject 'statsd' option from PhpAutoPrepend.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803956 (owner: Krinkle)
[00:11:40] <wikibugs>	 (Merged) jenkins-bot: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803956 (owner: Krinkle)
[00:12:49] <wikibugs>	 (CR) Krinkle: [C: +2] Profiler: Inject 'statsd' option from PhpAutoPrepend.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803957 (owner: Krinkle)
[00:14:03] <wikibugs>	 (Merged) jenkins-bot: Profiler: Inject 'statsd' option from PhpAutoPrepend.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/803957 (owner: Krinkle)
[00:15:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:06] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/PhpAutoPrepend.php: I5810472ae (duration: 03m 20s)
[00:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:39] <wikibugs>	 (PS2) Krinkle: Profiler: Remove unused mongodb 'xhgui' option [mediawiki-config] - https://gerrit.wikimedia.org/r/803979 (https://phabricator.wikimedia.org/T180761)
[00:17:43] <wikibugs>	 (CR) Krinkle: [C: +2] Profiler: Remove unused mongodb 'xhgui' option [mediawiki-config] - https://gerrit.wikimedia.org/r/803979 (https://phabricator.wikimedia.org/T180761) (owner: Krinkle)
[00:19:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:19:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:26] <wikibugs>	 (Merged) jenkins-bot: Profiler: Remove unused mongodb 'xhgui' option [mediawiki-config] - https://gerrit.wikimedia.org/r/803979 (https://phabricator.wikimedia.org/T180761) (owner: Krinkle)
[00:21:03] <logmsgbot>	 !log krinkle@deploy1002 Synchronized src/Profiler.php: I14ebd2e93ad (duration: 03m 31s)
[00:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[00:23:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:17] <wikibugs>	 (Merged) jenkins-bot: rdbms: move mysql isQuotedIdentifier() override to SQLPlatform [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803909 (https://phabricator.wikimedia.org/T310214) (owner: Krinkle)
[00:27:26] <wikibugs>	 (PS1) Cwhite: logstash: truncate labels.normalized_message [puppet] - https://gerrit.wikimedia.org/r/804010 (https://phabricator.wikimedia.org/T234565)
[00:28:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:29:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:29:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[00:30:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:14] <logmsgbot>	 !log krinkle@deploy1002 Synchronized src/Profiler.php: I43a9e838c287 (1/4) (duration: 03m 32s)
[00:30:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:34:22] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/PhpAutoPrepend.php: I43a9e838c28745906 (2/4) (duration: 03m 37s)
[00:34:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[00:35:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:40] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/: I43a9e838c28745906 Labs+ProductionServices (3+4/4) (duration: 03m 36s)
[00:38:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[00:39:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[00:39:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[00:42:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:51] <logmsgbot>	 !log krinkle@deploy1002 Synchronized php-1.39.0-wmf.15/includes/libs/rdbms/: I99b817b3d50ffcdf56, T310214 (duration: 03m 23s)
[00:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:54] <stashbot>	 T310214: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'enwikinews.`categorylinks`' doesn't exist - https://phabricator.wikimedia.org/T310214
[00:51:47] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:52:57] <wikibugs>	 SRE, Observability-Alerting: Aggregate check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (Krinkle) Ref {T310225}.  Ref <https://wikitech.wikimedia.org/w/index.php?title=Monitoring%2Fcheck_dsh_groups&diffonly=0&diff=1987914&oldid=1834094#Inactive_servers>....
[00:53:55] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.647 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:04:29] <wikibugs>	 (PS2) Krinkle: multiversion: Simplify code and improve documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/785308
[01:05:21] <wikibugs>	 (CR) Krinkle: "We can probably cut a fair bit of this down in future patches, but this is mostly a dump of my prior knowledge and whatever I could find i" [mediawiki-config] - https://gerrit.wikimedia.org/r/785308 (owner: Krinkle)
[01:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:20:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:33:12] <wikibugs>	 (PS1) Catrope: Remove unused setting wgQuickSurveysUseVue [mediawiki-config] - https://gerrit.wikimedia.org/r/804014 (https://phabricator.wikimedia.org/T285890)
[01:51:07] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:53:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:50:33] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:50:51] <wikibugs>	 (PS1) Samwilson: [beta cluster] Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - https://gerrit.wikimedia.org/r/804017 (https://phabricator.wikimedia.org/T307725)
[03:05:18] <wikibugs>	 (PS5) Tim Starling: Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (https://phabricator.wikimedia.org/T212129)
[03:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:38:11] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:38:45] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:44:49] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:53:39] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2022-06-08-124326-production [deployment-charts] - https://gerrit.wikimedia.org/r/803901 (https://phabricator.wikimedia.org/T306995)
[04:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:34:41] <wikibugs>	 (PS1) DLynch: Sync sampling rates at 9 wikis DiscussionTools is testing [mediawiki-config] - https://gerrit.wikimedia.org/r/804022 (https://phabricator.wikimedia.org/T309260)
[04:45:52] * kart_ deploying cxserver..
[04:46:09] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2022-06-08-124326-production [deployment-charts] - https://gerrit.wikimedia.org/r/803901 (https://phabricator.wikimedia.org/T306995) (owner: KartikMistry)
[04:49:18] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2022-06-08-124326-production [deployment-charts] - https://gerrit.wikimedia.org/r/803901 (https://phabricator.wikimedia.org/T306995) (owner: KartikMistry)
[04:50:02] <wikibugs>	 (PS1) Tim Starling: Switch wgMainStash back to Redis [mediawiki-config] - https://gerrit.wikimedia.org/r/804024 (https://phabricator.wikimedia.org/T212129)
[04:54:08] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[04:54:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:45] <kart_>	 .. and staging upgrade seems stuck..
[05:03:19] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:04:15] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:04:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:06:35] <icinga-wm>	 PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:08:53] <icinga-wm>	 RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:09:34] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:09:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:55] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (tstarling) Mark asked me to prepare a rollback plan which can be used to switch back to Redis if something goes w...
[05:12:11] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:13:01] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:03] <kart_>	 https://phabricator.wikimedia.org/P29564 is log of failure, looks like timeout. Will surely need service SREs to debug further.
[05:19:47] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:01] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:23:31] <icinga-wm>	 PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:24:39] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:07] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:32:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[05:32:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[05:32:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T310011)', diff saved to https://phabricator.wikimedia.org/P29565 and previous config saved to /var/cache/conftool/dbconfig/20220609-053253-marostegui.json
[05:32:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:57] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[05:36:59] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:41:15] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:43:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T310011)', diff saved to https://phabricator.wikimedia.org/P29566 and previous config saved to /var/cache/conftool/dbconfig/20220609-054306-marostegui.json
[05:43:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:10] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[05:58:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29567 and previous config saved to /var/cache/conftool/dbconfig/20220609-055811-marostegui.json
[05:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T0600).
[06:02:21] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for
[06:02:21] <icinga-wm>	 given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[06:04:29] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:07:15] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:08:39] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:10:59] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:13:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29568 and previous config saved to /var/cache/conftool/dbconfig/20220609-061316-marostegui.json
[06:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:43] <wikibugs>	 (PS1) Marostegui: db2087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/804190
[06:15:10] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) Thank you Tim! I will bring this up on our Team meeting on Monday.
[06:16:29] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:17:16] <wikibugs>	 (CR) Marostegui: [C: +2] db2087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/804190 (owner: Marostegui)
[06:18:47] <icinga-wm>	 RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:11] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:24:53] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:27:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:28:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T310011)', diff saved to https://phabricator.wikimedia.org/P29569 and previous config saved to /var/cache/conftool/dbconfig/20220609-062821-marostegui.json
[06:28:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[06:28:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[06:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:27] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T310011)', diff saved to https://phabricator.wikimedia.org/P29570 and previous config saved to /var/cache/conftool/dbconfig/20220609-062829-marostegui.json
[06:28:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:57] <wikibugs>	 (CR) Muehlenhoff: [C: +2] wikistats: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/803944 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[06:30:03] <wikibugs>	 (PS2) Muehlenhoff: wikistats: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/803944 (https://phabricator.wikimedia.org/T308013)
[06:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:34:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T310011)', diff saved to https://phabricator.wikimedia.org/P29571 and previous config saved to /var/cache/conftool/dbconfig/20220609-063443-marostegui.json
[06:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:48] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:42:33] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:49:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P29572 and previous config saved to /var/cache/conftool/dbconfig/20220609-064948-marostegui.json
[06:49:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3003.esams.wmnet
[06:55:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:05] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[06:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:34] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[06:59:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:51] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, Patch-For-Review: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) - https://phabricator.wikimedia.org/T209182 (akosiaris) Hi everyone,  Since the last comment is from 2 years ago from a person no longer with t...
[07:00:04] <jouncebot>	 Amir1 and apergos: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T0700).
[07:00:04] <jouncebot>	 samwilson: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:16] <apergos>	 good morning!
[07:00:19] <apergos>	 we have a trainee signed up
[07:01:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3003.esams.wmnet
[07:01:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:55] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb={CREATE,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:02:17] <apergos>	 hey samwilson: you about? 
[07:02:41] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[07:02:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P29573 and previous config saved to /var/cache/conftool/dbconfig/20220609-070453-marostegui.json
[07:04:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:48] <samwilson>	 apergos: yep, here now! sorry, was eating an apple.
[07:06:26] <apergos>	 we have a trainee today
[07:06:32] <apergos>	 I forget if you do self deploys or not
[07:07:04] <samwilson>	 no, very happy for you to do it, or whoever's learning :)
[07:07:05] <apergos>	 samwilson: 
[07:07:08] <apergos>	 ok!
[07:07:43] <apergos>	 Amir1: do you happen to be about? I'd prefer not to train and deploy at the same time, though if you can't be here, that's ok
[07:10:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3003.esams.wmnet to ganeti01.svc.esams.wmnet
[07:10:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:19] <apergos>	 I'm going to assume Amir is not here and we'll proceed. this will be slower than normal because I'm talking through the procedure with our trainee :-)
[07:11:42] <samwilson>	 no worries. I'm around for the next hours.
[07:12:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:12:03] <moritzm>	 apergos: Amir1 is on vacation 
[07:12:10] <apergos>	 ok. no worries, thanks for the info
[07:12:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti3003.esams.wmnet to ganeti01.svc.esams.wmnet
[07:12:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:46] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:12:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:32] <moritzm>	 !log drain ganeti3002 for firmware update/reimage T308238
[07:13:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:36] <stashbot>	 T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238
[07:18:31] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+2] [beta cluster] Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804017 (https://phabricator.wikimedia.org/T307725) (owner: 10Samwilson)
[07:19:18] <wikibugs>	 (03Merged) 10jenkins-bot: [beta cluster] Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804017 (https://phabricator.wikimedia.org/T307725) (owner: 10Samwilson)
[07:19:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T310011)', diff saved to https://phabricator.wikimedia.org/P29574 and previous config saved to /var/cache/conftool/dbconfig/20220609-071958-marostegui.json
[07:20:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:20:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[07:20:03] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[07:20:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T310011)', diff saved to https://phabricator.wikimedia.org/P29575 and previous config saved to /var/cache/conftool/dbconfig/20220609-072006-marostegui.json
[07:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:22:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:23:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:23:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:24:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) - https://phabricator.wikimedia.org/T209182 (10Volans) 05Open→03Resolved a:03Volans This can be solved, was just forgotten AFAICT. We do us...
[07:26:14] <apergos>	 samwilson: your change is now live on mwdebug1002, feel free to test ;-) 
[07:26:30] <samwilson>	 apergos: thanks, testing now
[07:28:42] <samwilson>	 apergos: hmm, I may have got something wrong with the config. I'm expecting to see a change at e.g. https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Test&action=edit but it's not working.
[07:28:56] <samwilson>	 (with debug1002)
[07:29:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:29:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:45] <apergos>	 well you can't really route changes to mwdebug (prod) for beta cluster requests
[07:29:58] <apergos>	 that's a bit of a trick "testing" request :-)
[07:30:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:30:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:12] <samwilson>	 oh hehe right!
[07:30:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:31] <apergos>	 now the +2 should have made your changes show up immediately in beta, I think. that's certainly true for mw core and extensions
[07:30:40] <samwilson>	 so... when does config go live for the beta cluster?
[07:30:41] <apergos>	 I think it's true for wmf-config
[07:30:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:59] <apergos>	 can you check on a random app server over there or should I?
[07:31:33] <samwilson>	 to see if the config has been updated?
[07:31:36] <jnuche>	 samwilson: it seems the config change should be in beta -> https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/
[07:31:44] <samwilson>	 I'm not actually sure where to check; can you do it?
[07:31:54] <apergos>	 thanks jnuche for that! 
[07:32:01] <jnuche>	 in particular: https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/396/
[07:32:10] <apergos>	 ( jnuche is our trainee doing the deployment today.)
[07:32:43] <samwilson>	 jnuche: thanks! hmm I think I've got the config wrong then
[07:34:38] <samwilson>	 If I quickly make a follow-up patch now can you deploy it?
[07:34:41] <apergos>	 I do see the config on a random mw instance in deployment-prep
[07:34:49] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:35:12] <wikibugs>	 (03PS1) 10Samwilson: [beta cluster] Fix $wgVectorMaxWidthOptions array depth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804255 (https://phabricator.wikimedia.org/T307725)
[07:35:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:35:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T310011)', diff saved to https://phabricator.wikimedia.org/P29576 and previous config saved to /var/cache/conftool/dbconfig/20220609-073546-marostegui.json
[07:35:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:51] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[07:35:51] <wikibugs>	 (03PS2) 10Samwilson: [beta cluster] Fix $wgVectorMaxWidthOptions array depth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804255 (https://phabricator.wikimedia.org/T307725)
[07:36:00] <apergos>	 samwilson: should we deploy the current one everywhere first, then the new one too, or revert?
[07:36:58] <samwilson>	 hmm is it bad to deploy and then follow-up? or would you rather have a revert? 
[07:37:05] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2087: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/803912
[07:37:51] <samwilson>	 apergos: the new patch is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/804255 
[07:38:08] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db2087: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/803912 (owner: 10Marostegui)
[07:38:29] <apergos>	 samwilson: in this case since it's beta, I think it's fine to just deploy the first one fully and then follow up. 
[07:38:59] <samwilson>	 okay, cool. yeah that sounds good. I'm sorry for the messiness!
[07:39:31] <apergos>	 no worries, we'll get it sorted :-)
[07:39:51] <apergos>	 so for the next patch, we'll do this in three steps
[07:39:54] <apergos>	 1) merge
[07:39:58] <samwilson>	 jnuche: I'm just trying to give you more practice! :P 
[07:40:00] <apergos>	 2) you test in beta then, once it's available
[07:40:07] <jnuche>	 :D
[07:40:14] <apergos>	 3) once that checks out, continue with regular deploy. sound good?
[07:40:29] <samwilson>	 apergos: yep that makes sense
[07:43:09] <logmsgbot>	 !log jnuche@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:804017|[beta cluster] Update $wgVectorMaxWidthOptions to include action=edit (T307725)]] (duration: 03m 41s)
[07:43:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:13] <stashbot>	 T307725: Make action=edit with 2010 wikitext editor a full-width page in Vector-2022 - https://phabricator.wikimedia.org/T307725
[07:43:39] <apergos>	 ordinarily we would ask you to test again but in this case, skipping that step, samwilson :-)
[07:43:59] <mmandere>	 !log depool cp5006 for troubleshooting instance state unknown
[07:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:02] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+2] [beta cluster] Fix $wgVectorMaxWidthOptions array depth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804255 (https://phabricator.wikimedia.org/T307725) (owner: 10Samwilson)
[07:44:09] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:44:20] <samwilson>	 sure!
[07:44:41] <apergos>	 samwilson:  please do put your second patch in the calendar
[07:44:47] <wikibugs>	 (03Merged) 10jenkins-bot: [beta cluster] Fix $wgVectorMaxWidthOptions array depth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804255 (https://phabricator.wikimedia.org/T307725) (owner: 10Samwilson)
[07:44:56] <samwilson>	 oh yep, good point; doing now
[07:45:00] <apergos>	 both for the record and for easy access to the deployment commands link :-)
[07:45:17] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:45:43] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10elukey) a:05calbon→03None
[07:45:54] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10elukey) Chris approved, I think that we can proceed!
[07:45:58] <samwilson>	 apergos: done https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_June_9
[07:46:12] <apergos>	 awesome! 
[07:47:13] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10elukey)
[07:47:35] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:48:09] <jnuche>	 samwilson: the second patch is in beta, please go ahead and take a look :)
[07:49:08] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10elukey) My understanding is that we are requiring the same level of access to ORES nodes to a new team member, refactoring a bit how groups are related to each o...
[07:49:15] <samwilson>	 jnuche: hooray yep it works now :-) 
[07:49:32] <apergos>	 great!
[07:50:40] <apergos>	 stick around please, we'll want to have you here all the way through the official deployment
[07:50:40] <samwilson>	 thanks for putting up with my confusions!
[07:50:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P29577 and previous config saved to /var/cache/conftool/dbconfig/20220609-075051-marostegui.json
[07:50:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:27] <samwilson>	 yep, no worries I'm still here
[07:51:36] <jnuche>	 samwilson: no worries! the change is now on mwdebug1001, but there's no testing to be done so we'll move on to syncing the rest of prod
[07:52:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:52:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:52:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:29] <samwilson>	 I've checked anyway, and all is well :)
[07:53:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:38] <elukey>	 !log drop DRBD disk template from ml-etcd2* nodes - T310073
[07:53:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:42] <stashbot>	 T310073: Investigate high latencies registered by the ml-serve api control plane - https://phabricator.wikimedia.org/T310073
[07:55:40] <logmsgbot>	 !log jnuche@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:804255|[beta cluster] Fix $wgVectorMaxWidthOptions array depth (T307725)]] (duration: 03m 40s)
[07:55:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:45] <stashbot>	 T307725: Make action=edit with 2010 wikitext editor a full-width page in Vector-2022 - https://phabricator.wikimedia.org/T307725
[07:56:29] <jnuche>	 samwilson: done!
[07:56:45] <jnuche>	 both changes have been sync'ed out everywhere
[07:58:09] <samwilson>	 jnuche: terrific, thanks :-)
[07:58:10] <wikibugs>	 (03PS1) 10KartikMistry: Update nodejs -> node command [deployment-charts] - 10https://gerrit.wikimedia.org/r/804256
[07:58:29] <apergos>	 !log UTC morning backport and config training window done
[07:58:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:01] <icinga-wm>	 PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:01:28] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1061.eqiad.wmnet with OS bullseye
[08:01:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:32] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1061.eqiad.wmnet with OS bullseye
[08:01:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Merge tag 'upstream/0.0.7' into debian [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803870 (owner: 10Filippo Giunchedi)
[08:01:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] New release 0.0.7-1 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803871 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi)
[08:02:19] <icinga-wm>	 RECOVERY - etcd request latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:02:25] <wikibugs>	 (03Merged) 10jenkins-bot: Merge tag 'upstream/0.0.7' into debian [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803870 (owner: 10Filippo Giunchedi)
[08:02:27] <wikibugs>	 (03Merged) 10jenkins-bot: New release 0.0.7-1 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/803871 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi)
[08:02:55] <wikibugs>	 (03PS2) 10KartikMistry: Update nodejs -> node command [deployment-charts] - 10https://gerrit.wikimedia.org/r/804256
[08:05:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P29578 and previous config saved to /var/cache/conftool/dbconfig/20220609-080556-marostegui.json
[08:06:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:09] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:09:45] <wikibugs>	 (03PS8) 10Slyngshede: class:apt Add new private repo. [puppet] - 10https://gerrit.wikimedia.org/r/803512
[08:12:13] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:20:02] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1061.eqiad.wmnet with reason: host reimage
[08:21:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T310011)', diff saved to https://phabricator.wikimedia.org/P29579 and previous config saved to /var/cache/conftool/dbconfig/20220609-082102-marostegui.json
[08:21:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[08:21:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[08:22:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Merging so we can (PCC) test at least" [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond)
[08:22:50] <wikibugs>	 (03PS23) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond)
[08:23:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1061.eqiad.wmnet with reason: host reimage
[08:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:32:12] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:32:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:32:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29580 and previous config saved to /var/cache/conftool/dbconfig/20220609-083232-marostegui.json
[08:32:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:37] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[08:34:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: Add role::netmon to the netmon1003 instance. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[08:38:21] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1061.eqiad.wmnet with OS bullseye
[08:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:26] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1061.eqiad.wmnet with OS bullseye completed: - ms-be1061 (**PASS**)   - Downtim...
[08:39:26] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:40:21] <logmsgbot>	 !log mmandere@cumin1001 conftool action : set/pooled=no; selector: name=cp5006.*
[08:40:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:04] <wikibugs>	 (03PS5) 10Jaime Nuche: scap: bootstrap freshly provisioned scap targets [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713)
[08:41:24] <wikibugs>	 (03CR) 10Jaime Nuche: scap: bootstrap freshly provisioned scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche)
[08:42:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede)
[08:46:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:46:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29581 and previous config saved to /var/cache/conftool/dbconfig/20220609-084620-marostegui.json
[08:46:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:25] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[08:46:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] scap: bootstrap freshly provisioned scap targets [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche)
[08:48:48] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Fix zookeeper typo [puppet] - 10https://gerrit.wikimedia.org/r/804263
[08:51:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:01:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29583 and previous config saved to /var/cache/conftool/dbconfig/20220609-090125-marostegui.json
[09:01:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: phabricator: add blackbox http check [puppet] - 10https://gerrit.wikimedia.org/r/804266 (https://phabricator.wikimedia.org/T305847)
[09:03:18] <wikibugs>	 (03PS1) 10Volans: reports.coherence: exclude patch panels [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/804267
[09:04:25] <wikibugs>	 (03PS1) 10Jaime Nuche: scap: fix bootstrap exec command [puppet] - 10https://gerrit.wikimedia.org/r/804268 (https://phabricator.wikimedia.org/T309713)
[09:05:27] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Thanks. Yes I think it's fine to start the new metrics based on the fixed typo." [puppet] - 10https://gerrit.wikimedia.org/r/804263 (owner: 10Filippo Giunchedi)
[09:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:05:59] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:07:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Cheers Ben!" [puppet] - 10https://gerrit.wikimedia.org/r/804263 (owner: 10Filippo Giunchedi)
[09:08:02] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] reports.coherence: exclude patch panels [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/804267 (owner: 10Volans)
[09:08:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35802/console" [puppet] - 10https://gerrit.wikimedia.org/r/804266 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[09:08:29] <icinga-wm>	 PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:10:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] scap: fix bootstrap exec command [puppet] - 10https://gerrit.wikimedia.org/r/804268 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche)
[09:10:40] <wikibugs>	 (03CR) 10Volans: [C: 03+2] reports.coherence: exclude patch panels [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/804267 (owner: 10Volans)
[09:11:22] <wikibugs>	 (03Merged) 10jenkins-bot: reports.coherence: exclude patch panels [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/804267 (owner: 10Volans)
[09:11:53] <icinga-wm>	 RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:12:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1143 on s4 with small weight after installing 10.6 T310114', diff saved to https://phabricator.wikimedia.org/P29584 and previous config saved to /var/cache/conftool/dbconfig/20220609-091224-root.json
[09:12:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:29] <stashbot>	 T310114: Migrate a s4 DB host to mariadb 10.6 - https://phabricator.wikimedia.org/T310114
[09:14:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "See PCC, AFAIK it isn't possible to preview the changes on prometheus hosts due to exported resources usage (i.e. we see diff on phab host" [puppet] - 10https://gerrit.wikimedia.org/r/804266 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[09:15:47] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[09:16:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29585 and previous config saved to /var/cache/conftool/dbconfig/20220609-091630-marostegui.json
[09:16:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:23] <icinga-wm>	 PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:19:23] <icinga-wm>	 RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:24:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[09:24:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[09:24:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:24:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:24:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298560)', diff saved to https://phabricator.wikimedia.org/P29586 and previous config saved to /var/cache/conftool/dbconfig/20220609-092413-ladsgroup.json
[09:24:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:21] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[09:25:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:26:45] <Amir1>	 !log killed enwiki's refreshLinkRecommendations (T299021)
[09:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:51] <stashbot>	 T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021
[09:28:01] <icinga-wm>	 PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:30:15] <icinga-wm>	 RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:31:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29587 and previous config saved to /var/cache/conftool/dbconfig/20220609-093135-marostegui.json
[09:31:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:31:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:31:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:31:41] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[09:31:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:44] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:31:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T310011)', diff saved to https://phabricator.wikimedia.org/P29588 and previous config saved to /var/cache/conftool/dbconfig/20220609-093148-marostegui.json
[09:31:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:28] <wikibugs>	 (PS1) Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847)
[09:36:49] <icinga-wm>	 PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:37:27] <wikibugs>	 (PS1) Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/804276
[09:38:05] <wikibugs>	 (PS2) Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/804276
[09:39:01] <icinga-wm>	 RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:45:47] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:47:49] <icinga-wm>	 PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:50:03] <icinga-wm>	 RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:56:34] <wikibugs>	 (CR) MarcoAurelio: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/803916 (https://phabricator.wikimedia.org/T309866) (owner: MarcoAurelio)
[09:58:14] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons.
[09:58:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] <jouncebot>	 mvolz: Your horoscope predicts another unfortunate Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1000).
[10:07:03] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:12:46] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon)
[10:16:01] <jinxer-wm>	 (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[10:16:17] <XioNoX>	 looking
[10:16:41] <godog>	 checking too
[10:16:50] <XioNoX>	 looks like a spike for france
[10:17:46] <godog>	 indeed, doesn't seem actionable right now to me ?
[10:17:58] <jayme>	 👀
[10:18:18] <XioNoX>	 looks like it's already going down too
[10:18:56] <godog>	 sorry I forgot to ack, doing so now
[10:19:04] <XioNoX>	 checking librenms
[10:19:50] <XioNoX>	 we can see a drop of inbound traffic
[10:20:09] <XioNoX>	 on telia mostly
[10:20:32] <XioNoX>	 and drop of outbound, most likely as a consequence on telia and orange
[10:20:33] <XioNoX>	 https://librenms.wikimedia.org/graphs/to=1654769700/id=23135/type=port_bits/from=1654748100/
[10:21:01] <jinxer-wm>	 (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[10:21:52] <godog>	 indeed, and looks like it is coming back now
[10:22:00] <godog>	 "now" as in, the next datapoint in librenms
[10:22:18] <XioNoX>	 indeed
[10:22:25] <godog>	 can see the same here https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=now-1h&to=now&var-site=drmrs&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4
[10:22:26] <XioNoX>	 so looks like a telia "blip"
[10:23:05] <godog>	 agreed
[10:23:25] <godog>	 going back to my lunch, happy to discuss tuning NEL too later
[10:28:33] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:32:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T310011)', diff saved to https://phabricator.wikimedia.org/P29590 and previous config saved to /var/cache/conftool/dbconfig/20220609-103204-marostegui.json
[10:32:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:09] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[10:47:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29591 and previous config saved to /var/cache/conftool/dbconfig/20220609-104709-marostegui.json
[10:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:38] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (cmooney) @nskaggs anyone with access to Netbox and ability to run homer (which I believe should be most of SRE) shoul...
[10:50:52] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (Tobi_WMDE_SW) >>! In T310055#7989747, @KFrancis wrote: > @MoritzMuehlenhoff Thanks for checking in.  Because Goran is no longer an employee of WMDE, I should process...
[10:54:37] <icinga-wm>	 PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100%
[10:55:08] <mmandere>	 !log restart cp5006
[10:55:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:23] <icinga-wm>	 RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 238.79 ms
[11:02:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29592 and previous config saved to /var/cache/conftool/dbconfig/20220609-110214-marostegui.json
[11:02:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T310011)', diff saved to https://phabricator.wikimedia.org/P29593 and previous config saved to /var/cache/conftool/dbconfig/20220609-111719-marostegui.json
[11:17:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[11:17:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[11:17:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:25] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[11:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:32] <mmandere>	 !log pool cp5006 after restart
[11:26:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:05] <logmsgbot>	 !log mmandere@cumin1001 conftool action : set/pooled=yes; selector: name=cp5006.*
[11:28:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[11:29:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[11:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29594 and previous config saved to /var/cache/conftool/dbconfig/20220609-112945-marostegui.json
[11:29:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:48] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[11:35:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:38:54] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons.
[11:38:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:18] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons.
[11:42:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29595 and previous config saved to /var/cache/conftool/dbconfig/20220609-114740-marostegui.json
[11:47:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:46] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[11:52:27] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons.
[11:52:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29596 and previous config saved to /var/cache/conftool/dbconfig/20220609-120245-marostegui.json
[12:02:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti3002.esams.wmnet with reason: Remove from cluster for firmware update and eventual reimage
[12:15:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti3002.esams.wmnet with reason: Remove from cluster for firmware update and eventual reimage
[12:15:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:28] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons.
[12:16:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29597 and previous config saved to /var/cache/conftool/dbconfig/20220609-121750-marostegui.json
[12:17:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:46] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff)
[12:23:48] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→RobH ganeti3002 is removed from the cluster, downtimed and needs the same firmware/NIC updates to enable the reimage to Bullseye.
[12:32:06] <wikibugs>	 SRE, ops-eqiad: Failed PSU on ganeti1023 - https://phabricator.wikimedia.org/T310041 (Jclark-ctr) Open→Resolved Reseated power cable
[12:32:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29598 and previous config saved to /var/cache/conftool/dbconfig/20220609-123256-marostegui.json
[12:33:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance
[12:33:01] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[12:33:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance
[12:33:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: Maintenance
[12:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: Maintenance
[12:33:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:39:04] <wikibugs>	 SRE, Analytics: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (BTullis) @hashar Is this latency with Archiva still apparent? I guess it probably is, since you had to increase the timeouts again in February of this year.
[12:43:00] <wikibugs>	 SRE, Analytics, Data-Engineering, Traffic-Icebox: varnishkafka / ATSkafka should support setting the kafka message timestamp - https://phabricator.wikimedia.org/T277553 (BTullis) Adding the #data-engineering tag so that this ticket does not get dropped when we deprecate #analytics.
[12:45:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[12:45:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[12:45:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29599 and previous config saved to /var/cache/conftool/dbconfig/20220609-124529-marostegui.json
[12:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:34] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[12:47:37] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ganeti1023 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[12:47:59] <wikibugs>	 (PS1) Filippo Giunchedi: netops: add PingUnavailable alert [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860)
[12:48:00] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul)
[12:49:23] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:49:43] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons.
[12:49:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:38] <moritzm>	 !log installing xen security updates (client-side libs only)
[12:57:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1300).
[13:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:35] <moritzm>	 !log installing libjpeg-turbo security updates
[13:00:36] <wikibugs>	 SRE, Analytics, Traffic-Icebox: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (BTullis) Should we decline this ticket?
[13:00:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29600 and previous config saved to /var/cache/conftool/dbconfig/20220609-130042-marostegui.json
[13:00:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:47] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[13:01:06] <wikibugs>	 (PS1) Jaime Nuche: scap: switch over from Debian package to self-installed scap [puppet] - https://gerrit.wikimedia.org/r/804306 (https://phabricator.wikimedia.org/T303559)
[13:01:21] <wikibugs>	 (CR) Filippo Giunchedi: ""experimental" for now to have sth to put out there and iterate on" [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[13:03:32] <wikibugs>	 SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Papaul)
[13:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:10:07] <wikibugs>	 (CR) Hnowlan: [C: +2] Re-enable OSM sync in codfw [puppet] - https://gerrit.wikimedia.org/r/803893 (owner: MSantos)
[13:15:09] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1062.eqiad.wmnet with OS bullseye
[13:15:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:13] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1062.eqiad.wmnet with OS bullseye
[13:15:35] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[13:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P29601 and previous config saved to /var/cache/conftool/dbconfig/20220609-131548-marostegui.json
[13:15:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:20] <wikibugs>	 (PS1) Jaime Nuche: scap: remove scap Debian package from targets [puppet] - https://gerrit.wikimedia.org/r/804311 (https://phabricator.wikimedia.org/T303559)
[13:17:49] <wikibugs>	 (PS4) JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - https://gerrit.wikimedia.org/r/802811
[13:19:07] <wikibugs>	 (CR) JMeybohm: black format cookbooks/sre/__init__.py (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/802810 (owner: JMeybohm)
[13:19:11] <wikibugs>	 (Abandoned) JMeybohm: black format cookbooks/sre/__init__.py [cookbooks] - https://gerrit.wikimedia.org/r/802810 (owner: JMeybohm)
[13:20:22] <wikibugs>	 (PS5) JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - https://gerrit.wikimedia.org/r/802811
[13:20:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:22:00] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (JAllemandou) >>! In T232795#7991992, @BT...
[13:23:27] <wikibugs>	 (CR) CI reject: [V: -1] Make SREBatchBase operate on host groups [cookbooks] - https://gerrit.wikimedia.org/r/802811 (owner: JMeybohm)
[13:26:06] <wikibugs>	 (CR) JMeybohm: Make SREBatchBase operate on host groups (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/802811 (owner: JMeybohm)
[13:27:46] <wikibugs>	 (PS6) JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - https://gerrit.wikimedia.org/r/802811
[13:30:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P29602 and previous config saved to /var/cache/conftool/dbconfig/20220609-133053-marostegui.json
[13:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:59] <wikibugs>	 (CR) Muehlenhoff: [C: +2] profile::mariadb::ferm_misc: Remove old buster IDP nodes [puppet] - https://gerrit.wikimedia.org/r/803883 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff)
[13:34:39] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1062.eqiad.wmnet with reason: host reimage
[13:34:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:08] <wikibugs>	 (CR) Ayounsi: netops: add PingUnavailable alert (4 comments) [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[13:37:51] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1062.eqiad.wmnet with reason: host reimage
[13:37:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T310011)', diff saved to https://phabricator.wikimedia.org/P29603 and previous config saved to /var/cache/conftool/dbconfig/20220609-134558-marostegui.json
[13:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:06] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[13:47:56] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[13:47:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:50] <wikibugs>	 (PS2) Filippo Giunchedi: netops: add PingUnavailable alert [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860)
[13:50:56] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you for the feedback!" [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[13:54:19] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1062.eqiad.wmnet with OS bullseye
[13:54:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:24] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1062.eqiad.wmnet with OS bullseye completed: - ms-be1062 (**PASS**)   - Downtim...
[13:57:34] <wikibugs>	 (CR) Ayounsi: netops: add PingUnavailable alert (3 comments) [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[13:59:29] <wikibugs>	 (PS2) Muehlenhoff: Enable webperf1004/2004 as new Arclamp hosts [puppet] - https://gerrit.wikimedia.org/r/802749
[14:07:56] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1063.eqiad.wmnet with OS bullseye
[14:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:00] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1063.eqiad.wmnet with OS bullseye
[14:09:31] <moritzm>	 !log masking Excimer/Arclamp services/timers on webperf1002/2002 T305460
[14:09:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:35] <stashbot>	 T305460: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460
[14:11:03] <wikibugs>	 (CR) Filippo Giunchedi: netops: add PingUnavailable alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[14:12:16] <wikibugs>	 (PS3) Filippo Giunchedi: netops: add PingUnavailable alert [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860)
[14:14:18] <wikibugs>	 (CR) Ayounsi: netops: add PingUnavailable alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[14:15:29] <icinga-wm>	 PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:43] <icinga-wm>	 PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:52] <wikibugs>	 (PS4) Filippo Giunchedi: netops: add PingUnreachable alert [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860)
[14:16:54] <moritzm>	 ^this is expected, I'll downtime/ack
[14:17:10] <wikibugs>	 (CR) Filippo Giunchedi: netops: add PingUnreachable alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[14:17:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on webperf1002.eqiad.wmnet with reason: Migration to new Bullseye nodes
[14:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on webperf1002.eqiad.wmnet with reason: Migration to new Bullseye nodes
[14:17:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on webperf2002.codfw.wmnet with reason: Migration to new Bullseye nodes
[14:17:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on webperf2002.codfw.wmnet with reason: Migration to new Bullseye nodes
[14:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:01] <icinga-wm>	 RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:25] <wikibugs>	 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff)
[14:26:16] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1063.eqiad.wmnet with reason: host reimage
[14:26:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:20] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1063.eqiad.wmnet with reason: host reimage
[14:29:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:32:41] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1064 - https://phabricator.wikimedia.org/T310181 (10Cmjohnson) 05Open→03Declined duplicate
[14:35:06] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host aqs1018.mgmt.eqiad.wmnet with reboot policy FORCED
[14:35:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable webperf1004/2004 as new Arclamp hosts [puppet] - 10https://gerrit.wikimedia.org/r/802749 (owner: 10Muehlenhoff)
[14:36:21] <wikibugs>	 (03CR) 10Herron: [C: 03+1] sre: include tcp probes in alerts [alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[14:43:56] <wikibugs>	 (03PS1) 10MSantos: mobileapps: bump to 2022-06-06-111800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/804330
[14:44:24] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[14:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:28] <wikibugs>	 (03PS2) 10Cwhite: logstash: truncate labels.normalized_message [puppet] - 10https://gerrit.wikimedia.org/r/804010 (https://phabricator.wikimedia.org/T234565)
[14:45:35] <wikibugs>	 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff)
[14:45:47] <wikibugs>	 (03PS2) 10Muehlenhoff: Point active arclamp host to webperf1004 and update dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/802750 (https://phabricator.wikimedia.org/T305460)
[14:48:24] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1018.mgmt.eqiad.wmnet with reboot policy FORCED
[14:48:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:06] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Cmjohnson)
[14:52:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Point active arclamp host to webperf1004 and update dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/802750 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff)
[14:52:30] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: truncate labels.normalized_message [puppet] - 10https://gerrit.wikimedia.org/r/804010 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[14:53:32] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:53:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:22] <wikibugs>	 (03CR) 10BCornwall: Traffic: Add PyBal BGP sessions (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall)
[14:55:02] <wikibugs>	 (03CR) 10Ahmon Dancy: "LGTM overall." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785308 (owner: 10Krinkle)
[14:55:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:55:19] <wikibugs>	 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff)
[14:56:04] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster
[14:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster
[14:57:38] <wikibugs>	 (03PS1) 10Muehlenhoff: ALso point codfw to the new node [puppet] - 10https://gerrit.wikimedia.org/r/804333 (https://phabricator.wikimedia.org/T305460)
[14:58:39] <wikibugs>	 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10ayounsi) 05Declined→03Open The topic came back again Today as hosts requests in T307641 got provisioned without the additional IPs requiring heavy manual work to get it fixed. And it's not t...
[14:59:31] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[14:59:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] ALso point codfw to the new node [puppet] - 10https://gerrit.wikimedia.org/r/804333 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff)
[15:00:00] <wikibugs>	 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Volans) 05Resolved→03Open Re-opening as those were not provisioned as cassandra hosts and the additional DNS records were not generated by the provisioning scr...
[15:00:10] <wikibugs>	 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff)
[15:01:13] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[15:01:37] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:03:16] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:03:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:20] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[15:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:37] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX
[15:07:33] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1063.eqiad.wmnet with OS bullseye
[15:07:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:37] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1063.eqiad.wmnet with OS bullseye completed: - ms-be1063 (**PASS**)   - Downtim...
[15:10:07] <wikibugs>	 (03PS3) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723)
[15:10:19] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:14:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove webperf1002/webperf2002 from Kafka firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/804334 (https://phabricator.wikimedia.org/T305460)
[15:15:08] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: bullseye upgrade - bking@cumin1001 - T289135
[15:15:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:13] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[15:16:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove rsync config only needed for stretch->bullseye migration [puppet] - 10https://gerrit.wikimedia.org/r/804339 (https://phabricator.wikimedia.org/T305460)
[15:18:01] <wikibugs>	 (03PS1) 10Muehlenhoff: coal: Remove support for pre Bullseye installs [puppet] - 10https://gerrit.wikimedia.org/r/804340 (https://phabricator.wikimedia.org/T305460)
[15:19:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2036.codfw.wmnet with OS bullseye
[15:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:49] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[15:20:30] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch old Stretch arclamp nodes to role::insetup until eventual decom [puppet] - 10https://gerrit.wikimedia.org/r/804341 (https://phabricator.wikimedia.org/T305460)
[15:25:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:33:51] <wikibugs>	 (03PS1) 10Ayounsi: CDN: disable caching for netbox-exports [puppet] - 10https://gerrit.wikimedia.org/r/804345 (https://phabricator.wikimedia.org/T296452)
[15:35:13] <wikibugs>	 (03PS2) 10DCausse: [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869)
[15:35:43] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2036.codfw.wmnet with reason: host reimage
[15:35:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/804345 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi)
[15:38:05] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2036.codfw.wmnet with reason: host reimage
[15:38:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:12] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:11] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/804345 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi)
[15:44:50] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] CDN: disable caching for netbox-exports [puppet] - 10https://gerrit.wikimedia.org/r/804345 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi)
[15:45:18] <wikibugs>	 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Volans) I've created the records for 2 cassandra instances (`-a` and `-b`) in Netbox. Changelog: https://netbox.wikimedia.org/extras/changelog/?request_id=774eba10-...
[15:46:37] <XioNoX>	 !log set cache "pass" to netbox-exports
[15:46:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:43] <icinga-wm>	 PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:29] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[15:48:13] <icinga-wm>	 PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:36] <inflatador>	 ^^ these are getting reimaged ATM, not sure why alerts are still on
[15:48:42] <inflatador>	 will try and silence
[15:52:21] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1142.eqiad.wmnet with OS buster
[15:52:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster execute...
[15:53:08] <wikibugs>	 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Volans) 05Open→03Resolved Re-resolving.
[15:53:12] <icinga-wm>	 PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[15:53:21] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Remove unused setting wgQuickSurveysUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804014 (https://phabricator.wikimedia.org/T285890) (owner: 10Catrope)
[15:53:30] <moritzm>	 !log installing curl security updates
[15:53:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:41] <wikibugs>	 (03PS3) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246)
[15:54:12] <icinga-wm>	 RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[15:56:45] <wikibugs>	 10SRE, 10Observability-Alerting: Aggregate check_mw_versions alerts for each individual app server - https://phabricator.wikimedia.org/T251942 (10Krinkle)
[15:57:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs: Rolling AQS Cassandra cluster to pick up new encryption settings - btullis@cumin1001
[15:57:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:09] <jinxer-wm>	 (MXQueueHigh) firing: MX host mx1001:9100 has many queued messages: 4066 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh
[15:58:40] <godog>	 mhhh checking
[15:58:43] <jayme>	 sweet :)
[15:58:58] <jhathaway>	 looking as well
[15:59:16] <godog>	 thank you jhathaway !
[16:00:05] <jouncebot>	 jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:36] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2036.codfw.wmnet with OS bullseye
[16:00:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:47] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2036.codfw.wmnet with OS bullseye completed...
[16:00:50] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye
[16:00:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:59] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye
[16:04:59] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet
[16:05:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:43] <wikibugs>	 (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: 10Eevans)
[16:09:08] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1039.eqiad.wmnet
[16:09:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:55] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye
[16:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2054.codfw.wmnet with OS bullseye
[16:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:07] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed...
[16:10:11] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2054.codfw.wmnet with OS bullseye
[16:13:09] <jinxer-wm>	 (MXQueueHigh) resolved: MX host mx1001:9100 has many queued messages: 4074 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh
[16:14:56] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2054.codfw.wmnet with OS bullseye
[16:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:04] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2054.codfw.wmnet with OS bullseye executed...
[16:16:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Papaul) I tested the PXE boot on an-worker1142 and the server was not getting anything from the DHCP server. After debugging, I found out that the server...
[16:17:04] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: bullseye upgrade - bking@cumin1001 - T289135
[16:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:09] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[16:21:57] <wikibugs>	 (03PS3) 10Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276
[16:29:07] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:32:59] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:36:17] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2022-06-06-111800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/804330 (owner: 10MSantos)
[16:38:03] <wikibugs>	 (03CR) 10Cathal Mooney: netops: add PingUnreachable alert (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/804304 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi)
[16:39:51] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: bump to 2022-06-06-111800-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/804330 (owner: 10MSantos)
[16:43:10] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs: Rolling AQS Cassandra cluster to pick up new encryption settings - btullis@cumin1001
[16:43:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:56] <dancy>	 jouncebot nowandnext
[16:46:56] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1600)
[16:46:56] <jouncebot>	 In 1 hour(s) and 13 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1800)
[16:52:18] <rzl>	 dancy: nothing happening in the puppet window
[16:52:29] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: testing
[16:52:30] <dancy>	 👍🏾 Thanks Reuven
[16:52:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:49] <wikibugs>	 (03PS1) 10Esanders: Disable DiscussionTools' visualenhancements feature in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804395
[17:01:38] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti3002.esams.wmnet with OS bullseye
[17:01:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti3002.esams.wmnet with OS bullseye
[17:03:41] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) elastic2053.codfw.wmnet and elastic2054.codfw.wmnet both failed to reimage, this could be the same outdated firmware issue we saw when r...
[17:04:14] <wikibugs>	 10SRE, 10Analytics: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10hashar) @BTullis yes archiva is still rather slow. From the verbose curl commands above T273086#6783722, there is a large delay  (1+ seconds) before the transfer start and t...
[17:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:10:26] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] vrts: rename systemd timer to train spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:11:27] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] vrts: rename systemd timer to train spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:11:36] <wikibugs>	 (03PS2) 10Dzahn: vrts: rename systemd timer to train spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942)
[17:11:39] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2] vrts: rename systemd timer to train spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:12:12] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[17:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:43] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[17:12:46] <wikibugs>	 (03CR) 10Ottomata: airflow:manifests:instance.pp: Bump up number of DAG processors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803973 (owner: 10Mforns)
[17:12:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:16] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[17:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:35] <wikibugs>	 (03PS1) 10Dduvall: Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922
[17:14:06] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[17:14:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:21] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[17:15:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:06] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[17:16:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:57] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3002.esams.wmnet with reason: host reimage
[17:18:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:14] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] "typo in here, Arnold following up" [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:22:22] <wikibugs>	 (03PS1) 10AOkoth: vrts: fix type in timer name [puppet] - 10https://gerrit.wikimedia.org/r/804397 (https://phabricator.wikimedia.org/T293942)
[17:23:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] vrts: fix type in timer name [puppet] - 10https://gerrit.wikimedia.org/r/804397 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[17:23:42] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3002.esams.wmnet with reason: host reimage
[17:23:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:53] <wikibugs>	 (03PS1) 10AOkoth: vrts: rename cleanup cache service [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942)
[17:29:08] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:29:23] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1040.eqiad.wmnet
[17:29:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:31:12] <wikibugs>	 (03CR) 10Dzahn: "let's not change the user name yet" [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[17:31:16] <icinga-wm>	 PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: vtrs_train_spamassassin.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:09] <mutante>	 ^ on that
[17:32:16] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1064.eqiad.wmnet with OS bullseye
[17:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:23] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1064.eqiad.wmnet with OS bullseye
[17:32:31] <wikibugs>	 (03PS2) 10AOkoth: vrts: rename cleanup cache service [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942)
[17:33:04] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:33:52] <icinga-wm>	 RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:18] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1040.eqiad.wmnet
[17:34:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:27] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] vrts: rename cleanup cache service [puppet] - 10https://gerrit.wikimedia.org/r/804398 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth)
[17:39:29] <dduvall>	 i think i'll re-roll group0 a little early unless someone objects. we're doing group0/group1/all today if all goes well with the former two
[17:39:43] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye
[17:39:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:51] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye
[17:40:06] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] vrts: rename TicketExport2Mbox file [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:40:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] vrts: rename TicketExport2Mbox file [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:40:19] <wikibugs>	 (03PS2) 10Dzahn: vrts: rename TicketExport2Mbox file [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942)
[17:40:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall)
[17:40:24] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2] vrts: rename TicketExport2Mbox file [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:40:43] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3002.esams.wmnet with OS bullseye
[17:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti3002.esams.wmnet with OS bullseye completed: - ganeti3002 (**PASS**)   - Downtimed on Icinga/Ale...
[17:41:42] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:42:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10RobH) a:05RobH→03MoritzMuehlenhoff ganeti3002 firmware updated for nic, bios, and idrac.  reimaged and ready for next one after you juggle this back into service =]
[17:42:51] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] vrts: rename ferm services from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802853 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:43:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] vrts: rename ferm services from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802853 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:44:17] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye
[17:44:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:25] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed...
[17:44:34] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye
[17:44:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:41] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye
[17:46:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1064.eqiad.wmnet with reason: host reimage
[17:46:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:14] <wikibugs>	 (03PS1) 10Dduvall: group0 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804403 (https://phabricator.wikimedia.org/T308068)
[17:48:16] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye
[17:48:16] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804403 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall)
[17:48:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:24] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed...
[17:48:28] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye
[17:48:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:36] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye
[17:49:39] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804403 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall)
[17:49:45] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1064.eqiad.wmnet with reason: host reimage
[17:49:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:34] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.15  refs T308068
[17:53:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:38] <stashbot>	 T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068
[17:54:27] <wikibugs>	 (03Abandoned) 10Dduvall: Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall)
[17:54:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:54:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:55:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:55:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:56:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:34] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] mx: rename OTRS database related variables [puppet] - 10https://gerrit.wikimedia.org/r/802854 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[17:57:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] mx: rename OTRS database related variables [puppet] - 10https://gerrit.wikimedia.org/r/802854 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[18:00:04] <jouncebot>	 dduvall and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1800).
[18:01:11] <wikibugs>	 (03Restored) 10Dduvall: Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall)
[18:01:35] <wikibugs>	 (03PS2) 10Dduvall: Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922
[18:03:04] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall)
[18:04:15] <logmsgbot>	 !log dduvall@deploy1002 backport aborted:  (duration: 00m 08s)
[18:04:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:16] <hauskatze>	 jouncebot: nowandnext
[18:06:16] <jouncebot>	 For the next 1 hour(s) and 53 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T1800)
[18:06:16] <jouncebot>	 In 1 hour(s) and 53 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T2000)
[18:10:49] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1064.eqiad.wmnet with OS bullseye
[18:10:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:52] <wikibugs>	 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1064.eqiad.wmnet with OS bullseye completed: - ms-be1064 (**PASS**)   - Downtim...
[18:13:00] <wikibugs>	 (03PS1) 10Dzahn: prometheus:ops: rename otrs references to vrts [puppet] - 10https://gerrit.wikimedia.org/r/804416 (https://phabricator.wikimedia.org/T293942)
[18:13:43] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] "Approved via scap backport" [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall)
[18:20:50] <wikibugs>	 (03Merged) 10jenkins-bot: Truncate failed requests errors to 4kB [extensions/CirrusSearch] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803922 (owner: 10Dduvall)
[18:21:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:21:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:22] <logmsgbot>	 !log dduvall@deploy1002 Started scap: Backport for [[gerrit:803922]] Truncate failed requests errors to 4kB
[18:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:00] <wikibugs>	 (03PS1) 10Andrew Bogott: Magnum: switch config to expect rabbit over TLS [puppet] - 10https://gerrit.wikimedia.org/r/804419
[18:26:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Magnum: switch config to expect rabbit over TLS [puppet] - 10https://gerrit.wikimedia.org/r/804419 (owner: 10Andrew Bogott)
[18:26:30] <logmsgbot>	 !log dduvall@deploy1002 Finished scap: Backport for [[gerrit:803922]] Truncate failed requests errors to 4kB (duration: 04m 08s)
[18:26:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:55] <wikibugs>	 (03PS1) 10Dduvall: group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804420 (https://phabricator.wikimedia.org/T308068)
[18:26:57] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804420 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall)
[18:27:44] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804420 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall)
[18:28:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:28:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:28:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:31:33] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.15  refs T308068
[18:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:36] <stashbot>	 T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068
[18:34:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:07] <logmsgbot>	 !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.15  refs T308068 (duration: 03m 34s)
[18:35:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:40] <wikibugs>	 (03PS1) 10Andrew Bogott: magnum: fill in the [keystone_auth] section [puppet] - 10https://gerrit.wikimedia.org/r/804423
[18:46:19] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] magnum: fill in the [keystone_auth] section [puppet] - 10https://gerrit.wikimedia.org/r/804423 (owner: 10Andrew Bogott)
[18:46:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:46:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:46:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:36] <wikibugs>	 (03PS1) 10Dduvall: all wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804425 (https://phabricator.wikimedia.org/T308068)
[18:50:42] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] all wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804425 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall)
[18:51:59] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804425 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall)
[18:52:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:52:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:26] <ryankemper>	 !log T309648 Copied newly built `wmf-elasticsearch-search-plugins` from stretch to bullseye (`root@apt1001:/home/ryankemper# reprepro copy bullseye-wikimedia stretch-wikimedia wmf-elasticsearch-search-plugins`); then ran `apt update` on `relforge*`; new plugin package showing as available now: `6.8.23-3~stretch 1001`
[18:53:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:30] <stashbot>	 T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
[18:54:08] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648
[18:54:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:00] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.15  refs T308068
[18:56:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:07] <stashbot>	 T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068
[18:57:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:58:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:27] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648
[18:58:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:30] <stashbot>	 T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
[19:02:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:02:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:02:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:50] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 137 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 140, active_shards: 140, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 137, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.54151624548736 https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:04:52] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:06:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:06:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:47] <wikibugs>	 (03CR) 10Daniel Kinzler: "James said on Slack that it's fine. But it needs manual deployment. We can do that together next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793837 (owner: 10D3r1ck01)
[19:17:25] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye
[19:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:33] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed...
[19:21:32] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648
[19:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:39] <stashbot>	 T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
[19:21:41] <logmsgbot>	 !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648
[19:21:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:22] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:24:28] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 140, active_shards: 280, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:25:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[19:32:28] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[19:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:14] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:36:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:20] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone_auth: remove domain setting [puppet] - 10https://gerrit.wikimedia.org/r/804431
[19:39:14] <wikibugs>	 (03PS2) 10Andrew Bogott: magnum keystone_auth: remove domain setting [puppet] - 10https://gerrit.wikimedia.org/r/804431
[19:40:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] magnum keystone_auth: remove domain setting [puppet] - 10https://gerrit.wikimedia.org/r/804431 (owner: 10Andrew Bogott)
[19:43:52] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[19:43:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:40] <icinga-wm>	 RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1004 is OK: 0 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[19:46:07] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1014.mgmt.eqiad.wmnet with reboot policy FORCED
[19:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:30] <icinga-wm>	 RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is OK: 0 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[19:46:44] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) @Krinkle Yep, that summary sounds right to me. That's wha...
[19:47:04] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:47:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:34] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[19:47:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:38] <icinga-wm>	 RECOVERY - Check for snapshots leaked by cinder backup agent on cloudcontrol1003 is OK: 0 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[19:51:10] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[19:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:34] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[19:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:48] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:54:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:53] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[19:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:12] <wikibugs>	 (03CR) 10Dave Pifke: [C: 03+1] Remove webperf1002/webperf2002 from Kafka firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/804334 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff)
[19:58:26] <wikibugs>	 (03CR) 10Dave Pifke: [C: 03+1] Switch old Stretch arclamp nodes to role::insetup until eventual decom [puppet] - 10https://gerrit.wikimedia.org/r/804341 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff)
[19:58:37] <wikibugs>	 (03CR) 10Dave Pifke: [C: 03+1] Remove rsync config only needed for stretch->bullseye migration [puppet] - 10https://gerrit.wikimedia.org/r/804339 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff)
[19:58:58] <wikibugs>	 (03CR) 10Dave Pifke: [C: 03+1] coal: Remove support for pre Bullseye installs [puppet] - 10https://gerrit.wikimedia.org/r/804340 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff)
[19:59:41] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:59:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:05] <jouncebot>	 brennen: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220609T2000).
[20:00:05] <jouncebot>	 mewoph and hauskatze: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:13] <hauskatze>	 o/
[20:00:23] <mewoph>	 👋
[20:00:27] <thcipriani>	 howdy! I can deploy today.
[20:01:02] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1014.mgmt.eqiad.wmnet with reboot policy FORCED
[20:01:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:19] <mutante>	 train the trainers
[20:01:31] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse)
[20:01:41] <thcipriani>	 somebody has to :)
[20:01:45] <mutante>	 :)
[20:02:54] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] kywiki: Add $wgSitename, $wgMetaNamespace & $wgMetaNamespaceTalk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803916 (https://phabricator.wikimedia.org/T309866) (owner: 10MarcoAurelio)
[20:03:30] <wikibugs>	 (03PS1) 10Nskaggs: Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437
[20:03:37] <thcipriani>	 hauskatze: do I need to run namespacedupes after I sync your change?
[20:03:39] <hauskatze>	 Who needs training, if they do it for you? :)
[20:03:42] <wikibugs>	 (03Merged) 10jenkins-bot: kywiki: Add $wgSitename, $wgMetaNamespace & $wgMetaNamespaceTalk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803916 (https://phabricator.wikimedia.org/T309866) (owner: 10MarcoAurelio)
[20:03:48] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic2053.codfw.wmnet
[20:03:49] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host elastic2053.codfw.wmnet
[20:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:02] <hauskatze>	 thcipriani: let me see if it applies clean on mwdebug first
[20:04:11] <hauskatze>	 1 or 2 by the way?
[20:04:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437 (owner: 10Nskaggs)
[20:04:34] <thcipriani>	 hauskatze: I only see one patch?
[20:04:47] <hauskatze>	 thcipriani: yep, but which mwdebug server are we using?
[20:05:00] <hauskatze>	 1001 or 1002 ?
[20:05:25] <thcipriani>	 hauskatze: should be on mwdebug1002 now
[20:05:32] <hauskatze>	 checking
[20:05:39] <hauskatze>	 at least wgSiteName should be checkable
[20:06:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:06:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:01] <hauskatze>	 thcipriani: lgtm; namespaceDupes is to be run first *without* --fix (dryrun)
[20:07:16] <thcipriani>	 :) on it.
[20:07:35] <hauskatze>	 if you can Paste the output it'd be nice
[20:07:51] <thcipriani>	 I do try to be nice
[20:08:30] <hauskatze>	 maybe phaste can do it for you?
[20:08:47] <thcipriani>	 I never remember how to do that from the servers (IIRC there's a way)
[20:08:52] <thcipriani>	 I'll just copy and paste
[20:08:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1015.mgmt.eqiad.wmnet with reboot policy FORCED
[20:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:13] <wikibugs>	 (03PS2) 10Nskaggs: Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437
[20:09:32] <hauskatze>	 never used it myself but it might be something like mwscript [blah blah] | phaste
[20:10:11] <wikibugs>	 (03PS3) 10Nskaggs: Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437
[20:10:13] <wikibugs>	 (03CR) 10jenkins-bot: Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437 (owner: 10Nskaggs)
[20:10:37] <thcipriani>	 hauskatze: https://phabricator.wikimedia.org/P29607
[20:10:44] <hauskatze>	 checking
[20:11:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:11:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:11:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:20] <hauskatze>	 thcipriani: looks great, all links can be fixed by the script
[20:11:28] <thcipriani>	 looks like it
[20:11:32] <hauskatze>	 I think we can deploy and run the script afterwards with --fix
[20:11:40] <thcipriani>	 perfect, doing now
[20:11:56] <wikibugs>	 (03PS1) 10Ahmon Dancy: Add new dsh groups for beta [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033)
[20:11:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:11:58] <wikibugs>	 (03PS1) 10Ahmon Dancy: scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033)
[20:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:20] <wikibugs>	 (03PS2) 10Ahmon Dancy: Add new dsh groups for beta [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033)
[20:13:22] <wikibugs>	 (03PS2) 10Ahmon Dancy: scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033)
[20:13:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10nskaggs) Thanks for the explanation. I just want to make sure if not a cookbook, then a runbook at least to make it v...
[20:14:06] <wikibugs>	 (03CR) 10Dzahn: Add new dsh groups for beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[20:14:27] <wikibugs>	 (03CR) 10Ahmon Dancy: Add new dsh groups for beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[20:14:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Fix keyword error: s/wait_random/wait_exponential [puppet] - 10https://gerrit.wikimedia.org/r/804437 (owner: 10Nskaggs)
[20:15:34] <thcipriani>	 hrm, php-fpm check-and-restart taking a while
[20:16:12] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:803916|kywiki: Add $wgSitename, $wgMetaNamespace & $wgMetaNamespaceTalk (T309866)]] (duration: 03m 36s)
[20:16:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:17] <stashbot>	 T309866: Localisation of the namespaces in the Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T309866
[20:17:30] <wikibugs>	 (03PS3) 10Ahmon Dancy: Add new dsh groups for beta [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033)
[20:17:32] <wikibugs>	 (03PS3) 10Ahmon Dancy: scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033)
[20:18:01] <thcipriani>	 there we go, running namespacedupes --fix now hauskatze 
[20:18:13] <thcipriani>	 !log mwmaint1002:mwscript namespaceDupes.php kywiki --fix
[20:18:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:27] <wikibugs>	 (03CR) 10Ahmon Dancy: Add new dsh groups for beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[20:18:27] <hauskatze>	 thanks :)
[20:20:33] <thcipriani>	 updated the same paste with the output: tl;dr: 276 fixed
[20:21:24] <thcipriani>	 dancy: are we now forcing restarts?
[20:21:32] <hauskatze>	 checking
[20:21:59] <dancy>	 thcipriani: Yes, always restart is enabled now
[20:22:08] <dancy>	 so it takes about 3 minutes to complete restarts.
[20:22:30] <hauskatze>	 thcipriani: output looks good to me
[20:22:39] <thcipriani>	 hauskatze: nice, thanks for checking :)
[20:22:41] <hauskatze>	 unless you feel otherwise?
[20:22:50] <thcipriani>	 nope, lgtm, too
[20:22:55] <hauskatze>	 great
[20:23:22] <hauskatze>	 I'll make a note in the task that the other translations for Scribunto and Gadgets and MWCore will take up to one week to display over there
[20:23:26] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1015.mgmt.eqiad.wmnet with reboot policy FORCED
[20:23:27] <thcipriani>	 dancy: I note there was a spike in errors with the last deploy, looks like jobrunners at a glance—expected?
[20:23:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:39] <hauskatze>	 and the l10nupdate script will take care of updating everything iirc
[20:23:40] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host wdqs1016.mgmt.eqiad.wmnet with reboot policy FORCED
[20:23:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:02] <dancy>	 thcipriani: jobrunners are excluded from restarts.
[20:24:03] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "+1 to the key names. this matches prod and they were missing. Can't speak for the actual machine names but ok to merge this." [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[20:25:41] <wikibugs>	 (03Merged) 10jenkins-bot: Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/803969 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse)
[20:26:34] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster
[20:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster
[20:26:52] <thcipriani>	 hrm, no, it's not jobrunners, it's mw hosts, but it's doing something with RunSingleJob
[20:27:25] <thcipriani>	 seems to have stopped, but that was a prolonged wave :\
[20:28:13] <wikibugs>	 (03Abandoned) 10Dzahn: delete expired ldap-corp certificates [puppet] - 10https://gerrit.wikimedia.org/r/791677 (owner: 10Dzahn)
[20:28:41] <wikibugs>	 (03PS2) 10Dzahn: Revert "Revert "phabricator: allow disabling ssh-phab service except on one host"" [puppet] - 10https://gerrit.wikimedia.org/r/778243
[20:29:31] <wikibugs>	 (03CR) 10Ahmon Dancy: Add new dsh groups for beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[20:29:33] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "Tested in beta" [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[20:30:24] <thcipriani>	 mewoph: still around? (sorry for delay)
[20:30:40] <mewoph>	 thcipriani: yes! no problem
[20:31:41] <thcipriani>	 mewoph: your change is on mwdebug1002, check please
[20:32:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:32:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Add new dsh groups for beta [puppet] - 10https://gerrit.wikimedia.org/r/804440 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[20:32:50] <mewoph>	 thcipriani: lgtm
[20:33:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:33:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:34] <hauskatze>	 thanks for deploying my patch thcipriani - always a pleasure :)
[20:34:50] <thcipriani>	 hauskatze: sure thing, and likewise :)
[20:35:04] <thcipriani>	 mewoph: cool, thanks for checking, syncing
[20:35:15] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster
[20:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster
[20:36:05] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1144.eqiad.wmnet with OS buster
[20:36:05] <wikibugs>	 (03CR) 10Dzahn: "You might need  'profile::swift::accounts_keys' too. see https://puppet-compiler.wmflabs.org/pcc-worker1003/35803/deployment-webperf22.dep" [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[20:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster
[20:36:22] <icinga-wm>	 PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:37:31] <mewoph>	 thcipriani: thank you!
[20:37:44] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS buster
[20:37:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster
[20:38:21] <wikibugs>	 (03Abandoned) 10MewOphaswongse: Suggested edits: Fix loading states when fetching additional tasks [extensions/GrowthExperiments] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803971 (https://phabricator.wikimedia.org/T309926) (owner: 10MewOphaswongse)
[20:38:33] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster
[20:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster
[20:39:15] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr)
[20:39:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "works on deployment-deploy03.deployment-prep and deployment-mediawiki11.deployment-prep.eqiad1.wikimedia.cloud. error above just affects w" [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[20:39:52] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized php-1.39.0-wmf.15/extensions/GrowthExperiments/modules: Backport: [[gerrit:803969|Suggested edits: Fix loading states when fetching additional tasks (T309926)]] (duration: 03m 37s)
[20:39:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:55] <stashbot>	 T309926: Suggested edits: edits browsing bug - https://phabricator.wikimedia.org/T309926
[20:39:59] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1016.mgmt.eqiad.wmnet with reboot policy FORCED
[20:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:06] <thcipriani>	 mewoph: ^ should be live now, and you're welcome :)
[20:40:10] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1142.eqiad.wmnet with OS buster
[20:40:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster exec...
[20:40:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:43:52] <thcipriani>	 !log end utc late backport window
[20:43:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:04] <wikibugs>	 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) >>! In T269328#7992477, @ayounsi wrote: > The topic came back again Today as hosts requests in T307641 got provisioned without the additional IPs requiring heavy manual work to get it fi...
[20:44:46] <wikibugs>	 (03CR) 10Ahmon Dancy: [V: 03+1] "Tested in beta.  Works now that the groups are set up." [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[20:44:58] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye
[20:44:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:04] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster
[20:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:07] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye
[20:45:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster
[20:46:38] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1144.eqiad.wmnet with OS buster
[20:46:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster exec...
[20:46:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1146.eqiad.wmnet with OS buster
[20:46:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster exec...
[20:47:00] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1145.eqiad.wmnet with OS buster
[20:47:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster exec...
[20:49:40] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye
[20:49:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:48] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed...
[20:52:06] <wikibugs>	 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) Also, via https://cassandra.apache.org/_/blog/Configurable-Storage-Ports-and-Why-We-Need-Them.html:  > ### How Do My Other Cassandra Nodes Know About Different storage_port Settings? > !...
[20:52:16] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1143.eqiad.wmnet with OS buster
[20:52:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster exec...
[20:56:48] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1142.eqiad.wmnet with reason: host reimage
[20:56:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:57] <wikibugs>	 (03PS4) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723)
[20:59:56] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1142.eqiad.wmnet with reason: host reimage
[20:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:49] <wikibugs>	 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) And to address @elukey's observation about what the docs say (updated url for that is now [[ https://cassandra.apache.org/doc/trunk/cassandra/configuration/cass_yaml_file.html#seed_provi...
[21:05:38] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:09:06] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2053.codfw.wmnet with OS bullseye
[21:09:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:14] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye
[21:09:57] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:10:58] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) On elastic2053.codfw.wmnet:    -  Updated iDRAC to firmware 5.10.10.00  (took 2 updates, first to 3.30.30)   -  Updated NIC firmware to...
[21:12:24] <wikibugs>	 (03PS1) 10BCornwall: Traffic Add alert for Varnish child restart [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723)
[21:12:42] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1142.eqiad.wmnet with OS buster
[21:12:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster comp...
[21:13:25] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2053.codfw.wmnet with OS bullseye
[21:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:32] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2053.codfw.wmnet with OS bullseye executed...
[21:13:59] <wikibugs>	 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) Reimaging failed again, will try again with a different host when work resumes (maybe next week?)
[21:16:37] <icinga-wm>	 RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:20:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:25:05] <icinga-wm>	 PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:33:15] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] scap.cfg.erb: Define php_fpm restart settings for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/804441 (https://phabricator.wikimedia.org/T237033) (owner: 10Ahmon Dancy)
[21:34:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) 05Stalled→03Open
[21:34:09] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1041.eqiad.wmnet
[21:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:11] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:36:41] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:38:09] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1041.eqiad.wmnet
[21:38:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:29] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:50:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) I talked with Jesse about all this. We agreed I will follow-up about the last few things, you Faidon, also mentioned in our mail. cpt-leads@, techchom...
[21:50:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) 05Open→03In progress
[22:03:45] <wikibugs>	 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) Ok, so I do need to walk some of this back.  It //is// now possible in 4.x (the documentation is correct in that context), thanks to [[ https://issues.apache.org/jira/browse/CASSANDRA-75...
[22:11:03] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:20:47] <icinga-wm>	 RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:30:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:36:07] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:02:06] <wikibugs>	 (PS1) Zabe: httpbb: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804465 (https://phabricator.wikimedia.org/T308013)
[23:02:08] <wikibugs>	 (PS1) Zabe: galera: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804466 (https://phabricator.wikimedia.org/T308013)
[23:02:10] <wikibugs>	 (PS1) Zabe: fifo_log_demux: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804467 (https://phabricator.wikimedia.org/T308013)
[23:02:12] <wikibugs>	 (PS1) Zabe: external_proxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804468 (https://phabricator.wikimedia.org/T308013)
[23:02:14] <wikibugs>	 (PS1) Zabe: external_clouds_vendors: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804469 (https://phabricator.wikimedia.org/T308013)
[23:02:16] <wikibugs>	 (PS1) Zabe: eventschemas: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804470 (https://phabricator.wikimedia.org/T308013)
[23:02:18] <wikibugs>	 (PS1) Zabe: etcdmirror: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804471 (https://phabricator.wikimedia.org/T308013)
[23:02:20] <wikibugs>	 (PS1) Zabe: envoyproxy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804472 (https://phabricator.wikimedia.org/T308013)
[23:02:22] <wikibugs>	 (PS1) Zabe: dumpsuser: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804473 (https://phabricator.wikimedia.org/T308013)
[23:02:24] <wikibugs>	 (PS1) Zabe: docker_registry_ha: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804474 (https://phabricator.wikimedia.org/T308013)
[23:02:27] <wikibugs>	 (PS1) Zabe: docker_pusher: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804475 (https://phabricator.wikimedia.org/T308013)
[23:02:29] <wikibugs>	 (PS1) Zabe: docker_pkg: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804476 (https://phabricator.wikimedia.org/T308013)
[23:02:31] <wikibugs>	 (PS1) Zabe: cpufrequtils: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804477 (https://phabricator.wikimedia.org/T308013)
[23:02:33] <wikibugs>	 (PS1) Zabe: conntrackd: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/804478 (https://phabricator.wikimedia.org/T308013)
[23:16:27] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[23:22:08] <wikibugs>	 (CR) Legoktm: [C: +1] "I used cumin to verify that all hosts already have cgroup-tools installed, so this is a no-op." [puppet] - https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: Samtar)
[23:22:15] <wikibugs>	 (PS4) Legoktm: mediawiki: Use non-transitional cgroups package for Bullseye [puppet] - https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: Samtar)
[23:25:28] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:30:10] <wikibugs>	 (CR) Legoktm: [C: +2] mediawiki: Use non-transitional cgroups package for Bullseye [puppet] - https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: Samtar)
[23:31:04] <wikibugs>	 Puppet, Infrastructure-Foundations, Patch-For-Review: Package 'cgroup-bin' has no installation candidate on Debian 11 (modules/mediawiki/manifests/cgroup.pp) - https://phabricator.wikimedia.org/T309449 (Legoktm) Open→Resolved
[23:35:29] <wikibugs>	 (CR) Legoktm: [C: -1] docker_registry_ha: Add SPDX headers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/804474 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[23:35:33] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[23:38:53] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:43:43] <legoktm>	 !issync
[23:43:44] <ircservserv-wm_>	 Syncing #wikimedia-operations (requested by legoktm)
[23:43:46] <ircservserv-wm_>	 Set /cs flags #wikimedia-operations topranks +Aiotv
[23:43:48] <ircservserv-wm_>	 Set /cs flags #wikimedia-operations rzl +Aiotv
[23:44:12] <rzl>	 thanks!
[23:44:51] <legoktm>	 yw :)
[23:45:21] <TheresNoTime>	 thanks for the +2 legoktm :)
[23:48:55] <legoktm>	 yw too!
[23:54:18] <wikibugs>	 (PS6) Eevans: Configure AQS Cassandra hosts (codfw) [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)