[00:06:52] <icinga-wm>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-06-28 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:19:56] <icinga-wm>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-06-28 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:23:56] <icinga-wm>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-06-28 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:39:15] <wikibugs>	 (PS4) Dzahn: vrts: add promtheus blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190)
[00:44:02] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:18] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:49:29] <wikibugs>	 SRE, ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (ssingh) Thanks @Cmjohnson. There is another host,   ` 20:45:18 <+icinga-wm> PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-...
[00:50:46] <wikibugs>	 SRE, ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (ssingh) On second thought, making it another task just for clarity. Sorry for the noise.
[00:50:46] <icinga-wm>	 RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-07-05 00:00:02 (3231 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:53:41] <wikibugs>	 (PS1) Tim Starling: Set wgStatsCacheType to mcrouter-primary-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662)
[00:54:22] <wikibugs>	 SRE, ops-eqiad: SSH on wtp1040.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T312185 (ssingh)
[00:55:44] <wikibugs>	 (CR) Krinkle: Set wgStatsCacheType to mcrouter-primary-dc (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[00:58:23] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (Krinkle) Open→Resolved a:tstarling
[01:00:52] <wikibugs>	 (PS2) Tim Starling: Set wgStatsCacheType to mcrouter-primary-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662)
[01:00:55] <wikibugs>	 (CR) Tim Starling: Set wgStatsCacheType to mcrouter-primary-dc (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[01:04:52] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab2001.wikimedia.org.service,rsync-data-backup-gitlab2001.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:08:04] <mutante>	 ^ we have an open ticket about that but were hoping it would be fixed this time
[01:10:46] <wikibugs>	 (CR) Dzahn: [C: +2] "body_regex_matches needs to be an array, same fix as for the gitlab check" [puppet] - https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190) (owner: Dzahn)
[01:14:15] <wikibugs>	 (CR) Dzahn: [C: +2] "double checked puppet on otrs1001. no error" [puppet] - https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190) (owner: Dzahn)
[01:16:42] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab2001.wikimedia.org.service,rsync-data-backup-gitlab2001.wikimedia.org.service daniel_zahn https://phabricator.wikimedia.org/T274463 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:00] <mutante>	 the alert is about a systemd unit trying to rsync to a VM that has been decom'ed. that's all 
[01:21:25] <mutante>	 !log gitlab1004 rm /lib/systemd/system/rsync-data-backup-gitlab2001.wikimedia.org.* ; systemctl reset-failed (T274463, T307142) - fix icinga alert after gitlab2001 was decom'ed, we didn't have puppet remove the timer/service 
[01:21:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:31] <stashbot>	 T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142
[01:21:31] <stashbot>	 T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[01:21:32] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:28:52] <mutante>	 !log gitlab1004 - rm /lib/systemd/system/rsync-config-backup-gitlab2001.wikimedia.org.*
[01:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:46:36] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:47:15] <wikibugs>	 (PS1) DLynch: Revert "Hide the lede section on mobile when DiscussionTools is enabled" [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/811406 (https://phabricator.wikimedia.org/T312177)
[01:57:54] <icinga-wm>	 RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-07-05 00:00:02 (3210 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:19:25] <wikibugs>	 SRE, Performance-Team, Traffic, serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[02:19:46] <wikibugs>	 SRE, Performance-Team, Traffic, serviceops: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I edited the task description with a proposed rollout plan, and I renamed the task to encompass the actual work, not just deciding on the work.
[02:21:38] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:22:58] <wikibugs>	 (CR) Tim Starling: [C: +2] "This is a prerequisite for the WRStats backport." [mediawiki-config] - https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[02:23:42] <wikibugs>	 (Merged) jenkins-bot: Set wgStatsCacheType to mcrouter-primary-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/811394 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[02:26:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:26:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:27:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:27:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:16] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T310662 g 811394 harmless prerequisite (duration: 03m 39s)
[02:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:20] <stashbot>	 T310662: Acceptably efficient AbuseFilter profiling storage backend - https://phabricator.wikimedia.org/T310662
[02:32:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:33:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:33:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:41] <wikibugs>	 (PS1) Tim Starling: Introduce new WRStats library for write-read stats [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/811407 (https://phabricator.wikimedia.org/T310662)
[02:37:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:37:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:36] <wikibugs>	 (PS3) Tim Starling: [MultiDC] Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc [mediawiki-config] - https://gerrit.wikimedia.org/r/809326 (https://phabricator.wikimedia.org/T278392) (owner: Krinkle)
[02:52:10] <wikibugs>	 (CR) Tim Starling: [C: +2] Introduce new WRStats library for write-read stats [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/811407 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[03:10:45] <wikibugs>	 (Merged) jenkins-bot: Introduce new WRStats library for write-read stats [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/811407 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[03:14:14] <icinga-wm>	 RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-07-05 00:00:01 (3210 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:17:18] <logmsgbot>	 !log tstarling@deploy1002 Started scap: WRStats core prereq T310662 g811407
[03:17:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:17:22] <stashbot>	 T310662: Acceptably efficient AbuseFilter profiling storage backend - https://phabricator.wikimedia.org/T310662
[03:18:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:18:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:18:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:18:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:19:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:19:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:22:56] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:23:36] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:28:18] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.133 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:34:39] <logmsgbot>	 !log tstarling@deploy1002 Finished scap: WRStats core prereq T310662 g811407 (duration: 17m 20s)
[03:34:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:34:42] <stashbot>	 T310662: Acceptably efficient AbuseFilter profiling storage backend - https://phabricator.wikimedia.org/T310662
[03:51:55] <wikibugs>	 (PS1) Tim Starling: FilterProfiler: use WRStats [extensions/AbuseFilter] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/811408 (https://phabricator.wikimedia.org/T310662)
[03:52:56] <wikibugs>	 (CR) Tim Starling: [C: +2] FilterProfiler: use WRStats [extensions/AbuseFilter] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/811408 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[04:06:33] <wikibugs>	 (Merged) jenkins-bot: FilterProfiler: use WRStats [extensions/AbuseFilter] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/811408 (https://phabricator.wikimedia.org/T310662) (owner: Tim Starling)
[04:15:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[04:15:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[04:16:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[04:16:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[04:16:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:46] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/AbuseFilter: T310662 deployment with possible post-send error spike due to ServiceWiring/FilterProfiler interdependency (duration: 03m 33s)
[04:18:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:49] <stashbot>	 T310662: Acceptably efficient AbuseFilter profiling storage backend - https://phabricator.wikimedia.org/T310662
[04:19:52] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:22:00] <icinga-wm>	 PROBLEM - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [250000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now
[04:29:14] <icinga-wm>	 RECOVERY - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is OK: OK: Less than 1.00% above the threshold [100000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now
[04:30:40] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:58] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:29] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (Marostegui) Thank you!
[05:04:26] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P30868 and previous config saved to /var/cache/conftool/dbconfig/20220706-050615-root.json
[05:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:16] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2159 to dbctl [puppet] - https://gerrit.wikimedia.org/r/811485 (https://phabricator.wikimedia.org/T311493)
[05:09:07] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db2159 to dbctl [puppet] - https://gerrit.wikimedia.org/r/811485 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:10:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2159 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P30869 and previous config saved to /var/cache/conftool/dbconfig/20220706-051046-marostegui.json
[05:10:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:51] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[05:11:29] <wikibugs>	 (PS1) Marostegui: db2158: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/811567 (https://phabricator.wikimedia.org/T311493)
[05:11:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: codfw s6 sanitarium master switch
[05:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: codfw s6 sanitarium master switch
[05:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:00] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:36] <wikibugs>	 (CR) Marostegui: [C: +2] db2158: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/811567 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:21:19] <wikibugs>	 (PS1) Marostegui: mariadb: db2076 no longer sanitariu master [puppet] - https://gerrit.wikimedia.org/r/811575 (https://phabricator.wikimedia.org/T311493)
[05:21:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P30870 and previous config saved to /var/cache/conftool/dbconfig/20220706-052119-root.json
[05:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:03] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: db2076 no longer sanitariu master [puppet] - https://gerrit.wikimedia.org/r/811575 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:33:29] <wikibugs>	 (PS1) Marostegui: db2159: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/811577 (https://phabricator.wikimedia.org/T311493)
[05:33:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 12 hosts with reason: codfw s7 sanitarium master switch
[05:33:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 12 hosts with reason: codfw s7 sanitarium master switch
[05:34:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:09] <wikibugs>	 (CR) Marostegui: [C: +2] db2159: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/811577 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:35:28] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (phab1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:36:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P30871 and previous config saved to /var/cache/conftool/dbconfig/20220706-053623-root.json
[05:36:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:02] <wikibugs>	 (PS1) Marostegui: mariadb: db2077 no longer s7 codfw sanitarium master [puppet] - https://gerrit.wikimedia.org/r/811578 (https://phabricator.wikimedia.org/T311493)
[05:39:09] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: db2077 no longer s7 codfw sanitarium master [puppet] - https://gerrit.wikimedia.org/r/811578 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[05:41:10] <wikibugs>	 ops-codfw, decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (Marostegui)
[05:41:47] <wikibugs>	 ops-codfw, decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (Marostegui)
[05:42:18] <wikibugs>	 ops-codfw, decommission-hardware: decommission db2077 - https://phabricator.wikimedia.org/T312191 (Marostegui)
[05:42:20] <wikibugs>	 ops-codfw, decommission-hardware: decommission db2076 - https://phabricator.wikimedia.org/T312190 (Marostegui)
[05:45:56] <marostegui>	 !log dbmaint x1@eqiad T312161
[05:45:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:59] <stashbot>	 T312161: Adjust the field type of cx_lists.cxl_start_time/cxl_end_time to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312161
[05:46:12] <marostegui>	 !log dbmaint s3@eqiad T312161
[05:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:55] <marostegui>	 !log dbmaint s3@eqiad T312162
[05:48:57] <marostegui>	 !log dbmaint x1@eqiad T312162
[05:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:58] <stashbot>	 T312162: Adjust the field type of cx_notification_log.cxn_date/cxn_newest to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312162
[05:49:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After restart', diff saved to https://phabricator.wikimedia.org/P30872 and previous config saved to /var/cache/conftool/dbconfig/20220706-055127-root.json
[05:51:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:44] <wikibugs>	 (PS3) Giuseppe Lavagetto: mediawiki: install php7.4 on jobrunners [puppet] - https://gerrit.wikimedia.org/r/808910 (https://phabricator.wikimedia.org/T311386)
[06:01:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (ayounsi) Agreed!  However,   >>! In T305414#8034695, @Jclark-ctr wrote: > cloudweb1003  c8 u39     20220099   port 10 (cloudsw2-c8-eqiad) > clo...
[06:06:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P30873 and previous config saved to /var/cache/conftool/dbconfig/20220706-060631-root.json
[06:06:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:04] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: install php7.4 on jobrunners [puppet] - https://gerrit.wikimedia.org/r/808910 (https://phabricator.wikimedia.org/T311386) (owner: Giuseppe Lavagetto)
[06:16:50] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:21:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P30874 and previous config saved to /var/cache/conftool/dbconfig/20220706-062135-root.json
[06:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:47] <XioNoX>	 marostegui: is your work impacting thanos-fe2003 ? it's saturating lvs2009
[06:30:30] <ebernhardson>	 XioNoX: hmm, i just started a snapshot from elastic2* to thanos-swift.discovery.wmnet
[06:30:41] <XioNoX>	 actually it's a ton of elastic hosts in codfw flooding lvs2009
[06:30:43] <XioNoX>	 ebernhardson:  :)
[06:30:52] <ebernhardson>	 lemme see if i can back that off, sec
[06:30:54] <wikibugs>	 (CR) Giuseppe Lavagetto: sre: add php busy workers alerts for parsoid, jobrunners (1 comment) [alerts] - https://gerrit.wikimedia.org/r/797313 (owner: Giuseppe Lavagetto)
[06:30:58] <marostegui>	 XioNoX: nop
[06:31:04] <XioNoX>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2009&viewPanel=8
[06:31:18] <wikibugs>	 (PS3) Giuseppe Lavagetto: mediawiki: install php7.4 on the maintenance server [puppet] - https://gerrit.wikimedia.org/r/808911 (https://phabricator.wikimedia.org/T311386)
[06:31:48] <XioNoX>	 https://librenms.wikimedia.org/device/device=94/tab=port/port=21632/
[06:31:53] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: install php7.4 on the maintenance server [puppet] - https://gerrit.wikimedia.org/r/808911 (https://phabricator.wikimedia.org/T311386) (owner: Giuseppe Lavagetto)
[06:33:29] <wikibugs>	 (CR) Ayounsi: netops: add DNS probes alerts (2 comments) [alerts] - https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[06:36:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P30875 and previous config saved to /var/cache/conftool/dbconfig/20220706-063639-root.json
[06:36:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:45] <ebernhardson>	 XioNoX: it should be calming itself down now
[06:37:55] <XioNoX>	 ebernhardson: yep, got the recovery, thanks!
[06:39:19] <ebernhardson>	 i wonder if high-bandwidth-ish things (this is trying to move 1.5TB between clusters) would be better avoiding lvs? i suppose there isn't much other option though
[06:43:51] <wikibugs>	 (CR) Jcrespo: [C: +2] Add new user for dbbackups database for django dashboard [puppet] - https://gerrit.wikimedia.org/r/810885 (https://phabricator.wikimedia.org/T283017) (owner: Jcrespo)
[06:44:40] <XioNoX>	 ebernhardson: LVS should have more bandwidth than hosts they front :)
[06:47:04] <ebernhardson>	 well i was thinking more that lvs architecture is more specialized for small requests and arbitrary responses, but here i'm trying to push 1.5TB through lvs
[06:47:12] <XioNoX>	 ebernhardson: but yeah for now the short term workaround is to bypass it or rate limit it
[06:48:02] <ebernhardson>	 i've adjusted the rate limit for now, it was 40mb * 32 shards, bumped down to 20mb * 32
[06:48:11] <XioNoX>	 thanks!
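A rough sketch of the arithmetic behind the rate-limit change discussed above. It assumes that "40mb" and "20mb" mean megabytes per second per shard, that all 32 shards stream concurrently, and that decimal units apply; none of this is confirmed in the channel, and the helper names are hypothetical.

```python
MB = 10**6   # decimal megabyte (assumption: "mb" in the chat means MB/s)
TB = 10**12  # decimal terabyte


def aggregate_rate(per_shard_mb_s, shards):
    """Total bytes per second if every shard streams at its cap at once."""
    return per_shard_mb_s * shards * MB


def transfer_seconds(total_bytes, rate_bytes_s):
    """Lower bound on transfer time at a sustained aggregate rate."""
    return total_bytes / rate_bytes_s


if __name__ == "__main__":
    snapshot = 1.5 * TB  # size mentioned in the conversation
    for label, per_shard in (("before", 40), ("after", 20)):
        rate = aggregate_rate(per_shard, 32)
        minutes = transfer_seconds(snapshot, rate) / 60
        gbit = rate * 8 / 10**9
        print(f"{label}: {rate // MB} MB/s ({gbit:.2f} Gbit/s), "
              f"at least {minutes:.1f} min for 1.5 TB")
```

Under these assumptions, halving the per-shard cap halves the worst-case aggregate load on the path and doubles the minimum transfer time.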
[06:51:31] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1002 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:51:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P30876 and previous config saved to /var/cache/conftool/dbconfig/20220706-065143-root.json
[06:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
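The db1132 repooling commits logged between 05:06 and 06:51 step through 1%, 2%, 5%, 10%, 25%, 50%, 75% and 100% at roughly 15-minute intervals. A minimal sketch of such a stepped ramp follows; the `ramp_schedule` helper and the fixed 15-minute interval are illustrative assumptions, and the code does not call dbctl or reproduce how the actual repooling was driven.

```python
from datetime import datetime, timedelta

# Pooling percentages as they appear in the logged dbctl commits.
STEPS = [1, 2, 5, 10, 25, 50, 75, 100]


def ramp_schedule(start, interval=timedelta(minutes=15), steps=STEPS):
    """Return (time, percentage) pairs for a stepped repool (hypothetical helper)."""
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


if __name__ == "__main__":
    # Start time taken from the first logged commit (05:06:16 on 2022-07-06).
    for when, pct in ramp_schedule(datetime(2022, 7, 6, 5, 6, 16)):
        print(f"{when:%H:%M:%S} pool db1132 @ {pct}%")
```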
[06:53:34] <wikibugs>	 (PS4) Jcrespo: bacula::storage: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810961 (owner: Muehlenhoff)
[06:58:13] <wikibugs>	 (CR) Jcrespo: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36196/backup1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/810961 (owner: Muehlenhoff)
[06:58:24] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: introduce blackbox::module [puppet] - https://gerrit.wikimedia.org/r/811295 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[06:58:31] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: move blackbox http check to prometheus::rule [puppet] - https://gerrit.wikimedia.org/r/811294 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[06:58:39] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: deploy custom probedown alerts [puppet] - https://gerrit.wikimedia.org/r/811241 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[06:58:42] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: deploy alerts as yml not yaml [puppet] - https://gerrit.wikimedia.org/r/811242 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[06:58:45] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: switch to blackbox::module [puppet] - https://gerrit.wikimedia.org/r/811296 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[06:58:45] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1001 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:59:24] <wikibugs>	 (PS5) Jcrespo: bacula::storage: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810961 (owner: Muehlenhoff)
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T0700).
[07:00:05] <jouncebot>	 kemayo: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:59] <Kemayo>	 I am, indeed, around.
[07:05:18] <Amir1>	 Kemayo: can you self-service?
[07:05:36] <Kemayo>	 Amir1: I don't think I have the relevant permissions.
[07:06:04] <Amir1>	 ok
[07:06:11] <Amir1>	 backports would take a bit
[07:06:24] <wikibugs>	 (CR) Ladsgroup: [C: +2] Revert "Hide the lede section on mobile when DiscussionTools is enabled" [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/811406 (https://phabricator.wikimedia.org/T312177) (owner: DLynch)
[07:07:30] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:07:32] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1006 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:08:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P30878 and previous config saved to /var/cache/conftool/dbconfig/20220706-070835-ladsgroup.json
[07:08:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:46] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db2160 [puppet] - https://gerrit.wikimedia.org/r/811581 (https://phabricator.wikimedia.org/T311493)
[07:09:49] <wikibugs>	 (PS2) Filippo Giunchedi: netops: add DNS probes alerts [alerts] - https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860)
[07:09:51] <wikibugs>	 (PS1) Filippo Giunchedi: sre: expand comments re: probes alerts and puppet [alerts] - https://gerrit.wikimedia.org/r/811582
[07:10:44] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you for the reviews!" [alerts] - https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[07:10:54] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db2160 [puppet] - https://gerrit.wikimedia.org/r/811581 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:11:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mysql-misc in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:11:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P30879 and previous config saved to /var/cache/conftool/dbconfig/20220706-071157-ladsgroup.json
[07:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:05] <wikibugs>	 (Merged) jenkins-bot: Revert "Hide the lede section on mobile when DiscussionTools is enabled" [extensions/DiscussionTools] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/811406 (https://phabricator.wikimedia.org/T312177) (owner: DLynch)
[07:12:08] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/810961 (owner: Muehlenhoff)
[07:14:38] <Amir1>	 Kemayo: It's live in mwdebug1002
[07:14:44] <Amir1>	 do you know how to test it?
[07:15:03] <wikibugs>	 (PS2) Filippo Giunchedi: sre: expand comments re: probes alerts and puppet [alerts] - https://gerrit.wikimedia.org/r/811582
[07:15:49] <wikibugs>	 (CR) Jcrespo: [C: +2] bacula::storage: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810961 (owner: Muehlenhoff)
[07:15:51] <Kemayo>	 Amir1: I do, one second
[07:16:48] <Kemayo>	 Amir1: Looks good!
[07:16:57] <Amir1>	 awesome, gonna sync
[07:17:45] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: expand comments re: probes alerts and puppet [alerts] - https://gerrit.wikimedia.org/r/811582 (owner: Filippo Giunchedi)
[07:19:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:19:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:40] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:19:40] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:19:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:19:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:42] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/DiscussionTools/modules/dt.init.less: Backport: [[gerrit:811406|Revert "Hide the lede section on mobile when DiscussionTools is enabled" (T312177)]] (duration: 03m 37s)
[07:20:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:20:44] <stashbot>	 T312177: mediawiki.org Main Page issue on mobile - https://phabricator.wikimedia.org/T312177
[07:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:59] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Ladsgroup) Disabling P_S increased number of concurrent connections I could make (700*) but still started to throw the same errors...
[07:21:58] <godog>	 mmhh there might be some P A G E alerts coming in, in the alert text only, not actual oncall pages tho
[07:22:08] <Kemayo>	 Amir1: Thanks for the help!
[07:22:35] <Amir1>	 Kemayo: thank you for building this awesome tool. I just hit shiny buttons and copy pasted stuff
[07:23:06] <Kemayo>	 😂
[07:23:31] <RhinosF1>	 godog: https://phabricator.wikimedia.org/T312194 got created too
[07:23:57] <godog>	 RhinosF1: ah thank you, yeah that makes sense!
[07:24:26] <RhinosF1>	 godog: np
[07:24:48] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1005 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:25:04] <RhinosF1>	 Alert went off with the text in -serviceops for gitlab
[07:27:21] <godog>	 ah that explains why we didn't see the P A G E here
[07:28:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2024.codfw.wmnet with reason: Remove node for reimage
[07:28:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2024.codfw.wmnet with reason: Remove node for reimage
[07:28:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:16] <RhinosF1>	 godog: there was only 1. Not the list on the task.
[07:29:45] <RhinosF1>	 I think you're in #wikimedia-serviceops though so you can see
[07:29:51] <wikibugs>	 (CR) Ayounsi: [C: +1] "Nice!" [alerts] - https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[07:30:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1135, if anything breaks, it's marostegui's fault (T311106)', diff saved to https://phabricator.wikimedia.org/P30880 and previous config saved to /var/cache/conftool/dbconfig/20220706-073052-ladsgroup.json
[07:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:57] <stashbot>	 T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106
[07:30:58] <marostegui>	 XD
[07:31:34] <godog>	 RhinosF1: yeah, it was one notification, though notice the (8) in the text, meaning 8 alerts are actually firing
[07:31:53] <RhinosF1>	 I see!
[07:31:58] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] netops: add DNS probes alerts [alerts] - https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[07:32:02] <wikibugs>	 (PS3) Filippo Giunchedi: netops: add DNS probes alerts [alerts] - https://gerrit.wikimedia.org/r/811207 (https://phabricator.wikimedia.org/T169860)
[07:32:26] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:32:39] <marostegui>	 haproxy alerts are expected
[07:33:16] <vgutierrez>	 uh? ack!
[07:35:27] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki: install php 7.4 on all maintenance servers [puppet] - https://gerrit.wikimedia.org/r/811585 (https://phabricator.wikimedia.org/T311386)
[07:36:09] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: install php 7.4 on all maintenance servers [puppet] - https://gerrit.wikimedia.org/r/811585 (https://phabricator.wikimedia.org/T311386) (owner: Giuseppe Lavagetto)
[07:40:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 10%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30881 and previous config saved to /var/cache/conftool/dbconfig/20220706-074028-ladsgroup.json
[07:40:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:32] <stashbot>	 T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106
[07:40:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30882 and previous config saved to /var/cache/conftool/dbconfig/20220706-074051-ladsgroup.json
[07:40:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2024.codfw.wmnet with OS bullseye
[07:42:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:09] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS bullseye
[07:42:32] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (31) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, elastic2049, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003, thumbor1006, thumbor2003, thumbor2004, thumbor2005, thumbor2006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[07:43:53] <wikibugs>	 (CR) Muehlenhoff: [C: +1] admin: add gitlab-roots group to gitlab_runner role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: Dzahn)
[07:45:54] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) >>! In T311106#8054391, @Ladsgroup wrote: > Disabling P_S increased number of concurrent connections I could make (700*...
[07:47:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1143', diff saved to https://phabricator.wikimedia.org/P30883 and previous config saved to /var/cache/conftool/dbconfig/20220706-074721-root.json
[07:47:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:32] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) ` ===== NODE GROUP ===== (4) db[1111,1127,1132,1143].eqiad.wmnet ----- OUTPUT of 'sudo mysql -e "s...rformance_schema'...
[07:52:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30884 and previous config saved to /var/cache/conftool/dbconfig/20220706-075206-root.json
[07:52:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30885 and previous config saved to /var/cache/conftool/dbconfig/20220706-075211-root.json
[07:52:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30886 and previous config saved to /var/cache/conftool/dbconfig/20220706-075224-root.json
[07:52:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30887 and previous config saved to /var/cache/conftool/dbconfig/20220706-075532-ladsgroup.json
[07:55:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:36] <stashbot>	 T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106
[07:55:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30888 and previous config saved to /var/cache/conftool/dbconfig/20220706-075555-ladsgroup.json
[07:55:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:14] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1003 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:56:18] <icinga-wm>	 RECOVERY - Check systemd state on mw2378 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:40] <wikibugs>	 (PS3) Giuseppe Lavagetto: mediawiki: install php7.4 on all appservers [puppet] - https://gerrit.wikimedia.org/r/808912 (https://phabricator.wikimedia.org/T311386)
[07:58:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2024.codfw.wmnet with reason: host reimage
[07:58:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] <jouncebot>	 jnuche and dduvall: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T0800).
[08:01:27] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1028.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:01:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:44] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1028.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:02:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2024.codfw.wmnet with reason: host reimage
[08:02:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:35] <wikibugs>	 (PS1) Jaime Nuche: group1 wikis to 1.39.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/811589 (https://phabricator.wikimedia.org/T308072)
[08:03:37] <wikibugs>	 (CR) Jaime Nuche: [C: +2] group1 wikis to 1.39.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/811589 (https://phabricator.wikimedia.org/T308072) (owner: Jaime Nuche)
[08:04:20] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.39.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/811589 (https://phabricator.wikimedia.org/T308072) (owner: Jaime Nuche)
[08:07:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30889 and previous config saved to /var/cache/conftool/dbconfig/20220706-080710-root.json
[08:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30890 and previous config saved to /var/cache/conftool/dbconfig/20220706-080715-root.json
[08:07:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30891 and previous config saved to /var/cache/conftool/dbconfig/20220706-080728-root.json
[08:07:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:55] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1029.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:32] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.19  refs T308072
[08:08:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:35] <stashbot>	 T308072: 1.39.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T308072
[08:09:14] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1029.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:09:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:19] <icinga-wm>	 RECOVERY - puppet last run on ms-be1029 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:10:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 75%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30892 and previous config saved to /var/cache/conftool/dbconfig/20220706-081036-ladsgroup.json
[08:10:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:41] <stashbot>	 T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106
[08:10:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30893 and previous config saved to /var/cache/conftool/dbconfig/20220706-081059-ladsgroup.json
[08:11:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:11:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:05] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:12:12] <logmsgbot>	 !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.19  refs T308072 (duration: 03m 39s)
[08:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:12:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:12:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:13:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:56] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1030.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:14:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:15] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1030.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:16:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:59] <icinga-wm>	 RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:18:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2024.codfw.wmnet with OS bullseye
[08:18:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:50] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS bullseye completed: - ganeti2024 (**PASS**)   - Downtimed on...
[08:20:13] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1031.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:20:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:45] <wikibugs>	 (CR) Slavina Stefanova: novafullstack: Refactor and minor fix (2 comments) [puppet] - https://gerrit.wikimedia.org/r/811316 (owner: David Caro)
[08:20:49] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi) a:Papaul @papaul, nice! We should keep all the same switch's uplinks on the same breakout cable: So instead of doing: 0/0 - asw2-c-eqiad:xe-2/0/[44...
[08:21:29] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1031.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:21:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30894 and previous config saved to /var/cache/conftool/dbconfig/20220706-082214-root.json
[08:22:15] <icinga-wm>	 RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:22:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30895 and previous config saved to /var/cache/conftool/dbconfig/20220706-082219-root.json
[08:22:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30896 and previous config saved to /var/cache/conftool/dbconfig/20220706-082232-root.json
[08:22:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1032.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:23:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:02] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1032.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:25:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:37] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1033.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 100%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30897 and previous config saved to /var/cache/conftool/dbconfig/20220706-082540-ladsgroup.json
[08:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:51] <stashbot>	 T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106
[08:26:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Test done (T311106)', diff saved to https://phabricator.wikimedia.org/P30898 and previous config saved to /var/cache/conftool/dbconfig/20220706-082603-ladsgroup.json
[08:26:05] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:55] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1033.eqiad.wmnet: Renew puppet certificate - elukey@cumin1001
[08:26:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:47] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:28:42] <wikibugs>	 SRE, Icinga, Observability-Alerting: PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (Volans) I see that now the crontab entries are: ` */5 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga --tries 5 --sleep 60 alert2001.wiki...
[08:30:32] <icinga-wm>	 RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:31:51] <wikibugs>	 (PS1) Urbanecm: GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - https://gerrit.wikimedia.org/r/811661
[08:31:53] <wikibugs>	 (PS1) Urbanecm: [beta] Remove overrides for GrowthExperiments enable percentage [mediawiki-config] - https://gerrit.wikimedia.org/r/811662
[08:33:25] <phuedx>	 jnuche: Is the train deployment complete? Any objections if I merge a Beta Cluster only config patch?
[08:34:08] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:34:55] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (jcrespo) > Disable P_S on all the sX hosts that run 10.6  Note disabling P_S on production hosts will break the query killer.
[08:35:33] <wikibugs>	 (PS4) Giuseppe Lavagetto: mediawiki: install php7.4 on all appservers [puppet] - https://gerrit.wikimedia.org/r/808912 (https://phabricator.wikimedia.org/T311386)
[08:36:33] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1029.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30899 and previous config saved to /var/cache/conftool/dbconfig/20220706-083718-root.json
[08:37:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30900 and previous config saved to /var/cache/conftool/dbconfig/20220706-083723-root.json
[08:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30901 and previous config saved to /var/cache/conftool/dbconfig/20220706-083736-root.json
[08:37:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:51] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1029.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:37:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:54] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1030.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:00] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: install php7.4 on all appservers [puppet] - https://gerrit.wikimedia.org/r/808912 (https://phabricator.wikimedia.org/T311386) (owner: Giuseppe Lavagetto)
[08:39:10] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1030.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:39:12] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1031.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:09] <wikibugs>	 (CR) Kosta Harlan: [C: +1] [beta] Remove overrides for GrowthExperiments enable percentage [mediawiki-config] - https://gerrit.wikimedia.org/r/811662 (owner: Urbanecm)
[08:40:17] <wikibugs>	 (CR) Kosta Harlan: [C: +1] GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - https://gerrit.wikimedia.org/r/811661 (owner: Urbanecm)
[08:40:29] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1031.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:40:30] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1032.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:40:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:42] <wikibugs>	 (CR) Slavina Stefanova: novafullstack: Refactor and minor fix (2 comments) [puppet] - https://gerrit.wikimedia.org/r/811316 (owner: David Caro)
[08:41:46] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1032.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:41:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:48] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1033.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:41:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:56] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) >>! In T311106#8054563, @jcrespo wrote: >> Disable P_S on all the sX hosts that run 10.6 >  > Note disabling P_S on pro...
[08:43:10] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1033.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:43:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:40] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:46] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:44:02] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/811270 (owner: David Caro)
[08:44:22] <wikibugs>	 (PS1) Urbanecm: [beta] GrowthExperiments: Remove variables that are primarily set on-wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/811663
[08:44:24] <wikibugs>	 (PS1) Urbanecm: GrowthExperiments: Remove wgGEHomepageTutorialTitle [mediawiki-config] - https://gerrit.wikimedia.org/r/811664
[08:44:27] <wikibugs>	 (CR) David Caro: [C: +2] Revert "profile::mariadb::packages_wmf: Remove support for stretch" [puppet] - https://gerrit.wikimedia.org/r/811270 (owner: David Caro)
[08:45:15] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui)
[08:46:13] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2028.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:46:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:22] <wikibugs>	 (CR) David Caro: [C: +2] toolsdb: enable pt-heartbeat [puppet] - https://gerrit.wikimedia.org/r/763584 (owner: Majavah)
[08:47:32] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2028.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:47:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:34] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2029.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:47:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:54] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2029.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:48:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:56] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2030.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:38] <icinga-wm>	 RECOVERY - puppet last run on ms-be2028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:50:18] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2030.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:20] <jnuche>	 phuedx: hi, yeah, the train deployment is finished, please go ahead
[08:50:20] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2031.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:50:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:33] <phuedx>	 jnuche: Thanks!
[08:51:39] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2031.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:51:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:41] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2032.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:51:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30902 and previous config saved to /var/cache/conftool/dbconfig/20220706-085221-root.json
[08:52:24] <icinga-wm>	 RECOVERY - puppet last run on ms-be2030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:52:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:27] <wikibugs>	 (PS1) Slyngshede: Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - https://gerrit.wikimedia.org/r/811667
[08:52:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30903 and previous config saved to /var/cache/conftool/dbconfig/20220706-085227-root.json
[08:52:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:38] <icinga-wm>	 RECOVERY - puppet last run on ms-be2031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:52:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30904 and previous config saved to /var/cache/conftool/dbconfig/20220706-085240-root.json
[08:52:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:57] <phuedx>	 Since it's BC-only, I'll merge it, let it roll out to the Beta Cluster, and then pull it onto the deployment host
[08:53:02] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2032.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:53:04] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2033.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:53:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:07] <wikibugs>	 (CR) CI reject: [V: -1] Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - https://gerrit.wikimedia.org/r/811667 (owner: Slyngshede)
[08:54:23] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2033.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:54:25] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2034.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:14] <icinga-wm>	 RECOVERY - puppet last run on ms-be2033 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:55:45] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2034.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:55:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:29] <wikibugs>	 (CR) Phuedx: [C: +2] beta: Add mediawiki.web_ui.interactions event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/811322 (https://phabricator.wikimedia.org/T311268) (owner: Phuedx)
[08:57:23] <wikibugs>	 (Merged) jenkins-bot: beta: Add mediawiki.web_ui.interactions event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/811322 (https://phabricator.wikimedia.org/T311268) (owner: Phuedx)
[08:57:58] <wikibugs>	 (PS2) Slyngshede: Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - https://gerrit.wikimedia.org/r/811667
[08:58:34] <wikibugs>	 (CR) CI reject: [V: -1] Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - https://gerrit.wikimedia.org/r/811667 (owner: Slyngshede)
[08:58:48] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2036.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[08:58:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:07] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2036.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:00:09] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2037.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:38] <wikibugs>	 (PS3) Slyngshede: Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - https://gerrit.wikimedia.org/r/811667
[09:01:17] <wikibugs>	 (CR) CI reject: [V: -1] Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - https://gerrit.wikimedia.org/r/811667 (owner: Slyngshede)
[09:01:30] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2037.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:01:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:32] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2038.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:01:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:54] <phuedx>	 I've updated the deployment host prior to the next backport window. I'm testing on the Beta Cluster now
[09:02:00] <icinga-wm>	 RECOVERY - puppet last run on ms-be2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:02:06] <wikibugs>	 (PS4) Slyngshede: Add Nagios script for monitoring the Dell PERC RAID controller. [puppet] - https://gerrit.wikimedia.org/r/811667
[09:02:15] <wikibugs>	 (CR) David Caro: [C: +1] "Adding Moritz and Slyngshede to review the repos side, the kubeadm looks good :+1:" [puppet] - https://gerrit.wikimedia.org/r/802143 (owner: Majavah)
[09:02:53] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2038.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:02:55] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be2039.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:02:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:03:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:16] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be2039.codfw.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:04:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:04:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:05:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:09] <wikibugs>	 (CR) JMeybohm: [C: +1] sre: add php busy workers alerts for parsoid, jobrunners (1 comment) [alerts] - https://gerrit.wikimedia.org/r/797313 (owner: Giuseppe Lavagetto)
[09:06:11] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/802143 (owner: Majavah)
[09:06:38] <urbanecm>	 phuedx: i assume you're done with your beta cluster deployment?
[09:06:55] <urbanecm>	 (i'd like to do my own)
[09:07:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30905 and previous config saved to /var/cache/conftool/dbconfig/20220706-090725-root.json
[09:07:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:30] <wikibugs>	 (CR) David Caro: [C: +2] kubeadm: drop support for 1.20 [puppet] - https://gerrit.wikimedia.org/r/802143 (owner: Majavah)
[09:07:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30906 and previous config saved to /var/cache/conftool/dbconfig/20220706-090731-root.json
[09:07:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:38] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/802143 (owner: Majavah)
[09:07:41] <phuedx>	 urbanecm: Yes :)
[09:07:43] <phuedx>	 All yours
[09:07:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30907 and previous config saved to /var/cache/conftool/dbconfig/20220706-090744-root.json
[09:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:49] <urbanecm>	 thanks!
[09:08:26] <wikibugs>	 (PS2) Urbanecm: [beta] GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - https://gerrit.wikimedia.org/r/811661
[09:08:30] <wikibugs>	 (PS3) Urbanecm: [beta] GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - https://gerrit.wikimedia.org/r/811661
[09:08:34] <wikibugs>	 (CR) Urbanecm: [C: +2] [beta] GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - https://gerrit.wikimedia.org/r/811661 (owner: Urbanecm)
[09:08:48] <wikibugs>	 (PS2) Urbanecm: [beta] Remove overrides for GrowthExperiments enable percentage [mediawiki-config] - https://gerrit.wikimedia.org/r/811662
[09:08:53] <wikibugs>	 (CR) Urbanecm: [C: +2] [beta] Remove overrides for GrowthExperiments enable percentage [mediawiki-config] - https://gerrit.wikimedia.org/r/811662 (owner: Urbanecm)
[09:09:06] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1035.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:09:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:45] <wikibugs>	 (Merged) jenkins-bot: [beta] GrowthExperiments: Remove redundant IS-labs definitions [mediawiki-config] - https://gerrit.wikimedia.org/r/811661 (owner: Urbanecm)
[09:09:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
[09:09:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:25] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1035.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:10:27] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1036.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:31] <wikibugs>	 (Merged) jenkins-bot: [beta] Remove overrides for GrowthExperiments enable percentage [mediawiki-config] - https://gerrit.wikimedia.org/r/811662 (owner: Urbanecm)
[09:10:58] * urbanecm done
[09:11:43] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1036.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:11:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:45] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1037.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:13] <wikibugs>	 (PS3) Aqu: [WIP] Build spark assembly for Spark3 [puppet] - https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578)
[09:13:04] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1037.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:13:06] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1038.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:13:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:35] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Build spark assembly for Spark3 [puppet] - https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: Aqu)
[09:14:24] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1038.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:14:26] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.puppet.renew-cert for ms-be1039.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:50] <wikibugs>	 (CR) JMeybohm: k8s: Add KubernetesNode.taints propertry (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[09:15:36] <wikibugs>	 (PS3) JMeybohm: Alert on helm releases in bad state [alerts] - https://gerrit.wikimedia.org/r/808968 (https://phabricator.wikimedia.org/T310714)
[09:15:44] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-be1039.eqiad.wmnet: Renew puppet certificate - mvernon@cumin1001
[09:15:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:15:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:16:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:16:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:10] <icinga-wm>	 RECOVERY - puppet last run on ms-be2037 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:17:14] <wikibugs>	 (CR) Jelto: [C: +2] site/hiera: remove gitlab2001 after decom [puppet] - https://gerrit.wikimedia.org/r/811362 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[09:17:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P30908 and previous config saved to /var/cache/conftool/dbconfig/20220706-091717-root.json
[09:17:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:17:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:49] <wikibugs>	 (PS2) Jelto: site/hiera: remove gitlab2001 after decom [puppet] - https://gerrit.wikimedia.org/r/811362 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[09:18:02] <wikibugs>	 (PS1) David Caro: distributions-wikimedia: add note to the docs [puppet] - https://gerrit.wikimedia.org/r/811671
[09:19:29] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "LGTM, see a couple of minor comments." [alerts] - https://gerrit.wikimedia.org/r/808968 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[09:20:01] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/811671 (owner: David Caro)
[09:21:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P30911 and previous config saved to /var/cache/conftool/dbconfig/20220706-092130-root.json
[09:21:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:04] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) After having a chat with Jaime: - db1132 got P_S enabled but with `performance-schema-instrument='memory/%=OFF'`
[09:22:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30912 and previous config saved to /var/cache/conftool/dbconfig/20220706-092229-root.json
[09:22:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30913 and previous config saved to /var/cache/conftool/dbconfig/20220706-092237-root.json
[09:22:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30914 and previous config saved to /var/cache/conftool/dbconfig/20220706-092248-root.json
[09:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:22:57] <wikibugs>	 (CR) David Caro: [C: +2] P:wmcs::metricsinfra::prometheus: enable thanos sidecar [puppet] - https://gerrit.wikimedia.org/r/806551 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[09:22:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:01] <wikibugs>	 (CR) David Caro: [C: +2] P:metricsinfra: add thanos query [puppet] - https://gerrit.wikimedia.org/r/806552 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[09:23:07] <wikibugs>	 (CR) David Caro: [C: +2] P:metricsinfra::haproxy: add thanos routing [puppet] - https://gerrit.wikimedia.org/r/806553 (https://phabricator.wikimedia.org/T286301) (owner: Majavah)
[09:23:19] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2024.codfw.wmnet
[09:23:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:39] <wikibugs>	 (CR) Hnowlan: [C: +1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: Vlad.shapik)
[09:23:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:23:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:23:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:12] <wikibugs>	 (PS4) David Caro: Remove hiera files for nonexistent Cloud VPS instances [puppet] - https://gerrit.wikimedia.org/r/767925 (owner: Majavah)
[09:24:30] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2024 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:24:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:04] <wikibugs>	 (CR) David Caro: [C: +2] distributions-wikimedia: add note to the docs [puppet] - https://gerrit.wikimedia.org/r/811671 (owner: David Caro)
[09:25:08] <wikibugs>	 (CR) CI reject: [V: -1] Remove hiera files for nonexistent Cloud VPS instances [puppet] - https://gerrit.wikimedia.org/r/767925 (owner: Majavah)
[09:25:16] <icinga-wm>	 RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[09:26:03] <wikibugs>	 (CR) David Caro: Remove hiera files for nonexistent Cloud VPS instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767925 (owner: Majavah)
[09:26:40] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:27:23] <wikibugs>	 (PS5) David Caro: Remove hiera files for nonexistent Cloud VPS instances [puppet] - https://gerrit.wikimedia.org/r/767925 (owner: Majavah)
[09:27:25] <wikibugs>	 (CR) David Caro: Remove hiera files for nonexistent Cloud VPS instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767925 (owner: Majavah)
[09:28:46] <wikibugs>	 (PS4) Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956)
[09:28:57] <wikibugs>	 (PS4) Aqu: [WIP] Build spark assembly for Spark3 [puppet] - https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578)
[09:32:44] <phuedx>	 Anyone familiar with the Beta Cluster config? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/811322 doesn't seem to be applying and I can't figure out why
[09:33:08] <phuedx>	 urbanecm maybe? ^
[09:33:27] <urbanecm>	 phuedx: what does "doesn't seem to be applying" mean?
[09:33:33] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:32] <phuedx>	 urbanecm: The stream that I added isn't visible in the output of https://en.wikipedia.beta.wmflabs.org/w/api.php?action=streamconfigs&format=json&all_settings=1, say
[09:34:36] <wikibugs>	 (PS1) Jelto: wikimedia.org: remove gitlab-replica-old.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/811674 (https://phabricator.wikimedia.org/T307142)
[09:35:06] <phuedx>	 Hrrm
[09:36:13] <urbanecm>	 i see
[09:36:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P30915 and previous config saved to /var/cache/conftool/dbconfig/20220706-093634-root.json
[09:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:07] <urbanecm>	 at the very least, i can confirm the config is on the servers themselves
[09:37:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30916 and previous config saved to /var/cache/conftool/dbconfig/20220706-093733-root.json
[09:37:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30917 and previous config saved to /var/cache/conftool/dbconfig/20220706-093741-root.json
[09:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P30918 and previous config saved to /var/cache/conftool/dbconfig/20220706-093752-root.json
[09:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:35] <wikibugs>	 SRE-swift-storage, Observability-Metrics: Investigate HW requirements for Thanos frontend - https://phabricator.wikimedia.org/T312201 (LSobanski)
[09:41:42] <wikibugs>	 (CR) JMeybohm: Alert on helm releases in bad state (2 comments) [alerts] - https://gerrit.wikimedia.org/r/808968 (https://phabricator.wikimedia.org/T310714) (owner: JMeybohm)
[09:41:42] <phuedx>	 It looks like the +group2 bit isn't being merged in whereas the +enwiki bit is
[09:42:47] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:43:31] <wikibugs>	 (CR) Gergő Tisza: GrowthExperiments: Set GEImageRecommendationApiHandler (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[09:43:38] <wikibugs>	 (PS5) Aqu: [WIP] Build spark assembly for Spark3 [puppet] - https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578)
[09:45:12] <wikibugs>	 (CR) David Caro: novafullstack: Refactor and minor fix (3 comments) [puppet] - https://gerrit.wikimedia.org/r/811316 (owner: David Caro)
[09:46:02] <urbanecm>	 phuedx: yeah. not sure why though. i recommend trying to debug it locally (you can run `composer buildConfigCache` in your config repo, and files like `wmf-config/config-cache/conf-labs-enwiki.json` will then have the configuration as it will be seen by MediaWiki)
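A minimal sketch of the local check suggested above, assuming `composer buildConfigCache` has already been run and that each generated file is a JSON object mapping setting names to values. The helper names and both assumed shapes of `wgEventStreams` (a map keyed by stream name, or a list of entries with a `stream` key) are hypothetical and not confirmed by the log.

```python
import json
from pathlib import Path


def stream_is_configured(config, stream_name):
    """Return True if stream_name appears under wgEventStreams in config."""
    streams = config.get("wgEventStreams", {})
    if isinstance(streams, dict):
        # Assumed shape 1: a map keyed by stream name.
        return stream_name in streams
    # Assumed shape 2: a list of stream settings, each with a "stream" key.
    return any(
        isinstance(entry, dict) and entry.get("stream") == stream_name
        for entry in streams
    )


def check_cache_file(path, stream_name):
    """Load one generated cache file and look for stream_name in it."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return stream_is_configured(config, stream_name)
```

For example, `check_cache_file("wmf-config/config-cache/conf-labs-enwiki.json", "mediawiki.web_ui.interactions")` would report whether that stream made it into the beta enwiki configuration, provided the file has one of the assumed shapes.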
[09:46:25] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Remove hiera files for nonexistent Cloud VPS instances [puppet] - 10https://gerrit.wikimedia.org/r/767925 (owner: 10Majavah)
[09:48:17] <wikibugs>	 (03CR) 10JMeybohm: k8s: Retry checks for expected pods on drain (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[09:50:18] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[09:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:23] <urbanecm>	 my testing shows that the group2 bit is just ignored (might be because group2 doesn't really make any sense when talking about beta, but that's just a guess)
[09:51:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P30919 and previous config saved to /var/cache/conftool/dbconfig/20220706-095138-root.json
[09:51:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:14] <wikibugs>	 (03PS4) 10David Caro: wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543
[09:52:18] <wikibugs>	 (03PS3) 10David Caro: alerts: add a default duration of 1h [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810367
[09:52:22] <wikibugs>	 (03PS3) 10David Caro: wmcs.lib.openstack: move to a directory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810451
[09:52:26] <wikibugs>	 (03PS9) 10David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810368
[09:52:44] <volans>	 we just lost wikibugs
[09:53:17] <urbanecm>	 poor bot :(
[09:55:51] <phuedx>	 urbanecm: Agreed. Changing +group2 to +wikipedia, for example, fixes the problem
[09:56:03] <phuedx>	 Also, poor wikibugs
[09:58:16] <urbanecm>	 phuedx: in that case, an easy workaround would be group2 => wikipedia (has roughly the same meaning anyway). an alternative solution would be to introduce `wmgExtraEventStreams` and call `$wgEventStreams = array_merge( $wgEventStreams, $wmgExtraEventStreams )` in CommonSettings-labs.php (which'd be an equivalent of `+default`, if it was supported).
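The `array_merge` alternative described above can be modelled roughly as follows. This is a Python sketch of PHP's `array_merge` semantics for two inputs (string keys: later values win; lists: values are appended), not the actual mediawiki-config code. The base stream name `example.stream` and its settings are hypothetical placeholders.

```python
def array_merge(base, extra):
    """Approximate PHP array_merge for two arrays of the same kind."""
    if isinstance(base, dict) and isinstance(extra, dict):
        # String-keyed arrays: entries from `extra` override `base`.
        merged = dict(base)
        merged.update(extra)
        return merged
    # List-like arrays: values from `extra` are appended after `base`.
    return list(base) + list(extra)


# Stand-in for $wgEventStreams as resolved from the production settings.
event_streams = {"example.stream": {"schema_title": "example"}}

# Stand-in for the proposed $wmgExtraEventStreams (beta-only additions).
extra_event_streams = {"mediawiki.web_ui.interactions": {}}

event_streams = array_merge(event_streams, extra_event_streams)
print(sorted(event_streams))
```

The point of the design is that the beta-only additions live in their own setting and are layered on top of whatever the normal configuration resolved to, so no `+group2` style merge key is needed.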
[09:59:27] <wikibugs>	 (03PS1) 10Ayounsi: Remove cloudstore hosts from cloud ACL [homer/public] - 10https://gerrit.wikimedia.org/r/811676
[09:59:34] <volans>	 !log restarted wikibugs
[09:59:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:40] <urbanecm>	 welcome back, wikibugs 
[09:59:57] <urbanecm>	 and thanks volans for resuscitating them
[10:00:33] <volans>	 np :)
[10:00:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs.openstack.cloudgw: add reboot_node and roll_reboot_cloudgws [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810914 (owner: 10David Caro)
[10:02:30] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade.
[10:02:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:41] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] cloudvirt.safe_reboot: remove non-used openstack_api property [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/811675 (owner: 10David Caro)
[10:04:43] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 (owner: 10David Caro)
[10:04:49] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] alerts: add a default duration of 1h [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810367 (owner: 10David Caro)
[10:04:58] <wikibugs>	 (03PS1) 10Phuedx: beta: Correctly add mediawiki.web_ui.interactions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811678
[10:05:00] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs.lib.openstack: move to a directory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810451 (owner: 10David Caro)
[10:05:26] <phuedx>	 urbanecm: ^^ I've also tried to explain why I took the short route :)
[10:05:50] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "should work" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811678 (owner: 10Phuedx)
[10:06:31] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/811676 (owner: 10Ayounsi)
[10:06:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After restart', diff saved to https://phabricator.wikimedia.org/P30920 and previous config saved to /var/cache/conftool/dbconfig/20220706-100642-root.json
[10:06:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:08] <wikibugs>	 (03CR) 10Phuedx: "I've confirmed that the mediawiki.web_ui.interactions stream shows up in wgEventStreams and wgEventLoggingStreamNames in wmf-config/config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811678 (owner: 10Phuedx)
[10:07:35] <phuedx>	 Right. I'll merge that and get it on the deployment host
[10:08:24] <wikibugs>	 (03CR) 10Phuedx: [C: 03+2] beta: Correctly add mediawiki.web_ui.interactions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811678 (owner: 10Phuedx)
[10:09:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
[10:09:11] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Correctly add mediawiki.web_ui.interactions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811678 (owner: 10Phuedx)
[10:09:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:07] <wikibugs>	 (03Merged) 10jenkins-bot: wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 (owner: 10David Caro)
[10:11:12] <wikibugs>	 (03Merged) 10jenkins-bot: cloudvirt.safe_reboot: remove non-used openstack_api property [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/811675 (owner: 10David Caro)
[10:11:14] <wikibugs>	 (03Merged) 10jenkins-bot: alerts: add a default duration of 1h [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810367 (owner: 10David Caro)
[10:11:37] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811336 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:11:47] <wikibugs>	 (03CR) 10Jelto: "Added some comment inline." [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall)
[10:11:56] <wikibugs>	 (03Merged) 10jenkins-bot: wmcs.lib.openstack: move to a directory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810451 (owner: 10David Caro)
[10:12:48] <wikibugs>	 (03PS1) 10Muehlenhoff: bigtop::hadoop: All hosts use the new GID/UID scheme by now [puppet] - 10https://gerrit.wikimedia.org/r/811680
[10:12:51] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I'm not familiar with the underlying issue, but python wise LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/811331 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[10:13:21] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove cloudstore hosts from cloud ACL [homer/public] - 10https://gerrit.wikimedia.org/r/811676 (owner: 10Ayounsi)
[10:13:32] <wikibugs>	 (03CR) 10Slavina Stefanova: novafullstack: Refactor and minor fix (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro)
[10:13:54] <wikibugs>	 (03Merged) 10jenkins-bot: Remove cloudstore hosts from cloud ACL [homer/public] - 10https://gerrit.wikimedia.org/r/811676 (owner: 10Ayounsi)
[10:14:53] <icinga-wm>	 RECOVERY - Check systemd state on ganeti2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:57] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/811674 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto)
[10:15:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:15:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:16:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:21] <wikibugs>	 (03PS2) 10Majavah: openstack: horizon: remove enc url from hiera [puppet] - 10https://gerrit.wikimedia.org/r/800232
[10:19:50] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2024.codfw.wmnet
[10:19:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P30921 and previous config saved to /var/cache/conftool/dbconfig/20220706-102146-root.json
[10:21:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:11] <wikibugs>	 (03PS1) 10Muehlenhoff: installserver: Remove support for pre buster [puppet] - 10https://gerrit.wikimedia.org/r/811681
[10:22:49] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host dse-k8s-etcd1001.eqiad.wmnet
[10:22:50] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[10:22:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:05] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:27:03] <wikibugs>	 (03PS1) 10Muehlenhoff: ores: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/811682
[10:27:39] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:27:39] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-etcd1001.eqiad.wmnet on all recursors
[10:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:42] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-etcd1001.eqiad.wmnet on all recursors
[10:27:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:09] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36197/console" [puppet] - 10https://gerrit.wikimedia.org/r/811682 (owner: 10Muehlenhoff)
[10:30:18] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+1] ores: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/811682 (owner: 10Muehlenhoff)
[10:30:56] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1009.eqiad.wmnet
[10:30:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:34] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe2009.codfw.wmnet
[10:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:49] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/811682 (owner: 10Muehlenhoff)
[10:36:33] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:36:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P30923 and previous config saved to /var/cache/conftool/dbconfig/20220706-103650-root.json
[10:36:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:02] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: MM3/Postorius: Inconsistent translations for "Log In" in Spanish - https://phabricator.wikimedia.org/T312204 (10MarcoAurelio)
[10:37:20] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-etcd1001.eqiad.wmnet
[10:37:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:55] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host dse-k8s-etcd1002.eqiad.wmnet
[10:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:56] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[10:37:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:57] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1009.eqiad.wmnet
[10:38:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:57] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2009.codfw.wmnet
[10:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:11] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: allow discovery services to set custom rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan)
[10:40:14] <wikibugs>	 (03PS1) 10Volans: sre.ganeti.*: automatically get default group [cookbooks] - 10https://gerrit.wikimedia.org/r/811684
[10:40:49] <wikibugs>	 (03PS2) 10Volans: sre.ganeti.*: automatically get default group [cookbooks] - 10https://gerrit.wikimedia.org/r/811684
[10:42:28] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:42:28] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-etcd1002.eqiad.wmnet on all recursors
[10:42:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:31] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-etcd1002.eqiad.wmnet on all recursors
[10:42:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:04] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Switch image builds over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811344 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff)
[10:43:21] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Switch image reports over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811324 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff)
[10:43:36] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: MM3/Postorius: Inconsistent translations for "Log In" in Spanish - https://phabricator.wikimedia.org/T312204 (10MarcoAurelio)
[10:44:00] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan)
[10:44:17] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet
[10:44:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:04] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1028.eqiad.wmnet
[10:46:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:35] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:51:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P30925 and previous config saved to /var/cache/conftool/dbconfig/20220706-105154-root.json
[10:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:10] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-etcd1002.eqiad.wmnet
[10:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:09] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2028.codfw.wmnet
[10:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:35] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2024 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:54:53] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host dse-k8s-etcd1003.eqiad.wmnet
[10:54:55] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[10:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:26] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1028.eqiad.wmnet
[10:55:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:37] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2029.codfw.wmnet
[10:58:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:42] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1029.eqiad.wmnet
[10:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:49] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:39] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (32) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, elastic2049, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003, thumbor1002, thumbor1006, thumbor2003, thumbor2004, thumbor2005, thumbor2006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[11:03:21] <icinga-wm>	 PROBLEM - MariaDB Replica IO: m2 on db2078 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:03:23] <icinga-wm>	 PROBLEM - MariaDB Replica IO: m3 on db2078 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:03:27] <icinga-wm>	 PROBLEM - mysqld processes on db2078 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[11:03:27] <icinga-wm>	 PROBLEM - MariaDB Replica IO: m5 on db2078 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:03:51] <icinga-wm>	 PROBLEM - Check systemd state on db2078 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@m3.service,wmf_auto_restart_prometheus-mysqld-exporter@m5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:11] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:04:13] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: m3 on db2078 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:04:19] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m1 on db2078 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:04:23] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:04:43] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: m1 on db2078 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:04:55] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: m5 on db2078 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:04:57] <icinga-wm>	 PROBLEM - MariaDB read only m1 on db2078 is CRITICAL: Could not connect to localhost:3321 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:05:11] <icinga-wm>	 PROBLEM - MariaDB read only m2 on db2078 is CRITICAL: Could not connect to localhost:3322 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:05:23] <icinga-wm>	 PROBLEM - MariaDB read only m3 on db2078 is CRITICAL: Could not connect to localhost:3323 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:05:33] <icinga-wm>	 PROBLEM - MariaDB read only m5 on db2078 is CRITICAL: Could not connect to localhost:3325 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:05:39] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m5 on db2078 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:05:39] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: m2 on db2078 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:05:45] <icinga-wm>	 PROBLEM - MariaDB Replica IO: m1 on db2078 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:06:12] <RhinosF1>	 marostegui: ^ is expected isn't it? You said you were shutting it down earlier.
[11:06:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P30927 and previous config saved to /var/cache/conftool/dbconfig/20220706-110658-root.json
[11:07:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:09] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2029.codfw.wmnet
[11:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:19] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1029.eqiad.wmnet
[11:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:21] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:52] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2030.codfw.wmnet
[11:08:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:21] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1030.eqiad.wmnet
[11:09:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mysql-misc in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:15:48] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2030.codfw.wmnet
[11:15:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:27] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1030.eqiad.wmnet
[11:18:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:16] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] wikimedia.org: remove gitlab-replica-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/811674 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto)
[11:26:57] <wikibugs>	 (03PS3) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529)
[11:27:31] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:28:56] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2031.codfw.wmnet
[11:28:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:07] <wikibugs>	 (03CR) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[11:29:15] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1031.eqiad.wmnet
[11:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:07] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1031.eqiad.wmnet
[11:38:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:18] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2031.codfw.wmnet
[11:39:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:11] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2032.codfw.wmnet
[11:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:24] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1032.eqiad.wmnet
[11:42:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:22] <wikibugs>	 (03Abandoned) 10Jelto: site: remove gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/806864 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto)
[11:47:38] <wikibugs>	 (03PS4) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529)
[11:51:16] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1032.eqiad.wmnet
[11:51:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:25] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2032.codfw.wmnet
[11:52:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:59] <wikibugs>	 (03PS2) 10Jelto: gitlab/acme_chief: remove gitlab1001 from list of (passive) hosts [puppet] - 10https://gerrit.wikimedia.org/r/802822 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[11:54:05] <wikibugs>	 (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan)
[11:56:00] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "rebase + merge conflict in patch set 2. I removed gitlab2001, which is decommissioned now." [puppet] - 10https://gerrit.wikimedia.org/r/802822 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[11:56:45] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:05] <wikibugs>	 (03CR) 10Jelto: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[11:57:14] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[11:57:15] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2033.codfw.wmnet
[11:57:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:28] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1033.eqiad.wmnet
[11:57:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:35] <wikibugs>	 (03CR) 10Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: 10Kosta Harlan)
[11:59:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a helper function to query the disk type of a VM [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116)
[11:59:35] <wikibugs>	 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Make Spicerack cookbook to resize ganeti VM - https://phabricator.wikimedia.org/T219454 (10MoritzMuehlenhoff)
[11:59:39] <wikibugs>	 10SRE, 10Ganeti: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (10MoritzMuehlenhoff)
[12:00:33] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[12:03:47] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:20] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2033.codfw.wmnet
[12:05:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:39] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1033.eqiad.wmnet
[12:05:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:03] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2034.codfw.wmnet
[12:06:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:50] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1035.eqiad.wmnet
[12:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:51] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1035.eqiad.wmnet
[12:15:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:07] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2034.codfw.wmnet
[12:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:41] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8807.service,thumbor@8811.service,thumbor@8814.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:31] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:41] <wikibugs>	 (PS2) Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[12:20:43] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Thanks, looks great!" [cookbooks] - https://gerrit.wikimedia.org/r/811684 (owner: Volans)
[12:21:02] <wikibugs>	 (CR) Klausman: [C: +1] ores: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/811682 (owner: Muehlenhoff)
[12:21:51] <wikibugs>	 (CR) Muehlenhoff: [C: +2] ores: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/811682 (owner: Muehlenhoff)
[12:26:54] <wikibugs>	 (PS1) Muehlenhoff: profile::rsyslog::kubernetes: Remove stretch support [puppet] - https://gerrit.wikimedia.org/r/811699
[12:28:34] <wikibugs>	 (PS3) Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[12:28:40] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw
[12:28:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:35] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:35:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:36:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:39:55] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:12] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-codfw
[12:40:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:53] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[12:40:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:41:36] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:41:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-etcd1003.eqiad.wmnet on all recursors
[12:41:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:39] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-etcd1003.eqiad.wmnet on all recursors
[12:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:02] <wikibugs>	 (CR) David Caro: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/800232 (owner: Majavah)
[12:43:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudstore1008.wikimedia.org
[12:43:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:09] <wikibugs>	 (PS4) Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[12:45:17] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[12:46:46] <wikibugs>	 (PS1) Kosta Harlan: Add image-suggestion listener to service-proxy [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032)
[12:47:13] <wikibugs>	 (CR) Jbond: [C: +1] "sgtm" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/809132 (https://phabricator.wikimedia.org/T311235) (owner: Muehlenhoff)
[12:47:22] <wikibugs>	 (CR) CI reject: [V: -1] ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[12:48:10] <wikibugs>	 (CR) Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:48:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:48:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:04] <wikibugs>	 (PS10) Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032)
[12:49:05] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:49:14] <wikibugs>	 (PS11) Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032)
[12:49:18] <wikibugs>	 (CR) Majavah: [C: -1] "per https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports, the name should be image-suggestion-api and port 4009" [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:49:19] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2035.codfw.wmnet
[12:49:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:38] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1036.eqiad.wmnet
[12:49:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:59] <wikibugs>	 (PS2) Vlad.shapik: WIP: Adjust the online tests to new changes in the thumbor functionality [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/811257
[12:50:14] <wikibugs>	 (PS5) Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[12:50:18] <wikibugs>	 (CR) Alexandros Kosiaris: GrowthExperiments: Set GEImageRecommendationApiHandler (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:50:30] <wikibugs>	 (CR) CI reject: [V: -1] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:50:45] <wikibugs>	 (CR) Kosta Harlan: Add image-suggestion listener to service-proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:51:25] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:51:37] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudstore1008.wikimedia.org
[12:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:45] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `cloudstore1008.wikimedia.org` - cloudst...
[12:53:09] <wikibugs>	 (CR) CI reject: [V: -1] ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[12:53:56] <wikibugs>	 (PS12) Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032)
[12:54:09] <wikibugs>	 (CR) Jbond: "can we pause this until I'm back, I did some refactoring of the raid classes and have a feeling I was thinking of moving away from using the" [puppet] - https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: Muehlenhoff)
[12:56:01] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1036.eqiad.wmnet
[12:56:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:09] <wikibugs>	 (PS13) Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032)
[12:56:34] <wikibugs>	 (Abandoned) Jbond: C:monitoring: Add define for creating http checks [puppet] - https://gerrit.wikimedia.org/r/786365 (owner: Jbond)
[12:56:55] <wikibugs>	 (PS2) Kosta Harlan: Add image-suggestion listener to service-proxy [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032)
[12:57:02] <wikibugs>	 (CR) CI reject: [V: -1] GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:57:26] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Add image-suggestion listener to service-proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:57:43] <wikibugs>	 (CR) Kosta Harlan: Add image-suggestion listener to service-proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:57:47] <wikibugs>	 (PS6) Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[12:57:59] <wikibugs>	 (Abandoned) Jbond: O:nrpe: add check_http_wmf script [puppet] - https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) (owner: Jbond)
[12:58:09] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] Add image-suggestion listener to service-proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:58:19] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2035.codfw.wmnet
[12:58:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:49] <wikibugs>	 (PS3) Kosta Harlan: Add image-suggestion listener to service-proxy [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032)
[12:58:56] <wikibugs>	 (CR) Kosta Harlan: Add image-suggestion listener to service-proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:59:22] <wikibugs>	 (CR) Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[12:59:24] <wikibugs>	 (PS1) Ayounsi: cr: policy-options add missing return [homer/public] - https://gerrit.wikimedia.org/r/811706 (https://phabricator.wikimedia.org/T253194)
[12:59:31] <wikibugs>	 (CR) Muehlenhoff: Extend custom raid fact to support Perc 750 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: Muehlenhoff)
[12:59:33] <wikibugs>	 (PS14) Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032)
[12:59:57] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T1300).
[13:00:05] <jouncebot>	 kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] <Lucas_WMDE>	 o/
[13:00:14] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2036.codfw.wmnet
[13:00:17] <urbanecm>	 o/
[13:00:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "We need to set up a service proxy instance for the image-suggestion service first, then use that port here." [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[13:00:25] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1037.eqiad.wmnet
[13:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:31] <wikibugs>	 (CR) CI reject: [V: -1] ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[13:00:33] <wikibugs>	 (PS15) Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032)
[13:01:15] <Lucas_WMDE>	 is the config change ready to deploy? it says it Depends-On a puppet change that’s still open
[13:01:19] <kostajh>	 i'm here, but still sorting out some issues with my patch
[13:01:23] <Lucas_WMDE>	 ok
[13:01:27] <kostajh>	 which I'm doubtful about getting done now, but let's see
[13:01:41] <wikibugs>	 (PS1) Muehlenhoff: Remove puppet refs for cloudstore1008/cloudstore1009 [puppet] - https://gerrit.wikimedia.org/r/811711 (https://phabricator.wikimedia.org/T311844)
[13:02:42] <wikibugs>	 (CR) Ayounsi: [C: +2] "noop on the devices." [homer/public] - https://gerrit.wikimedia.org/r/811706 (https://phabricator.wikimedia.org/T253194) (owner: Ayounsi)
[13:03:15] <wikibugs>	 (CR) Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[13:03:27] <wikibugs>	 (PS7) Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[13:03:40] <wikibugs>	 (Merged) jenkins-bot: cr: policy-options add missing return [homer/public] - https://gerrit.wikimedia.org/r/811706 (https://phabricator.wikimedia.org/T253194) (owner: Ayounsi)
[13:03:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cloudstore1008.wikimedia.org
[13:03:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cloudstore1008.wikimedia.org
[13:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cloudstore1009.wikimedia.org
[13:04:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cloudstore1009.wikimedia.org
[13:04:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:06] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: cloudstore1008.wikimedia.org
[13:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:10] <wikibugs>	 (CR) Kosta Harlan: GrowthExperiments: Set GEImageRecommendationApiHandler (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/808208 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[13:04:13] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: cloudstore1009.wikimedia.org
[13:04:56] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Aline_Bruenger_WMDE)
[13:05:47] <kostajh>	 Lucas_WMDE urbanecm: can we give it another 10 minutes or so, as I'm getting some CR comments.
[13:06:07] <urbanecm>	 kostajh: no issues at all
[13:06:12] <Lucas_WMDE>	 I can’t deploy Puppet changes anyways, not sure if urbanecm can (and would be willing to)
[13:06:14] <wikibugs>	 (CR) CI reject: [V: -1] ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[13:06:39] <urbanecm>	 i can't do puppet changes (but happy to do the MW counterpart once puppet is resolved)
[13:06:55] <Lucas_WMDE>	 ok
[13:07:09] <kostajh>	 I'm talking with _j.oe_ about the puppet change in #wikimedia-sre
[13:07:46] <wikibugs>	 (PS2) Muehlenhoff: Remove puppet refs for cloudstore1008/cloudstore1009 [puppet] - https://gerrit.wikimedia.org/r/811711 (https://phabricator.wikimedia.org/T311844)
[13:07:51] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Add image-suggestion listener to service-proxy [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[13:08:58] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2036.codfw.wmnet
[13:08:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:03] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/810867 (https://phabricator.wikimedia.org/T311999) (owner: Muehlenhoff)
[13:09:05] <wikibugs>	 (PS8) Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[13:09:17] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1037.eqiad.wmnet
[13:09:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:04] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2037.codfw.wmnet
[13:10:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:12] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1038.eqiad.wmnet
[13:10:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:21] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Add image-suggestion listener to service-proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/811701 (https://phabricator.wikimedia.org/T306032) (owner: Kosta Harlan)
[13:10:25] <wikibugs>	 (CR) Jbond: [C: -1] P:mediawiki::scap_client: add parameter to indicate scap master (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807510 (https://phabricator.wikimedia.org/T310740) (owner: Jbond)
[13:11:04] <wikibugs>	 SRE-swift-storage: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (fgiunchedi) I did a brief analysis on space vs retention vs resolution:  | resolution | #samples | #series | bytes | -- | -- | -- | -- | | 0s | 29.1B | 4B | 40TB | 5m | 5.8B |  2.6B | 30TB | 1h  | 474.5M | 2.4B | 3.7TB...
[13:11:16] <kostajh>	 the puppet patch needs ~30 minutes to propagate. So, that is still within this window, but not sure about whether to go forward with this.
[13:11:30] <wikibugs>	 (CR) CI reject: [V: -1] ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: Klausman)
[13:17:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1132 (T311106)', diff saved to https://phabricator.wikimedia.org/P30930 and previous config saved to /var/cache/conftool/dbconfig/20220706-131715-ladsgroup.json
[13:17:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:19] <wikibugs>	 (CR) Ayounsi: "A couple comments then we're good! I had a look at what's running on netbox-next as well." [software/netbox-extras] - https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[13:17:20] <stashbot>	 T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106
[13:18:20] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm)
[13:18:48] <wikibugs>	 SRE-swift-storage, Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (MatthewVernon) Open→Resolved a:MatthewVernon Remaining nodes done by hand during reboots for T310483: ` mvernon@cumin1001:~$ sudo cumin...
[13:19:01] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1038.eqiad.wmnet
[13:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:11] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1039.eqiad.wmnet
[13:19:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:39] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2037.codfw.wmnet
[13:19:46] <kostajh>	 Lucas_WMDE / urbanecm: there are issues with the proxy, so let's leave this patch out for now and I'll look for another window to deploy it.
[13:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:47] <wikibugs>	 (CR) Volans: [C: +2] sre.ganeti.*: automatically get default group [cookbooks] - https://gerrit.wikimedia.org/r/811684 (owner: Volans)
[13:19:59] <urbanecm>	 ack
[13:20:00] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2038.codfw.wmnet
[13:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:23] <Lucas_WMDE>	 ok
[13:20:29] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:21:01] <icinga-wm>	 RECOVERY - MariaDB read only m1 on db2078 is OK: Version 10.4.25-MariaDB-log, Uptime 61s, read_only: True, event_scheduler: True, 20.73 QPS, connection latency: 0.003852s, query latency: 0.000360s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[13:21:11] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:21:17] <icinga-wm>	 RECOVERY - MariaDB read only m2 on db2078 is OK: Version 10.4.25-MariaDB-log, Uptime 67s, read_only: True, event_scheduler: True, 11.84 QPS, connection latency: 0.003672s, query latency: 0.000367s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[13:21:27] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:21:31] <icinga-wm>	 RECOVERY - MariaDB read only m3 on db2078 is OK: Version 10.4.25-MariaDB-log, Uptime 76s, read_only: True, event_scheduler: True, 12.84 QPS, connection latency: 0.003884s, query latency: 0.000317s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[13:21:37] <icinga-wm>	 RECOVERY - MariaDB read only m5 on db2078 is OK: Version 10.4.25-MariaDB-log, Uptime 75s, read_only: True, event_scheduler: True, 14.73 QPS, connection latency: 0.003573s, query latency: 0.000348s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[13:21:41] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:21:43] <wikibugs>	 (CR) Hnowlan: [C: +2] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: Vlad.shapik)
[13:21:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job mysql-misc in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:22:01] <icinga-wm>	 RECOVERY - mysqld processes on db2078 is OK: PROCS OK: 4 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[13:22:47] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: m3 on db2078 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:23:13] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: m1 on db2078 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:23:23] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: m5 on db2078 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:23:28] <wikibugs>	 (Merged) jenkins-bot: sre.ganeti.*: automatically get default group [cookbooks] - https://gerrit.wikimedia.org/r/811684 (owner: Volans)
[13:23:30] <wikibugs>	 (Merged) jenkins-bot: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: Vlad.shapik)
[13:24:07] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: m5 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:24:07] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: m2 on db2078 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:24:17] <icinga-wm>	 RECOVERY - MariaDB Replica IO: m1 on db2078 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:24:23] <wikibugs>	 (PS9) Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[13:24:25] <icinga-wm>	 RECOVERY - MariaDB Replica IO: m2 on db2078 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:24:27] <icinga-wm>	 RECOVERY - MariaDB Replica IO: m3 on db2078 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:24:33] <icinga-wm>	 RECOVERY - MariaDB Replica IO: m5 on db2078 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:25:25] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: m2 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:27:29] <wikibugs>	 (03PS10) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[13:28:11] <icinga-wm>	 RECOVERY - Check systemd state on ganeti2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:23] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1039.eqiad.wmnet
[13:28:23] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2038.codfw.wmnet
[13:28:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:35] <wikibugs>	 (03PS11) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[13:28:44] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2039.codfw.wmnet
[13:28:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:25] <wikibugs>	 (03PS12) 10Klausman: ml-services: add some more revscoring services to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195)
[13:30:13] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-etcd1003.eqiad.wmnet
[13:30:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:19] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:30:53] <wikibugs>	 (03PS1) 10Ayounsi: Add sre.network.configure-switch-interfaces to dcops sudo [puppet] - 10https://gerrit.wikimedia.org/r/811714
[13:31:53] <wikibugs>	 (03PS1) 10Majavah: prometheus: blackbox: don't deploy tls alerts when tls is disabled [puppet] - 10https://gerrit.wikimedia.org/r/811715
[13:31:56] <wikibugs>	 (03PS1) 10Majavah: prometheus: blackbox: support exporting modules for other instances [puppet] - 10https://gerrit.wikimedia.org/r/811716
[13:31:58] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717
[13:32:00] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::static: remove HTTPS enforcement [puppet] - 10https://gerrit.wikimedia.org/r/811718
[13:32:02] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::static: add blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/811719
[13:32:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus: blackbox: support exporting modules for other instances [puppet] - 10https://gerrit.wikimedia.org/r/811716 (owner: 10Majavah)
[13:34:02] <wikibugs>	 (03PS2) 10Majavah: prometheus: blackbox: support exporting modules for other instances [puppet] - 10https://gerrit.wikimedia.org/r/811716
[13:34:04] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717
[13:34:06] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::static: remove HTTPS enforcement [puppet] - 10https://gerrit.wikimedia.org/r/811718
[13:34:08] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::static: add blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/811719
[13:34:23] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36198/console" [puppet] - 10https://gerrit.wikimedia.org/r/811680 (owner: 10Muehlenhoff)
[13:35:27] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2039.codfw.wmnet
[13:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
[13:35:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717 (owner: 10Majavah)
[13:38:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:toolforge::static: remove HTTPS enforcement [puppet] - 10https://gerrit.wikimedia.org/r/811718 (owner: 10Majavah)
[13:39:13] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+1] bigtop::hadoop: All hosts use the new GID/UID scheme by now [puppet] - 10https://gerrit.wikimedia.org/r/811680 (owner: 10Muehlenhoff)
[13:40:06] <wikibugs>	 (03PS3) 10Majavah: P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717
[13:40:08] <wikibugs>	 (03PS3) 10Majavah: P:toolforge::static: remove HTTPS enforcement [puppet] - 10https://gerrit.wikimedia.org/r/811718
[13:40:10] <wikibugs>	 (03PS3) 10Majavah: P:toolforge::static: add blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/811719
[13:41:25] <wikibugs>	 (03PS6) 10Aqu: [WIP] Build spark assembly for Spark3 [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578)
[13:41:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717 (owner: 10Majavah)
[13:42:25] <wikibugs>	 (03PS4) 10Majavah: P:toolforge::prometheus: import blackbox checks from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/811717
[13:42:27] <wikibugs>	 (03PS4) 10Majavah: P:toolforge::static: remove HTTPS enforcement [puppet] - 10https://gerrit.wikimedia.org/r/811718
[13:42:29] <wikibugs>	 (03PS4) 10Majavah: P:toolforge::static: add blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/811719
[13:42:35] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul)
[13:44:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
[13:44:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:22] <elukey>	 urbanecm: o/
[13:44:29] <urbanecm>	 hi elukey!
[13:44:50] <elukey>	 sorry to bother, would you be available to help me to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/810007 if I sneak it in the deployment window? :)
[13:44:56] <elukey>	 (I haven't done it in a while)
[13:45:01] <wikibugs>	 10SRE, 10Image-Suggestions: Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225 (10kostajh)
[13:45:16] <wikibugs>	 10SRE, 10Image-Suggestions: Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225 (10kostajh)
[13:45:41] <urbanecm>	 elukey: sure thing. do you want to try the deployment yourself? https://deploy-commands.toolforge.org/bacc/810007 should be helpful :)
[13:46:21] <urbanecm>	 (if not, i can also deploy it for you)
[13:46:37] <elukey>	 ah wow
[13:46:58] <elukey>	 if you have time please go ahead, I'll study the link and try the next time :)
[13:47:02] <urbanecm>	 okay
[13:47:11] <elukey>	 <3 thanks
[13:47:12] <wikibugs>	 (03PS4) 10Urbanecm: Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[13:47:16] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[13:48:04] <wikibugs>	 (03Merged) 10jenkins-bot: Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey)
[13:48:07] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:49:01] <urbanecm>	 elukey: pulled to mwdebug1001 (not sure if it's testable there)
[13:49:22] <elukey>	 urbanecm: yeah I think you can go ahead, it is an event-gate specific thing, I am afraid
[13:49:28] <urbanecm>	 okay, syncing
[13:50:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2024.codfw.wmnet to cluster codfw and group A
[13:50:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:28] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) @Marostegui you're welcome
[13:51:52] <wikibugs>	 (03PS1) 10Volans: sre.hosts.decommission: fix switch matching [cookbooks] - 10https://gerrit.wikimedia.org/r/811721
[13:52:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Papaul) @wiki_willy you're welcome
[13:53:06] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] sre.hosts.decommission: fix switch matching [cookbooks] - 10https://gerrit.wikimedia.org/r/811721 (owner: 10Volans)
[13:53:21] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810007|Add a new Eventgate stream for revision-score events (T301878)]] (duration: 03m 46s)
[13:53:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:25] <stashbot>	 T301878: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878
[13:53:32] <urbanecm>	 elukey: it should be live now. anything else i can help with?
[13:54:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:41] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.addnode (exit_code=97) for new host ganeti2024.codfw.wmnet to cluster codfw and group A
[13:54:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:07] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811711 (https://phabricator.wikimedia.org/T311844) (owner: 10Muehlenhoff)
[13:55:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:55:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:55:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:55] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fix switch matching [cookbooks] - 10https://gerrit.wikimedia.org/r/811721 (owner: 10Volans)
[13:56:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:56:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:11] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:57:05] <elukey>	 urbanecm: nope thanks a lot!!
[13:57:10] <urbanecm>	 any time
[13:59:22] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.decommission: fix switch matching [cookbooks] - 10https://gerrit.wikimedia.org/r/811721 (owner: 10Volans)
[14:05:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudstore1008.wikimedia.org
[14:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:29] <wikibugs>	 (03CR) 10Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810106 (owner: 10Andrea Denisse)
[14:07:11] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:07:57] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:03] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:09:41] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:10:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:10:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:33] <wikibugs>	 (03CR) 10Muehlenhoff: Add PHP 7.4 dependencies for LibreNMS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810106 (owner: 10Andrea Denisse)
[14:11:57] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "One possible bug inline. Also missing tests." [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff)
[14:11:59] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:12:03] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48391 bytes in 0.299 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:13:13] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add conf100[789] in DNS SRV records [dns] - 10https://gerrit.wikimedia.org/r/811728 (https://phabricator.wikimedia.org/T311407)
[14:13:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:55] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:15:07] <wikibugs>	 (03CR) 10Muehlenhoff: Add a helper function to query the disk type of a VM (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff)
[14:15:20] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudstore1008.wikimedia.org
[14:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:29] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `cloudstore1008.wikimedia.org` - cloudst...
[14:16:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudstore1009.wikimedia.org
[14:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:26] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Assign conf100[789] roles and add them to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/811729 (https://phabricator.wikimedia.org/T311407)
[14:16:46] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
[14:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:14] <akosiaris>	 !log pool codfw for kartotherian T305845
[14:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:18] <stashbot>	 T305845: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845
[14:18:02] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/811714 (owner: 10Ayounsi)
[14:19:23] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Remove puppet refs for cloudstore1008/cloudstore1009 [puppet] - 10https://gerrit.wikimedia.org/r/811711 (https://phabricator.wikimedia.org/T311844) (owner: 10Muehlenhoff)
[14:19:53] <wikibugs>	 (03CR) 10Volans: [C: 04-1] Add a helper function to query the disk type of a VM (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff)
[14:20:42] <wikibugs>	 (03PS23) 10Ayounsi: Decom cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/803262
[14:20:43] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:20:44] <wikibugs>	 (03PS1) 10Ayounsi: provision cookbook: configure switches using cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/811730
[14:20:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809194 (owner: 10PipelineBot)
[14:21:10] <wikibugs>	 (03CR) 10Muehlenhoff: Add a helper function to query the disk type of a VM (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/811693 (https://phabricator.wikimedia.org/T312116) (owner: 10Muehlenhoff)
[14:21:36] <wikibugs>	 (03CR) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[14:21:43] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:31] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad
[14:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:43] <akosiaris>	 !log depool eqiad kartotherian T305845
[14:22:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:46] <stashbot>	 T305845: Re-import full planet data into codfw - https://phabricator.wikimedia.org/T305845
[14:24:58] <wikibugs>	 (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809194 (owner: 10PipelineBot)
[14:26:03] <wikibugs>	 (03CR) 10Ayounsi: [C: 04-1] "-1 for now as not sure if it's a good idea." [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi)
[14:26:50] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply
[14:26:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:15] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply
[14:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:19] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2006 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8805.service,thumbor@8809.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch image reports over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811324 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff)
[14:30:06] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply
[14:30:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:52] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[14:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:05] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[14:32:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:48] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[14:32:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:01] <icinga-wm>	 PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:42] <wikibugs>	 10SRE, 10ops-eqiad: SSH on wtp1040.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T312185 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson replaced the mgmt cable; this should take care of the flapping. If the problem persists, please re-open and ping me.
[14:37:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:37:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:00] <wikibugs>	 10SRE, 10ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (10Cmjohnson) 05Open→03Resolved replaced the mgmt cable; this should take care of the flapping. If the problem persists, please re-open and ping me.
[14:38:10] <wikibugs>	 10SRE, 10ops-eqiad: SSH on wtp1040.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T312185 (10ssingh) Thanks for the help @Cmjohnson!
[14:38:24] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch image builds over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811344 (https://phabricator.wikimedia.org/T298463)
[14:38:31] <icinga-wm>	 PROBLEM - Check no envoy runtime configuration is left persistent on mw1414 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[14:39:09] <icinga-wm>	 PROBLEM - Host ores1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:39:27] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudstore1009.wikimedia.org
[14:39:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:37] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `cloudstore1009.wikimedia.org` - cloudst...
[14:40:34] <wikibugs>	 (03CR) 10Klausman: ml-services: add some more revscoring services to staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/811313 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[14:41:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) (owner: 10Ahmon Dancy)
[14:41:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] scap: make scap::target require the scap class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809270 (https://phabricator.wikimedia.org/T310740) (owner: 10Ahmon Dancy)
[14:42:17] <icinga-wm>	 PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:18] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[14:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:30] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add sre.network.configure-switch-interfaces to dcops sudo [puppet] - 10https://gerrit.wikimedia.org/r/811714 (owner: 10Ayounsi)
[14:44:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Depool poolcounter1005 for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809990 (owner: 10Muehlenhoff)
[14:45:30] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10Cmjohnson)
[14:47:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10Cmjohnson) 05Open→03Resolved @cmooney the 2nd interface requires manual input, I mistakenly connected it to the mgmt port....
[14:47:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch image builds over to build2001 [puppet] - 10https://gerrit.wikimedia.org/r/811344 (https://phabricator.wikimedia.org/T298463) (owner: 10Muehlenhoff)
[14:49:40] <logmsgbot>	 !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 03m 33s)
[14:49:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:16] <wikibugs>	 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) Entirely disabling performance_schema on 10.6 got 10.6 and 10.4 (with P_S ON) to die at the same time (more or less) ar...
[14:50:37] <wikibugs>	 10SRE, 10Image-Suggestions: Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225 (10JMeybohm) Ingress needs SNI and Host header to be set properly in order to be able to serve the correct certificate and route the request accordingly.
[14:51:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:51:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:52:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:52:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:29] <akosiaris>	 !log reboot poolcounter1005 for kernel upgrades
[14:53:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:53:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:52] <wikibugs>	 (03PS1) 10JMeybohm: service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225)
[14:53:59] <icinga-wm>	 PROBLEM - Host poolcounter1005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:25] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:59] <cmjohnson1>	 !log moving switch ports cloudcephosd1021 from cloudsw1-c to cloudsw2-c T310546
[14:56:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:03] <stashbot>	 T310546: Recable cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546
[14:56:53] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Revert "Depool poolcounter1005 for reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811421
[14:56:59] <icinga-wm>	 RECOVERY - Host poolcounter1005 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[14:57:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "Depool poolcounter1005 for reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811421 (owner: 10Alexandros Kosiaris)
[14:59:48] <wikibugs>	 (03PS1) 10Ottomata: Upstream release 0.273.3 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/811735 (https://phabricator.wikimedia.org/T311525)
[15:00:19] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Upstream release 0.273.3 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/811735 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata)
[15:00:47] <logmsgbot>	 !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 03m 28s)
[15:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:26] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:03:08] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Depool poolcounter1004 for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811736
[15:03:16] <wikibugs>	 (03PS1) 10Jgiannelos: maps: Disable tilerator on codfw replicas [puppet] - 10https://gerrit.wikimedia.org/r/811737
[15:03:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:03:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: Add PHP 7.4 dependencies for LibreNMS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810106 (owner: 10Andrea Denisse)
[15:04:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:04:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:04:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:15] <moritzm>	 !log installing intel-microcode security updates
[15:05:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Depool poolcounter1004 for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811736 (owner: 10Alexandros Kosiaris)
[15:08:13] <wikibugs>	 (03PS2) 10JMeybohm: service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225)
[15:08:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:08:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:52] <icinga-wm>	 RECOVERY - Check no envoy runtime configuration is left persistent on mw1414 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[15:09:44] <logmsgbot>	 !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 03m 41s)
[15:09:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm)
[15:13:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:13:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:14:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[15:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:15:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:49] <wikibugs>	 (03PS1) 10Ottomata: analytics_cluster presto - reorg settings and unify configs for presto 0.273.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525)
[15:17:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] analytics_cluster presto - reorg settings and unify configs for presto 0.273.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata)
[15:17:50] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36200/console" [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata)
[15:21:08] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[15:22:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:02] <wikibugs>	 (03PS3) 10JMeybohm: service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225)
[15:24:16] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36201/console" [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm)
[15:24:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[15:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:17] <wikibugs>	 (03PS4) 10JMeybohm: service_proxy: Set SNI and Host header for ingress services [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225)
[15:28:18] <wikibugs>	 (03PS2) 10Ottomata: analytics_cluster presto - reorg settings and unify configs for presto 0.273.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525)
[15:29:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] analytics_cluster presto - reorg settings and unify configs for presto 0.273.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata)
[15:29:27] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36202/console" [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata)
[15:30:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[15:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:22] <icinga-wm>	 PROBLEM - Check systemd state on mw2387 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:37:30] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host dse-k8s-ctrl1001.eqiad.wmnet
[15:37:32] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[15:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:40:33] <wikibugs>	 (03PS5) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529)
[15:41:12] <wikibugs>	 (03PS3) 10Ottomata: presto - reorg settings and unify configs for presto 0.273.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525)
[15:41:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[15:41:36] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:41:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-ctrl1001.eqiad.wmnet on all recursors
[15:41:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:39] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-ctrl1001.eqiad.wmnet on all recursors
[15:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:05] <wikibugs>	 (03PS1) 10JMeybohm: service-proxy: Set SNI and Host header for ingress services [deployment-charts] - 10https://gerrit.wikimedia.org/r/811744 (https://phabricator.wikimedia.org/T312225)
[15:45:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:42] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:51] <wikibugs>	 (03PS6) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529)
[15:48:48] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff)
[15:50:13] <wikibugs>	 (03PS7) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529)
[15:51:19] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-ctrl1001.eqiad.wmnet
[15:51:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:42] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:45] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host dse-k8s-ctrl1002.eqiad.wmnet
[15:53:46] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[15:53:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:05] <wikibugs>	 (03CR) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[15:54:18] <icinga-wm>	 PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:44] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:57:44] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.wipe-cache dse-k8s-ctrl1002.eqiad.wmnet on all recursors
[15:57:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:47] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dse-k8s-ctrl1002.eqiad.wmnet on all recursors
[15:57:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:25] <wikibugs>	 (03PS8) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529)
[15:59:40] <wikibugs>	 (03PS9) 10Cathal Mooney: Add test in Netbox network report for port-block speeds on QFX5120 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/811314 (https://phabricator.wikimedia.org/T303529)
[16:00:14] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:07:27] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dse-k8s-ctrl1002.eqiad.wmnet
[16:07:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:44] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:11:51] <wikibugs>	 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (10BTullis) 05Open→03Resolved a:03BTullis All 3 VMs created successfully.  I've also mo...
[16:15:14] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:17] <wikibugs>	 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10BTullis) 05Open→03Resolved a:03BTullis Both VMs successfully created. I'll resolve this ticket an...
[16:17:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36203/console" [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm)
[16:17:10] <wikibugs>	 (03PS1) 10Btullis: Add DHCP boot entries for new dse-k8s VMs [puppet] - 10https://gerrit.wikimedia.org/r/811747 (https://phabricator.wikimedia.org/T310170)
[16:21:42] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/811733 (https://phabricator.wikimedia.org/T312225) (owner: 10JMeybohm)
[16:22:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:43] <wikibugs>	 (03PS1) 10Btullis: Add the new dse-k8s servers with the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/811749 (https://phabricator.wikimedia.org/T310170)
[16:27:22] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add DHCP boot entries for new dse-k8s VMs [puppet] - 10https://gerrit.wikimedia.org/r/811747 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis)
[16:29:30] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:22] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:02] <wikibugs>	 (03PS2) 10Btullis: Add the new dse-k8s servers with the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/811749 (https://phabricator.wikimedia.org/T310170)
[16:37:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:29] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add the new dse-k8s servers with the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/811749 (https://phabricator.wikimedia.org/T310170) (owner: 10Btullis)
[16:41:41] <wikibugs>	 (03PS1) 10JMeybohm: Use the generic service_proxy definition for envoy config [deployment-charts] - 10https://gerrit.wikimedia.org/r/811751
[16:41:43] <wikibugs>	 (03PS1) 10JMeybohm: Remove the need for charts to define services_procxy fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/811752
[16:42:04] <icinga-wm>	 PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:45:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:04] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:56] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Revert "Depool poolcounter1004 for reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811423
[16:59:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "Depool poolcounter1004 for reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811423 (owner: 10Alexandros Kosiaris)
[17:00:09] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Depool poolcounter1004 for reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811423 (owner: 10Alexandros Kosiaris)
[17:00:26] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:04:22] <icinga-wm>	 PROBLEM - Check systemd state on poolcounter1004 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:08] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:06:13] <inflatador>	 !log bking@cloudelastic1006 "restarting elastic services in preparation for cloudelastic reimage T309343"
[17:06:14] <logmsgbot>	 !log akosiaris@deploy1002 Synchronized wmf-config/ProductionServices.php: (no justification provided) (duration: 03m 38s)
[17:06:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:17] <stashbot>	 T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343
[17:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:06:56] <icinga-wm>	 RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:07:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:08:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:42] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10ORES, 10serviceops: Migrate ORES Redis servers to Stretch/Buster - https://phabricator.wikimedia.org/T224569 (10akosiaris) 05Open→03Resolved a:03akosiaris Done a long time ago. Now [misc_redis](https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc) is being us...
[17:10:48] <wikibugs>	 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10akosiaris)
[17:17:32] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:17:43] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] Remove the need for charts to define services_procxy fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/811752 (owner: 10JMeybohm)
[17:19:19] <wikibugs>	 (03PS2) 10Dzahn: admin: add gitlab-roots group to gitlab_runner role [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350)
[17:31:17] <wikibugs>	 (03CR) 10David Caro: novafullstack: Refactor and minor fix (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/811316 (owner: 10David Caro)
[17:31:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Recable cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson cloudcephosd1021 has been moved to cloudsw2, thanks to @cmooney for figuring o...
[17:31:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Recable cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson The server has been moved
[17:31:17] <wikibugs>	 (03PS1) 10Sergio Gimeno: GrowthExperiments: end mailing list campaign in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985)
[17:31:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn)
[17:32:13] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:32:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: add gitlab-roots group to gitlab_runner role [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn)
[17:32:59] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10AlexisJazz)
[17:33:02] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8811.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:33:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[17:36:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:38:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[17:41:22] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10AlexisJazz)
[17:42:18] <mutante>	 well, jenkins, -1 for reason "aborted" is unusual
[17:43:31] <sukhe>	 12m 43s runtime suggests it timed out?
[17:43:44] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10AlexisJazz)
[17:44:38] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10User-DannyS712: 502 error on beta commons - https://phabricator.wikimedia.org/T250103 (10AlexisJazz)
[17:44:58] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/809018 (https://phabricator.wikimedia.org/T308350) (owner: 10Dzahn)
[17:45:07] <sukhe>	 mutante: sorry but you taught me this :P
[17:46:38] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[17:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:59] <mutante>	 sukhe: hehe, thank you!
[17:49:12] <mutante>	 and it worked
[17:49:18] <sukhe>	 :P
[17:49:39] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:49:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr Can you move these servers out of wmcs rack and into a 10G rack. there is space in B2, D2...
[17:52:22] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "Merging, presto will not be auto-restarted, so I can do that after I upgrade the package." [puppet] - 10https://gerrit.wikimedia.org/r/811739 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata)
[17:53:20] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[17:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:58] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10Cmjohnson)
[17:54:57] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudstore100[89] - https://phabricator.wikimedia.org/T311844 (10Cmjohnson) 05Open→03Resolved These servers have been removed along with the storage arrays
[17:55:34] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cloudcephmon1002.eqiad.wmnet with reason: Moving racks
[17:55:36] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudcephmon1002.eqiad.wmnet with reason: Moving racks
[17:55:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:36] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:50] <icinga-wm>	 PROBLEM - Host cloudcephmon1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:00:04] <jouncebot>	 jnuche and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T1800).
[18:00:04] <jouncebot>	 jnuche and dduvall: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T1800). Please do the needful.
[18:02:07] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343
[18:02:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:11] <stashbot>	 T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343
[18:02:53] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] site/DHCP: decom doc1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/810400 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn)
[18:03:34] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] "This will turn off welcome survey at beta enwiki and eswiki; I don't think that's intended." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno)
[18:03:38] <mutante>	 Krinkle: ready to go? ok, then I'll delete that whole thing today
[18:04:10] <mutante>	 like..destroying the VM
[18:04:28] <wikibugs>	 (03PS1) 10Ottomata: Enable iceberg hive for presto [puppet] - 10https://gerrit.wikimedia.org/r/811759 (https://phabricator.wikimedia.org/T311525)
[18:06:10] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36204/console" [puppet] - 10https://gerrit.wikimedia.org/r/811759 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata)
[18:06:16] <wikibugs>	 (03CR) 10Ottomata: [WIP] Build spark assembly for Spark3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810951 (https://phabricator.wikimedia.org/T310578) (owner: 10Aqu)
[18:06:22] <icinga-wm>	 RECOVERY - Host cloudcephmon1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[18:07:01] <wikibugs>	 (03CR) 10Ottomata: [V: 03+1 C: 03+2] Enable iceberg hive for presto [puppet] - 10https://gerrit.wikimedia.org/r/811759 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata)
[18:07:16] <icinga-wm>	 PROBLEM - Host ms-be1065.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:59] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for DDesouza - https://phabricator.wikimedia.org/T312271 (10DDeSouza)
[18:10:10] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye
[18:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:15] <wikibugs>	 10SRE, 10DC-Ops, 10Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye
[18:11:50] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for DDesouza - https://phabricator.wikimedia.org/T312271 (10DDeSouza)
[18:13:42] <icinga-wm>	 RECOVERY - Host ms-be1065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[18:15:45] <wikibugs>	 SRE, ops-eqiad, Cloud-Services, DC-Ops, and 2 others: move cloudcephmon1002.eqiad.wmnet from rack B4 to rack D5 - https://phabricator.wikimedia.org/T304096 (Cmjohnson) Open→Resolved a:Cmjohnson The server has been moved to D5 and is accessible
[18:18:42] <wikibugs>	 SRE, ops-codfw, ops-eqiad, DC-Ops, Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (Cmjohnson) Open→Resolved a:Cmjohnson Management flapping will be an ongoing issue,  no need to keep this ticket open. If problems pers...
[18:27:43] <wikibugs>	 (PS1) Clare Ming: Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144)
[18:28:15] <wikibugs>	 (CR) CI reject: [V: -1] Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: Clare Ming)
[18:29:12] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (Cmjohnson) @RKemper @Marostegui @Dzahn @MoritzMuehlenhoff  @ssastry Do any of your servers require 10G?  I should be able to keep them all in row D, this would only be an in-row move and woul...
[18:30:02] <wikibugs>	 (PS2) Clare Ming: Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144)
[18:30:13] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:30:34] <wikibugs>	 (CR) CI reject: [V: -1] Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: Clare Ming)
[18:33:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:39:12] <wikibugs>	 (PS3) Clare Ming: Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144)
[18:45:24] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1003.wikimedia.org with OS bullseye
[18:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:29] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors...
[18:45:51] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343
[18:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:55] <stashbot>	 T309343: Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343
[18:47:35] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye
[18:47:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:40] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye
[18:47:42] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1003.wikimedia.org with OS bullseye
[18:47:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:44] <wikibugs>	 (CR) Jdlrobson: [C: +1] Enable sticky header edit A/B test for pilot wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: Clare Ming)
[18:47:47] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors...
[18:48:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye
[18:48:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:36] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye
[18:48:37] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1003.wikimedia.org with OS bullseye
[18:48:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:42] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors...
[18:51:28] <wikibugs>	 (PS4) Clare Ming: Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144)
[18:52:14] <wikibugs>	 (CR) Clare Ming: Enable sticky header edit A/B test for pilot wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: Clare Ming)
[18:56:27] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:25] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1003.wikimedia.org with OS bullseye
[19:00:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:31] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye
[19:00:40] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Traffic, IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (BCornwall) Thank you for doing that, @Volans ; I apologize for forgetting to run the cookbook.  I'm a little confused here regarding onl...
[19:03:09] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8803.service,thumbor@8805.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:07:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:08:25] <wikibugs>	 (PS1) Dbrant: Add sampling to android.breadcrumbs event stream. [mediawiki-config] - https://gerrit.wikimedia.org/r/811765 (https://phabricator.wikimedia.org/T310847)
[19:11:31] <wikibugs>	 (PS1) Ebernhardson: superset: Turn template processing back on [puppet] - https://gerrit.wikimedia.org/r/811766 (https://phabricator.wikimedia.org/T312134)
[19:12:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (nskaggs) Thank you @ayounsi !
[19:13:16] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1003.wikimedia.org with OS bullseye
[19:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:20] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors...
[19:15:56] <wikibugs>	 (CR) Ebernhardson: "Patch is based on https://github.com/apache/superset/issues/12487#issuecomment-759390836" [puppet] - https://gerrit.wikimedia.org/r/811766 (https://phabricator.wikimedia.org/T312134) (owner: Ebernhardson)
[19:16:27] <wikibugs>	 (PS5) Clare Ming: Enable sticky header edit A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144)
[19:32:21] <wikibugs>	 (PS6) Clare Ming: Enable sticky header edit A/B test for pilot wikis excluding idwiki/viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144)
[19:50:41] <wikibugs>	 (CR) Jdrewniak: [C: +1] Enable sticky header edit A/B test for pilot wikis excluding idwiki/viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: Clare Ming)
[19:54:01] <logmsgbot>	 !log bd808@mwmaint1002 Testing stashbot following deploy of [[gerrit:809732]]. This should be logged in SAL, but stashbot should not say that was done on irc.
[19:54:29] <bd808>	 Krinkle: ^ seems to have worked -- https://sal.toolforge.org/log/N8sT1YEBa_6PSCT9nOPQ
[19:56:20] <bd808>	 Stashbot should no longer ack !log messages sent here by logmsgbot. This is hoped to help reduce the noise in this channel a little bit.
[19:57:02] <bd808>	 if you find this to be a good thing, give Krinkle your praise. If you find it to be horrible, blame me for merging the change. ;)
[19:57:14] <urbanecm>	 bd808: I'm not sure that's a good idea. how will i know when stashbot is broken?
[19:58:56] <mutante>	 +1, seems like we would not notice when logs don't actually get logged
[19:59:08] <bd808>	 urbanecm: when things stop showing up on https://wikitech.wikimedia.org/wiki/Server_Admin_Log I guess. I'm open to reverting if folks find it actually bad in practice, but maybe we can give it a few days before deciding?
[19:59:45] <urbanecm>	 well, I likely won't check that page when doing deployments. a missing IRC message is easy to notice, as i monitor -operations during deployments anyway :)
[19:59:48] <bd808>	 it is currently only omitting the ack message when logmsgbot is the sender of the !log
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220706T2000).
[20:00:05] <jouncebot>	 cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:14] <cjming>	 o/
[20:00:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:24] <cjming>	 i will deploy
[20:00:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (nskaggs) Looks like those two tasks are complete (thanks @Cmjohnson !), and it seems netbox plans show connecting to cloudsw1* as suggested. Thanks!
[20:00:54] * urbanecm waves to cjming 
[20:01:12] <mutante>	 I wouldn't want to have to open SAL each time to check.
[20:01:25] <bd808>	 urbanecm: can you point to documentation of a time you noticed that stashbot was down based on activity in this channel? I do understand the concern, but I also can't recall the last report of the bot being broken coming from here.
[20:02:04] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (Cmjohnson) @ayounsi confirmed they're all 1G, I added the racks and U# to the timeslots
[20:02:32] <wikibugs>	 (CR) Clare Ming: [C: +2] Enable sticky header edit A/B test for pilot wikis excluding idwiki/viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: Clare Ming)
[20:03:21] <bd808>	 https://sal.toolforge.org/tools.stashbot actually shows very few forced restarts of stashbot in general in the last couple of years
[20:03:37] <wikibugs>	 (Merged) jenkins-bot: Enable sticky header edit A/B test for pilot wikis excluding idwiki/viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/811762 (https://phabricator.wikimedia.org/T311144) (owner: Clare Ming)
[20:03:39] * cjming waves to urbanecm
[20:03:51] <urbanecm>	 bd808: i can have a look, but it was in the form of getting someone via IRC to restart it, so that's hard to find.
[20:04:43] <bd808>	 urbanecm: fair enough. It is always easy to revert that change if there is actual value in the ack messages here.
[20:05:42] <RhinosF1>	 Even if it netsplits and it's just waiting to come back, now there's no indication
[20:05:52] <RhinosF1>	 People hide quit messages and it's easier to lose them
[20:07:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:09:07] <cjming>	 i'll hang around for a bit in case anyone still needs something deployed -- otherwise i'll close the backport window in 10-15
[20:09:26] <Krinkle>	 urbanecm: mutante: this only affects automated messages by logmsgbot, human !log will be ack'ed the same as before, nothing changes. If it's hiding it for humans, I've made a mistake.
[20:10:14] <Krinkle>	 e.g. reimage and db maintenance basically
[20:10:20] <urbanecm>	 i understand that, but even seeing the acks by stashbot to logmsgbot's !log is useful to ensure stashbot does log the logs to SAL
[20:10:38] <RhinosF1>	 Krinkle: isn't it all cookbooks
[20:10:38] <urbanecm>	 with this change, i need to either open SAL and check, or issue a manual !log to test it
[20:10:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:11:07] <RhinosF1>	 cjming: did you ever sync
[20:11:23] <urbanecm>	 if we want to hide the automated messages for some reason, I'd suggest merging logmsgbot and stashbot to a single bot, responsible for both logging to SAL and logging to IRC
[20:11:28] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:811762|Enable sticky header edit A/B test for pilot wikis excluding idwiki/viwiki (T311144)]] (duration: 03m 25s)
[20:11:31] <stashbot>	 T311144: Enable sticky header A/B test - https://phabricator.wikimedia.org/T311144
[20:11:37] <cjming>	 RhinosF1: i'm syncing now - looks like it just finished
[20:11:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:11:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:11:52] <Krinkle>	 urbanecm: I
[20:12:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:14:17] <wikibugs>	 (PS1) Cmjohnson: adding new wmcs hosts to netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/811771 (https://phabricator.wikimedia.org/T304888)
[20:15:14] <Krinkle>	 urbanecm: I don't think deployers should worry about SAL and afaik people generally do not look for the ack of automated messages to know that it made it there.
[20:15:38] <wikibugs>	 (CR) Cmjohnson: [C: +2] adding new wmcs hosts to netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/811771 (https://phabricator.wikimedia.org/T304888) (owner: Cmjohnson)
[20:15:47] <Krinkle>	 I also think for incidents etc we already go off the IRC log anyway, not SAL. Perhaps an Icinga alert would be useful to check that SAL is up.
[20:16:21] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:17:55] <urbanecm>	 I'm not sure about others, but for my deployments, when stashbot didn't ack the log (or reported an error), i paused to get that fixed somehow (as i don't like making changes that aren't properly logged, so other people know what i did)
[20:17:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (Cmjohnson)
[20:18:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (Cmjohnson)
[20:18:45] <mutante>	 an Icinga alert needs to actually notify someone though
[20:18:52] <mutante>	 or it's just going to sit there as unhandled crit 
[20:21:42] <mutante>	 yes, I do look for the ACK and SAL is the place to check what others did on a regular basis
[20:23:43] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1001.wikimedia.org with OS bullseye
[20:23:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1001.wikime...
[20:33:53] <icinga-wm>	 RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:35:40] <cjming>	 !log end of UTC late backport window
[20:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:11] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudrabbit1001.wikimedia.org with OS bullseye
[20:36:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye e...
[20:38:05] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1001.wikimedia.org with OS bullseye
[20:38:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye
[20:41:07] <icinga-wm>	 PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8801.service,thumbor@8802.service,thumbor@8803.service,thumbor@8806.service,thumbor@8808.service,thumbor@8810.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:43:42] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1002.wikimedia.org with OS bullseye
[20:43:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1002.wikimedia.org with OS bullseye
[20:43:49] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.wikimedia.org with OS bullseye
[20:43:56] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1005.eqiad.wmnet with OS bullseye
[20:43:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1003.wikimedia.org with OS bullseye
[20:44:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1005.eqiad.wmnet with OS bullseye
[20:44:03] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye
[20:44:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye
[20:44:11] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices1005.wikimedia.org with OS bullseye
[20:44:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudservices1005.wikimedia.org with OS bull...
[20:59:26] <wikibugs>	 (CR) Bearloga: [C: +1] superset: Turn template processing back on [puppet] - https://gerrit.wikimedia.org/r/811766 (https://phabricator.wikimedia.org/T312134) (owner: Ebernhardson)
[20:59:36] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudrabbit1001.wikimedia.org with OS bullseye
[20:59:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye e...
[21:01:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (Cmjohnson) I am getting this on all but the cloudnets, those are not hitting the installer.   ────────────────────┤ [!!] Configure the network ├─...
[21:08:43] <wikibugs>	 SRE, DC-Ops, Discovery-Search (Current work): Upgrade cloudelastic clusters to Debian Bullseye - https://phabricator.wikimedia.org/T309343 (bking)
[21:11:43] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:15:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:54] <wikibugs>	 (03PS2) 10Dbrant: Add sampling to android.breadcrumbs event stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811765 (https://phabricator.wikimedia.org/T310847)
[21:22:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:30:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:54] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudrabbit1002.wikimedia.org with OS bullseye
[21:39:59] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudrabbit1003.wikimedia.org with OS bullseye
[21:40:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1002.wikimedia.org with OS bullseye e...
[21:40:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1003.wikimedia.org with OS bullseye e...
[21:40:12] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1005.eqiad.wmnet with OS bullseye
[21:40:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1005.eqiad.wmnet with OS bullseye execut...
[21:40:19] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye
[21:40:23] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices1005.wikimedia.org with OS bullseye
[21:40:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye execut...
[21:40:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudservices1005.wikimedia.org with OS bullseye...
[21:49:21] <wikibugs>	 SRE, Data-Engineering, Event-Platform, serviceops, Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (JArguello-WMF)
[21:49:36] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox: varnishkafka / ATSkafka should support setting the kafka message timestamp - https://phabricator.wikimedia.org/T277553 (JArguello-WMF)
[21:49:46] <wikibugs>	 SRE, Data-Engineering: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (JArguello-WMF)
[22:00:54] <wikibugs>	 SRE, DNS, Traffic, WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Varnent) >>! In T310738#8035453, @Dzahn wrote: >>>! In T310738#8033789, @LSobanski wrote: >> @Varnent After chatting about this...
[22:13:03] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:21:59] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, and 4 others: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (Dzahn) The change has been approved and then deployed. On gitlab-runner1002 I saw puppet ad...
[22:26:04] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (Jclark-ctr) @ayounsi  host will be moved tomorrow morning  When i started racking task i went by  Racking Proposal: Place in WMCS racks. Place...
[22:31:23] <wikibugs>	 (PS1) BCornwall: varnish: Enable Prometheus sysctl exporting [puppet] - https://gerrit.wikimedia.org/r/811780
[22:31:48] <wikibugs>	 (PS2) BCornwall: varnish: Enable Prometheus sysctl exporting [puppet] - https://gerrit.wikimedia.org/r/811780 (https://phabricator.wikimedia.org/T311445)
[22:34:18] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, and 4 others: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (Dzahn) Open→Resolved ` [gitlab-runner1002:~] $ for relenguser in brennen dancy dduv...
[22:35:50] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@debd402]: airflow dags to generate subgraph and query mapping along with their metrics
[22:37:51] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@debd402]: airflow dags to generate subgraph and query mapping along with their metrics (duration: 02m 01s)
[22:50:31] <wikibugs>	 (CR) Dzahn: [C: +2] gitlab/acme_chief: remove gitlab1001 from list of (passive) hosts [puppet] - https://gerrit.wikimedia.org/r/802822 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[22:52:05] <ebernhardson>	 !log restart airflow-webserver and airflow-scheduler for plugins update on an-airflow1001
[22:52:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:39] <mutante>	 !log etherpad - deleted 2 pads that had leaked information
[22:52:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:04] <wikibugs>	 (CR) Dzahn: [C: +2] "this removed snippets from /etc/rsyslog.d/, like /etc/rsyslog.d/20-rsync-data-backup-gitlab1001-wikimedia-org.conf from gitlab1004" [puppet] - https://gerrit.wikimedia.org/r/802822 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[23:00:01] <mutante>	 !log gitlab1004 - rm /lib/systemd/system/rsync-config-backup-gitlab1001*  T307142
[23:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:04] <stashbot>	 T307142: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142
[23:00:29] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:22] <wikibugs>	 (CR) Dzahn: [C: +2] DHCP: remove gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[23:03:26] <wikibugs>	 (PS2) Dzahn: DHCP: remove gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142)
[23:07:30] <wikibugs>	 (PS3) Dzahn: DHCP: remove gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142)
[23:07:43] <wikibugs>	 (PS1) Dzahn: site/gitlab: remove gitlab1001, update comments [puppet] - https://gerrit.wikimedia.org/r/811782 (https://phabricator.wikimedia.org/T307142)
[23:07:51] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:45] <wikibugs>	 (PS4) Dzahn: DHCP: remove gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142)
[23:10:44] <wikibugs>	 (CR) Dzahn: [C: +2] DHCP: remove gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[23:14:37] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:15:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:57] <wikibugs>	 (Abandoned) Dzahn: site: remove gitlab1001, adjust gitlab machine descriptions [puppet] - https://gerrit.wikimedia.org/r/802846 (https://phabricator.wikimedia.org/T307142) (owner: Dzahn)
[23:20:17] <wikibugs>	 (PS4) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[23:21:59] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor@8802.service,thumbor@8803.service,thumbor@8804.service,thumbor@8805.service,thumbor@8807.service,thumbor@8812.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:22:31] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:24:30] <wikibugs>	 (PS5) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[23:25:50] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts gitlab1001.wikimedia.org
[23:30:29] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[23:48:20] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@5082f17]: increase subgraph_mapping_weekly executor memory
[23:50:26] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@5082f17]: increase subgraph_mapping_weekly executor memory (duration: 02m 05s)