[00:03:36] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01028 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:16:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[00:16:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[00:16:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21053 and previous config saved to /var/cache/conftool/dbconfig/20220221-001641-ladsgroup.json
[00:16:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:49] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[00:36:32] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:41:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21054 and previous config saved to /var/cache/conftool/dbconfig/20220221-004128-ladsgroup.json
[00:41:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:34] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[00:56:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P21055 and previous config saved to /var/cache/conftool/dbconfig/20220221-005632-ladsgroup.json
[00:56:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P21056 and previous config saved to /var/cache/conftool/dbconfig/20220221-011137-ladsgroup.json
[01:11:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:48] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[01:20:04] <icinga-wm>	 PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:26:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21057 and previous config saved to /var/cache/conftool/dbconfig/20220221-012642-ladsgroup.json
[01:26:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[01:26:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[01:26:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:49] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[01:26:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298554)', diff saved to https://phabricator.wikimedia.org/P21058 and previous config saved to /var/cache/conftool/dbconfig/20220221-012649-ladsgroup.json
[01:26:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298554)', diff saved to https://phabricator.wikimedia.org/P21059 and previous config saved to /var/cache/conftool/dbconfig/20220221-013429-ladsgroup.json
[01:34:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:35] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[01:38:00] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:38:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[01:38:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[01:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T302185)', diff saved to https://phabricator.wikimedia.org/P21060 and previous config saved to /var/cache/conftool/dbconfig/20220221-013811-ladsgroup.json
[01:38:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:18] <stashbot>	 T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[01:39:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2152.codfw.wmnet with OS bullseye
[01:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:23] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:46:36] <icinga-wm>	 RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[01:49:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P21061 and previous config saved to /var/cache/conftool/dbconfig/20220221-014934-ladsgroup.json
[01:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2152.codfw.wmnet with reason: host reimage
[01:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2152.codfw.wmnet with reason: host reimage
[01:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:04:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P21062 and previous config saved to /var/cache/conftool/dbconfig/20220221-020438-ladsgroup.json
[02:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:13:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2152.codfw.wmnet with OS bullseye
[02:13:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:32] <wikibugs>	 (PS1) Ladsgroup: Add add_linter_template_T300402.py [software/schema-changes] - https://gerrit.wikimedia.org/r/763986 (https://phabricator.wikimedia.org/T300402)
[02:16:17] <wikibugs>	 (PS2) Ladsgroup: Add add_linter_template_T300992.py [software/schema-changes] - https://gerrit.wikimedia.org/r/763986 (https://phabricator.wikimedia.org/T300992)
[02:19:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298554)', diff saved to https://phabricator.wikimedia.org/P21063 and previous config saved to /var/cache/conftool/dbconfig/20220221-021943-ladsgroup.json
[02:19:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[02:19:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[02:19:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:19:52] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[02:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:02] <icinga-wm>	 RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:22:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T302185)', diff saved to https://phabricator.wikimedia.org/P21064 and previous config saved to /var/cache/conftool/dbconfig/20220221-022259-ladsgroup.json
[02:23:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:05] <stashbot>	 T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[02:31:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2091.codfw.wmnet with reason: Maintenance
[02:31:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2091.codfw.wmnet with reason: Maintenance
[02:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2091 (T302185)', diff saved to https://phabricator.wikimedia.org/P21065 and previous config saved to /var/cache/conftool/dbconfig/20220221-023158-ladsgroup.json
[02:31:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:04] <stashbot>	 T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[02:32:53] <jinxer-wm>	 (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[02:33:37] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:34:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2091.codfw.wmnet with OS bullseye
[02:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[02:38:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[02:38:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[02:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[02:38:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298554)', diff saved to https://phabricator.wikimedia.org/P21066 and previous config saved to /var/cache/conftool/dbconfig/20220221-023852-ladsgroup.json
[02:38:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:01] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[02:49:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2091.codfw.wmnet with reason: host reimage
[02:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2091.codfw.wmnet with reason: host reimage
[02:53:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:55:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298554)', diff saved to https://phabricator.wikimedia.org/P21067 and previous config saved to /var/cache/conftool/dbconfig/20220221-025534-ladsgroup.json
[02:55:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:55:40] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[03:08:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2091.codfw.wmnet with OS bullseye
[03:08:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P21068 and previous config saved to /var/cache/conftool/dbconfig/20220221-031039-ladsgroup.json
[03:10:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2091 (T302185)', diff saved to https://phabricator.wikimedia.org/P21069 and previous config saved to /var/cache/conftool/dbconfig/20220221-031602-ladsgroup.json
[03:16:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:09] <stashbot>	 T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[03:25:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2084.codfw.wmnet with reason: Maintenance
[03:25:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2084.codfw.wmnet with reason: Maintenance
[03:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P21070 and previous config saved to /var/cache/conftool/dbconfig/20220221-032548-ladsgroup.json
[03:25:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2084 (T302185)', diff saved to https://phabricator.wikimedia.org/P21071 and previous config saved to /var/cache/conftool/dbconfig/20220221-032548-ladsgroup.json
[03:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:00] <stashbot>	 T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[03:28:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2084.codfw.wmnet with OS bullseye
[03:28:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:39:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2084.codfw.wmnet with reason: host reimage
[03:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:40:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298554)', diff saved to https://phabricator.wikimedia.org/P21072 and previous config saved to /var/cache/conftool/dbconfig/20220221-034052-ladsgroup.json
[03:40:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[03:40:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[03:40:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:40:58] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[03:41:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298554)', diff saved to https://phabricator.wikimedia.org/P21073 and previous config saved to /var/cache/conftool/dbconfig/20220221-034100-ladsgroup.json
[03:41:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:41:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:41:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:42:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2084.codfw.wmnet with reason: host reimage
[03:42:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:46:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:47:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:48:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298554)', diff saved to https://phabricator.wikimedia.org/P21074 and previous config saved to /var/cache/conftool/dbconfig/20220221-034836-ladsgroup.json
[03:48:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:48:43] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[03:56:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2084.codfw.wmnet with OS bullseye
[03:56:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:03:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P21075 and previous config saved to /var/cache/conftool/dbconfig/20220221-040341-ladsgroup.json
[04:03:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:11:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2084 (T302185)', diff saved to https://phabricator.wikimedia.org/P21076 and previous config saved to /var/cache/conftool/dbconfig/20220221-041123-ladsgroup.json
[04:11:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:11:30] <stashbot>	 T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[04:15:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2083.codfw.wmnet with reason: Maintenance
[04:15:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2083.codfw.wmnet with reason: Maintenance
[04:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:15:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2083 (T302185)', diff saved to https://phabricator.wikimedia.org/P21077 and previous config saved to /var/cache/conftool/dbconfig/20220221-041529-ladsgroup.json
[04:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:15:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2083.codfw.wmnet with OS bullseye
[04:16:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P21078 and previous config saved to /var/cache/conftool/dbconfig/20220221-041846-ladsgroup.json
[04:18:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:30:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2083.codfw.wmnet with reason: host reimage
[04:30:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:33:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298554)', diff saved to https://phabricator.wikimedia.org/P21079 and previous config saved to /var/cache/conftool/dbconfig/20220221-043350-ladsgroup.json
[04:33:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[04:33:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[04:33:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:33:56] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[04:33:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21080 and previous config saved to /var/cache/conftool/dbconfig/20220221-043358-ladsgroup.json
[04:33:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2083.codfw.wmnet with reason: host reimage
[04:34:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:26] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:39:32] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:48:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2083.codfw.wmnet with OS bullseye
[04:48:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2083 (T302185)', diff saved to https://phabricator.wikimedia.org/P21081 and previous config saved to /var/cache/conftool/dbconfig/20220221-045516-ladsgroup.json
[04:55:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:23] <stashbot>	 T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[05:00:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21082 and previous config saved to /var/cache/conftool/dbconfig/20220221-050050-ladsgroup.json
[05:00:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:56] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[05:15:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P21083 and previous config saved to /var/cache/conftool/dbconfig/20220221-051555-ladsgroup.json
[05:15:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P21084 and previous config saved to /var/cache/conftool/dbconfig/20220221-053059-ladsgroup.json
[05:31:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:46:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21085 and previous config saved to /var/cache/conftool/dbconfig/20220221-054604-ladsgroup.json
[05:46:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[05:46:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[05:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:11] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[05:46:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298554)', diff saved to https://phabricator.wikimedia.org/P21086 and previous config saved to /var/cache/conftool/dbconfig/20220221-054612-ladsgroup.json
[05:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:13] <wikibugs>	 SRE, ops-eqiad, DBA: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T302169 (Marostegui) p:Triage→Medium Disk #3 is gone: ` # megacli -PDList -aALL | grep Slot Slot Number: 0 Slot Number: 1 Slot Number: 2 Slot Number: 4 Slot Number: 5 Slot Number: 6 Slot Number: 7 Slot Number: 8...
[06:07:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298554)', diff saved to https://phabricator.wikimedia.org/P21087 and previous config saved to /var/cache/conftool/dbconfig/20220221-060701-ladsgroup.json
[06:07:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:08] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[06:07:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[06:08:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[06:08:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T300775)', diff saved to https://phabricator.wikimedia.org/P21088 and previous config saved to /var/cache/conftool/dbconfig/20220221-060804-marostegui.json
[06:08:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:11] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[06:11:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[06:12:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[06:12:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T300381)', diff saved to https://phabricator.wikimedia.org/P21089 and previous config saved to /var/cache/conftool/dbconfig/20220221-061205-marostegui.json
[06:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:12] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[06:13:35] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:14:11] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:17:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300381)', diff saved to https://phabricator.wikimedia.org/P21090 and previous config saved to /var/cache/conftool/dbconfig/20220221-061719-marostegui.json
[06:17:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:26] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[06:18:43] <wikibugs>	 (03PS1) 10Marostegui: db1107: Move from m3 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/764111 (https://phabricator.wikimedia.org/T301654)
[06:20:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1107.eqiad.wmnet with OS bullseye
[06:20:17] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1107: Move from m3 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/764111 (https://phabricator.wikimedia.org/T301654) (owner: 10Marostegui)
[06:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P21091 and previous config saved to /var/cache/conftool/dbconfig/20220221-062206-ladsgroup.json
[06:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1107.eqiad.wmnet with reason: host reimage
[06:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1107.eqiad.wmnet with reason: host reimage
[06:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21092 and previous config saved to /var/cache/conftool/dbconfig/20220221-063223-marostegui.json
[06:32:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:53] <jinxer-wm>	 (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[06:37:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P21093 and previous config saved to /var/cache/conftool/dbconfig/20220221-063713-ladsgroup.json
[06:37:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:34] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Add add_linter_template_T300992.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/763986 (https://phabricator.wikimedia.org/T300992) (owner: 10Ladsgroup)
[06:41:46] <marostegui>	  !log Stop mysql on db1117:3325 to clone db1107 - T301654
[06:41:46] <stashbot>	 T301654: Upgrade m5 to Bullseye - https://phabricator.wikimedia.org/T301654
[06:46:03] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:46:25] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:46:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1107.eqiad.wmnet with OS bullseye
[06:46:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:24] <marostegui>	 haproxy alerts are expected
[06:47:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21095 and previous config saved to /var/cache/conftool/dbconfig/20220221-064728-marostegui.json
[06:47:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:49] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Add add_linter_template_T300992.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/763986 (https://phabricator.wikimedia.org/T300992) (owner: 10Ladsgroup)
[06:48:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T302169 (10wiki_willy) a:03Cmjohnson
[06:48:21] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: cloudvirt1017.mgmt/SSH - https://phabricator.wikimedia.org/T302016 (10wiki_willy) a:03Cmjohnson
[06:48:44] <wikibugs>	 (03Merged) 10jenkins-bot: Add add_linter_template_T300992.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/763986 (https://phabricator.wikimedia.org/T300992) (owner: 10Ladsgroup)
[06:49:09] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:49:33] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:50:25] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:34] <wikibugs>	 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10wiki_willy) Hi @ayounsi - I'm not sure if you're copied on the Interxion ticket, so just forwarding the info along that they completed th...
[06:52:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298554)', diff saved to https://phabricator.wikimedia.org/P21096 and previous config saved to /var/cache/conftool/dbconfig/20220221-065220-ladsgroup.json
[06:52:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[06:52:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[06:52:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:26] <stashbot>	 T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[06:52:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance
[06:53:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance
[06:53:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:27] <wikibugs>	 (03CR) 10Elukey: install_server: set new partman recipe for kubestage1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm)
[07:01:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Add overlayfs settings for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763757 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm)
[07:02:12] <wikibugs>	 (03PS2) 10Elukey: install_server: set new partman recipe for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm)
[07:02:32] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] install_server: set new partman recipe for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm)
[07:02:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300381)', diff saved to https://phabricator.wikimedia.org/P21097 and previous config saved to /var/cache/conftool/dbconfig/20220221-070233-marostegui.json
[07:02:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[07:02:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[07:02:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:39] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[07:02:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21098 and previous config saved to /var/cache/conftool/dbconfig/20220221-070240-marostegui.json
[07:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:41] <wikibugs>	 (03PS3) 10Elukey: install_server: set new partman recipe for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm)
[07:04:43] <wikibugs>	 (03PS2) 10Elukey: Add overlayfs settings for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763757 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm)
[07:07:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add etwiki and fawiki editquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/763773 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze)
[07:08:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21099 and previous config saved to /var/cache/conftool/dbconfig/20220221-070822-marostegui.json
[07:08:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:29] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[07:08:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[07:08:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[07:08:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[07:09:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[07:09:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:25] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[07:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:20] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[07:11:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[07:11:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[07:11:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[07:15:59] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] install_server: set new partman recipe for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm)
[07:18:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[07:20:45] <wikibugs>	 (03PS1) 10Elukey: install_server: move kubestage[12]* nodes to overlayfs partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/764291 (https://phabricator.wikimedia.org/T300744)
[07:23:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21100 and previous config saved to /var/cache/conftool/dbconfig/20220221-072326-marostegui.json
[07:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
[07:30:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
[07:30:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[07:30:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[07:30:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:11] <icinga-wm>	 RECOVERY - Host asw1-b13-drmrs.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 85.47 ms
[07:34:11] <icinga-wm>	 RECOVERY - Host ncredir6002 is UP: PING OK - Packet loss = 0%, RTA = 85.63 ms
[07:34:11] <icinga-wm>	 RECOVERY - Host netflow6001 is UP: PING OK - Packet loss = 0%, RTA = 92.12 ms
[07:34:29] <icinga-wm>	 RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.drmrs.wikimedia.org, port=443): Read timed out. (read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase
[07:34:43] <icinga-wm>	 RECOVERY - Recursive DNS on 2a02:ec80:600:2:185:15:58:37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[07:34:47] <icinga-wm>	 RECOVERY - Host prometheus6001 is UP: PING OK - Packet loss = 0%, RTA = 86.60 ms
[07:34:55] <wikibugs>	 (03PS1) 10Marostegui: dbstore_multiinstance.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/764292 (https://phabricator.wikimedia.org/T268869)
[07:35:09] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] install_server: move kubestage[12]* nodes to overlayfs partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/764291 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[07:35:23] <jinxer-wm>	 (JobUnavailable) firing: (22) Reduced availability for job bird in drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[07:35:27] <icinga-wm>	 RECOVERY - Maps edge drmrs on upload-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook
[07:35:46] <jinxer-wm>	 (ThanosSidecarBucketOperationsFailed) firing: Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org
[07:36:53] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] dbstore_multiinstance.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/764292 (https://phabricator.wikimedia.org/T268869) (owner: 10Marostegui)
[07:37:27] <wikibugs>	 (03PS1) 10Elukey: Add overlayfs settings to kubestage2002's settings [puppet] - 10https://gerrit.wikimedia.org/r/764293 (https://phabricator.wikimedia.org/T300744)
[07:38:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21101 and previous config saved to /var/cache/conftool/dbconfig/20220221-073831-marostegui.json
[07:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:39] <wikibugs>	 (03PS1) 10Marostegui: misc.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/764294 (https://phabricator.wikimedia.org/T268869)
[07:39:13] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add overlayfs settings to kubestage2002's settings [puppet] - 10https://gerrit.wikimedia.org/r/764293 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey)
[07:39:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] misc.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/764294 (https://phabricator.wikimedia.org/T268869) (owner: 10Marostegui)
[07:40:07] <marostegui>	 elukey: ok to merge your change?
[07:40:21] <MatmaRex>	 aww, no backport window today? :(
[07:40:39] <elukey>	 marostegui: <3
[07:40:44] <marostegui>	 done!
[07:40:46] <jinxer-wm>	 (ThanosSidecarBucketOperationsFailed) resolved: Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org
[07:43:51] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005895 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[07:48:38] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS bullseye
[07:48:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:54] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: add fiwiki & frwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/764295 (https://phabricator.wikimedia.org/T301415)
[07:53:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21102 and previous config saved to /var/cache/conftool/dbconfig/20220221-075336-marostegui.json
[07:53:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[07:53:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:42] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[07:53:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[07:53:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[07:53:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[07:53:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:59] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki
[07:57:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[07:57:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[07:57:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21103 and previous config saved to /var/cache/conftool/dbconfig/20220221-075800-marostegui.json
[07:58:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:21] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[07:59:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220221T0800)
[08:00:23] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[08:01:44] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1123: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/763971
[08:02:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1123: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/763971 (owner: 10Marostegui)
[08:02:47] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove LDAP access for Delphine [puppet] - 10https://gerrit.wikimedia.org/r/764297
[08:02:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21104 and previous config saved to /var/cache/conftool/dbconfig/20220221-080248-marostegui.json
[08:02:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:53] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[08:05:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for Delphine [puppet] - 10https://gerrit.wikimedia.org/r/764297 (owner: 10Muehlenhoff)
[08:05:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add fiwiki & frwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/764295 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira)
[08:05:32] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove LDAP access for Delphine [puppet] - 10https://gerrit.wikimedia.org/r/764297
[08:07:19] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage
[08:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:58] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage
[08:10:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:23] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[08:10:33] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[08:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:45] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[08:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:03] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:13:09] <wikibugs>	 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10ayounsi) I can confirm that (1), (2) and (4) are done.  However cr2-drmrs is currently fully down (console is dead as well). My guess is...
[08:14:03] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:14:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[08:14:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[08:17:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21105 and previous config saved to /var/cache/conftool/dbconfig/20220221-081752-marostegui.json
[08:17:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[08:21:11] <wikibugs>	 (03PS1) 10Marostegui: control-mariadb-10.4: Bump version [software] - 10https://gerrit.wikimedia.org/r/764299
[08:21:56] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2002.codfw.wmnet with OS bullseye
[08:21:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:10] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.4: Bump version [software] - 10https://gerrit.wikimedia.org/r/764299 (owner: 10Marostegui)
[08:22:44] <godog>	 !log update karma to 0.99 on alert* hosts - T284213
[08:22:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:49] <stashbot>	 T284213: Improve AlertManager dashboard - https://phabricator.wikimedia.org/T284213
[08:23:16] <wikibugs>	 (03Merged) 10jenkins-bot: control-mariadb-10.4: Bump version [software] - 10https://gerrit.wikimedia.org/r/764299 (owner: 10Marostegui)
[08:30:47] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: k8s: add module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto)
[08:32:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21106 and previous config saved to /var/cache/conftool/dbconfig/20220221-083257-marostegui.json
[08:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:05] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:38:11] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes-staging,service=kubesvc
[08:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:05] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for rhuang-ctr [puppet] - 10https://gerrit.wikimedia.org/r/764302
[08:43:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove access for rhuang-ctr [puppet] - 10https://gerrit.wikimedia.org/r/764302 (owner: 10Muehlenhoff)
[08:48:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21107 and previous config saved to /var/cache/conftool/dbconfig/20220221-084802-marostegui.json
[08:48:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[08:48:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[08:48:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:08] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[08:48:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1009.eqiad.wmnet with OS buster
[08:50:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1009.eqiad.wmnet with OS buster
[08:51:40] <wikibugs>	 (03PS1) 10Marostegui: change_fa_id_T298294.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/764303 (https://phabricator.wikimedia.org/T298294)
[08:52:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:52:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:52:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[08:57:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[08:57:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:57:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:57:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T300381)', diff saved to https://phabricator.wikimedia.org/P21108 and previous config saved to /var/cache/conftool/dbconfig/20220221-085745-marostegui.json
[08:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:55] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[08:58:43] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: route source=icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/764304 (https://phabricator.wikimedia.org/T300951)
[08:59:20] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] change_fa_id_T298294.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/764303 (https://phabricator.wikimedia.org/T298294) (owner: 10Marostegui)
[09:00:17] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] change_fa_id_T298294.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/764303 (https://phabricator.wikimedia.org/T298294) (owner: 10Marostegui)
[09:00:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: remove Icinga/ prefix and add 'source' label [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763459 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi)
[09:00:57] <wikibugs>	 (03Merged) 10jenkins-bot: change_fa_id_T298294.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/764303 (https://phabricator.wikimedia.org/T298294) (owner: 10Marostegui)
[09:01:26] <wikibugs>	 (03PS3) 10Elukey: Add overlayfs settings for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763757 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm)
[09:01:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] thanos: add relabels to rule [puppet] - 10https://gerrit.wikimedia.org/r/763697 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi)
[09:01:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: inject 'source' label to alerts [puppet] - 10https://gerrit.wikimedia.org/r/763698 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi)
[09:02:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: route source=icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/764304 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi)
[09:02:59] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add overlayfs settings for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763757 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm)
[09:03:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300381)', diff saved to https://phabricator.wikimedia.org/P21109 and previous config saved to /var/cache/conftool/dbconfig/20220221-090305-marostegui.json
[09:03:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:12] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[09:03:29] <godog>	 elukey: merged your change too
[09:03:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1009.eqiad.wmnet with reason: host reimage
[09:03:35] <elukey>	 <3
[09:03:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:24] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage1003.eqiad.wmnet with OS bullseye
[09:04:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1009.eqiad.wmnet with reason: host reimage
[09:06:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:31] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:10:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:14:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[09:14:36] <elukey>	 this is me --^
[09:15:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:16:53] <wikibugs>	 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10ayounsi) I gave a call to Tarek: the power cord on cr2 was faulty, but he was able to find 2 spare ones which he will bill on the ticket....
[09:18:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21110 and previous config saved to /var/cache/conftool/dbconfig/20220221-091809-marostegui.json
[09:18:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:53] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1003.eqiad.wmnet with reason: host reimage
[09:20:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:00] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[09:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:05] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-staging2001.codfw.wmnet with OS bullseye
[09:22:07] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-staging2001.codfw.wmnet with OS bullseye
[09:22:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:26] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-staging2001.codfw.wmnet with OS bullseye
[09:22:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21111 and previous config saved to /var/cache/conftool/dbconfig/20220221-092226-root.json
[09:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:37] <wikibugs>	 (03PS16) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774)
[09:24:01] <godog>	 !log deploy prometheus-icinga-exporter 0.19 - T300951
[09:24:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:06] <stashbot>	 T300951: Add 'source' tag to icinga and prometheus/thanos alerts - https://phabricator.wikimedia.org/T300951
[09:24:16] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] r_lang::bioc: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro)
[09:24:18] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1003.eqiad.wmnet with reason: host reimage
[09:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:37] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] product_analytics: remove unused profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/751704 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro)
[09:25:06] <wikibugs>	 (03CR) 10David Caro: r_lang::bioc: remove unused module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro)
[09:26:03] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "I have rolled back to patchset 14, preallocating data does not seem to speed up disk writes and we use --snapshot so writes are done entir" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar)
[09:27:01] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro)
[09:27:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: pass extinfo-url to icinga-exporter [puppet] - 10https://gerrit.wikimedia.org/r/763457 (https://phabricator.wikimedia.org/T300859) (owner: 10Filippo Giunchedi)
[09:27:17] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "minor style nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar)
[09:30:21] <wikibugs>	 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) Thanks for the swift turnarounds on these!
[09:33:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21112 and previous config saved to /var/cache/conftool/dbconfig/20220221-093314-marostegui.json
[09:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1009.eqiad.wmnet with OS buster
[09:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1009.eqiad.wmnet with OS buster completed: - ganeti1009 (**PASS**)...
[09:34:09] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1003.eqiad.wmnet with OS bullseye
[09:34:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:36:48] <wikibugs>	 (03PS17) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774)
[09:36:51] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879)
[09:36:53] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] ci: Qemu image and snapshot creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar)
[09:37:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21113 and previous config saved to /var/cache/conftool/dbconfig/20220221-093729-root.json
[09:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:11] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage
[09:38:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:18] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:39:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[09:40:04] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/763792 (https://phabricator.wikimedia.org/T227782) (owner: 10Bearloga)
[09:40:50] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro)
[09:41:30] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage
[09:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[09:43:19] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:43:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto)
[09:45:37] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes-staging,service=kubesvc
[09:45:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[09:48:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300381)', diff saved to https://phabricator.wikimedia.org/P21114 and previous config saved to /var/cache/conftool/dbconfig/20220221-094819-marostegui.json
[09:48:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[09:48:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[09:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:25] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[09:48:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21115 and previous config saved to /var/cache/conftool/dbconfig/20220221-094826-marostegui.json
[09:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:48] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (10fgiunchedi) +SRE for visibility
[09:51:17] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:51:18] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:51:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:23] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21116 and previous config saved to /var/cache/conftool/dbconfig/20220221-095122-kormat.json
[09:51:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:29] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[09:51:37] <kormat>	 !log running schema change against s7 T300774
[09:51:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:28] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2001.codfw.wmnet with OS bullseye
[09:52:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21117 and previous config saved to /var/cache/conftool/dbconfig/20220221-095233-root.json
[09:52:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21118 and previous config saved to /var/cache/conftool/dbconfig/20220221-095410-marostegui.json
[09:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:15] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[09:55:49] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:56:30] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:57:32] <moritzm>	 !log installing PHP 7.4 security updates (as packaged in Debian)
[09:57:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:50] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:59:41] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.05469 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[10:01:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:01:01] <marostegui>	 !log Rebuild templatelinks table on s2 codfw master (db2104), lag to be expected on codfw T301848
[10:01:01] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "AIUI the task manager should use no more than task_manager_mem (taskmanager.memory.process.size) memory, right?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/763754 (owner: 10DCausse)
[10:01:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:06] <stashbot>	 T301848: Check for compressed templatelinks tables - https://phabricator.wikimedia.org/T301848
[10:02:20] <volans>	 dcaro: it seems that r_lang::bioc is still used (was just removed by f4efb35f63) and triggering the above Widespread puppet agent failures
[10:02:33] <volans>	 see for example https://puppetboard.wikimedia.org/report/analytics1070.eqiad.wmnet/daabf3a68ee5e983656387462f5253ff22d565d9
[10:03:21] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:03:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21119 and previous config saved to /var/cache/conftool/dbconfig/20220221-100737-root.json
[10:07:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:00] <wikibugs>	 (03PS1) 10Ayounsi: Add drmrs interco v6 PTRs [dns] - 10https://gerrit.wikimedia.org/r/764314
[10:08:04] <dcaro>	 volans: ack, looking
[10:08:25] <volans>	 thanks
[10:09:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21120 and previous config saved to /var/cache/conftool/dbconfig/20220221-100914-marostegui.json
[10:09:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:00] <wikibugs>	 (03PS1) 10David Caro: Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977
[10:10:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977 (owner: 10David Caro)
[10:11:24] <dcaro>	 I think that the issue is the biocLite.R file only, the rest is good to go
[10:12:13] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Update default prometheus-statsd-exporter version to 0.0.10 [puppet] - 10https://gerrit.wikimedia.org/r/762463 (https://phabricator.wikimedia.org/T300629) (owner: 10JMeybohm)
[10:14:11] <wikibugs>	 (03PS2) 10David Caro: Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977
[10:14:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977 (owner: 10David Caro)
[10:15:00] <wikibugs>	 (03PS3) 10David Caro: Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977
[10:15:17] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[10:15:20] <dcaro>	 volans: feel free to do a quick review, only restored the offending file (that was removed in the last patch)
[10:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:01] <volans>	 dcaro: ack, looking
[10:16:15] <volans>	 although I have zero context on those modules :)
[10:16:34] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[10:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:44] <wikibugs>	 (03PS1) 10Filippo Giunchedi: o11y: temp relax of LogstashIndexingFailures [alerts] - 10https://gerrit.wikimedia.org/r/764316 (https://phabricator.wikimedia.org/T288549)
[10:17:48] <wikibugs>	 (03CR) 10DCausse: flink-session-cluster: increase task manager mem limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/763754 (owner: 10DCausse)
[10:19:21] <wikibugs>	 (03PS1) 10David Caro: r_lang: remove unused biocLite.R [puppet] - 10https://gerrit.wikimedia.org/r/764318 (https://phabricator.wikimedia.org/T272559)
[10:20:06] <wikibugs>	 (03CR) 10David Caro: "Not sure if this is correct, but seems like it, @mpopov will know better." [puppet] - 10https://gerrit.wikimedia.org/r/764318 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro)
[10:20:54] <dcaro>	 volans: it was a cleanup that got an updated patch at the last minute removing that file (thinking that it was not used anymore), sent a followup patch removing the file and the entry and adding mpopov as reviewer (the person with the context)
[10:21:14] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, noop on PCC" [puppet] - 10https://gerrit.wikimedia.org/r/763977 (owner: 10David Caro)
[10:21:27] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977 (owner: 10David Caro)
[10:21:57] <dcaro>	 merged, should stop the errors
[10:22:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21121 and previous config saved to /var/cache/conftool/dbconfig/20220221-102241-root.json
[10:22:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:56] <volans>	 ack, thanks for the fix
[10:24:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21122 and previous config saved to /var/cache/conftool/dbconfig/20220221-102419-marostegui.json
[10:24:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:53] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10fgiunchedi)
[10:30:20] <Lucas_WMDE>	 !log Deployed patch for T302192
[10:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-staging2002.codfw.wmnet with OS bullseye
[10:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:15] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10Volans) That was by design, the parameters used are defined in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/debmonitor/+/refs/head...
[10:34:29] <wikibugs>	 10SRE, 10Traffic: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (10MMandere)
[10:34:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] o11y: temp relax of LogstashIndexingFailures [alerts] - 10https://gerrit.wikimedia.org/r/764316 (https://phabricator.wikimedia.org/T288549) (owner: 10Filippo Giunchedi)
[10:35:36] <jinxer-wm>	 (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[10:35:52] <wikibugs>	 (03PS1) 10Elukey: Add overlayfs settings for kubestage1004 [puppet] - 10https://gerrit.wikimedia.org/r/764322
[10:38:21] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: wikimediacz-l does not hold all posts for moderation - https://phabricator.wikimedia.org/T298729 (10MatthewVernon) p:05Triage→03Low a:03Ladsgroup
[10:39:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21123 and previous config saved to /var/cache/conftool/dbconfig/20220221-103924-marostegui.json
[10:39:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[10:39:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[10:39:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:30] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[10:39:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T300381)', diff saved to https://phabricator.wikimedia.org/P21124 and previous config saved to /var/cache/conftool/dbconfig/20220221-103931-marostegui.json
[10:39:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:47] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21125 and previous config saved to /var/cache/conftool/dbconfig/20220221-104247-kormat.json
[10:42:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:53] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[10:45:09] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/764314 (owner: 10Ayounsi)
[10:46:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet
[10:46:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:19] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10serviceops, 10User-Ladsgroup: wikimediacz-l does not hold all posts for moderation - https://phabricator.wikimedia.org/T298729 (10MatthewVernon)
[10:47:28] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10fgiunchedi) IMHO the client should fail faster since while running it will block dpkg/apt in such cases
[10:48:05] <wikibugs>	 (03PS1) 10Marostegui: replica_set.py: Change message [software] - 10https://gerrit.wikimedia.org/r/764323
[10:48:44] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage
[10:48:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1022.eqiad.wmnet
[10:51:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:39] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add overlayfs settings for kubestage1004 [puppet] - 10https://gerrit.wikimedia.org/r/764322 (owner: 10Elukey)
[10:53:34] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage1004.eqiad.wmnet with OS bullseye
[10:53:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:40] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[10:53:40] <wikibugs>	 10SRE: Options for creating internal (NDA-requiring) dashboards based on data from Google and Bing search consoles - https://phabricator.wikimedia.org/T298991 (10MatthewVernon) 05Open→03Stalled p:05Triage→03Low @AndyRussG I'm making this "Stalled", and "Low" priority for now, since I think really you are...
[10:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:14] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage
[10:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:59] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add drmrs interco v6 PTRs [dns] - 10https://gerrit.wikimedia.org/r/764314 (owner: 10Ayounsi)
[10:56:22] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] replica_set.py: Change message [software] - 10https://gerrit.wikimedia.org/r/764323 (owner: 10Marostegui)
[10:57:31] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] replica_set.py: Change message [software] - 10https://gerrit.wikimedia.org/r/764323 (owner: 10Marostegui)
[10:57:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1022.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[10:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:52] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P21126 and previous config saved to /var/cache/conftool/dbconfig/20220221-105752-kormat.json
[10:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:59] <wikibugs>	 (03Merged) 10jenkins-bot: replica_set.py: Change message [software] - 10https://gerrit.wikimedia.org/r/764323 (owner: 10Marostegui)
[10:59:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1022.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[10:59:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff)
[10:59:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:39] <wikibugs>	 (03CR) 10Jbond: conftool: add request-actions / request-patterns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto)
[11:03:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:05:32] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2002.codfw.wmnet with OS bullseye
[11:05:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[11:05:52] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:07:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: temp relax of LogstashIndexingFailures [alerts] - 10https://gerrit.wikimedia.org/r/764316 (https://phabricator.wikimedia.org/T288549) (owner: 10Filippo Giunchedi)
[11:08:30] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10Observability-Logging, 10Patch-For-Review: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (10fgiunchedi) I've bandaided the issue for now, though we should go back to a short `for` clause once the root cause is fixed
[11:08:48] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005362 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[11:09:04] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[11:09:46] <wikibugs>	 (03CR) 10Jbond: varnish/frontend: consume etcd data for dynamic banning of requests. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto)
[11:09:57] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1004.eqiad.wmnet with reason: host reimage
[11:10:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:36] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1004.eqiad.wmnet with reason: host reimage
[11:12:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:57] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P21127 and previous config saved to /var/cache/conftool/dbconfig/20220221-111256-kormat.json
[11:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:13:56] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:17:31] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[11:18:56] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:18:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet
[11:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:56] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:22:31] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[11:23:16] <wikibugs>	 (03PS6) 10Jbond: R:tlsproxy::localssl: Add cfssl support to tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/762535
[11:23:39] <_joe_>	 this calicodown is for staging, right, elukey, jayme?
[11:24:02] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1004.eqiad.wmnet with OS bullseye
[11:24:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet
[11:24:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:47] <_joe_>	 yeah if felix was really down, we'd see way more alerts firing
[11:24:56] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[11:25:36] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[11:26:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1012.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[11:26:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:24] <elukey>	 _joe_ yep it is me reimaging 1004
[11:27:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1012.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[11:27:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:02] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21128 and previous config saved to /var/cache/conftool/dbconfig/20220221-112801-kormat.json
[11:28:03] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[11:28:05] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[11:28:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff)
[11:28:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:07] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[11:28:09] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21129 and previous config saved to /var/cache/conftool/dbconfig/20220221-112809-kormat.json
[11:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:15] <wikibugs>	 (03PS1) 10Jbond: P:netbox: ltidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330
[11:28:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:31] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes-staging,service=kubesvc
[11:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:01] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] drmrs: Anycast tuning for Tata [homer/public] - 10https://gerrit.wikimedia.org/r/763696 (owner: 10Ayounsi)
[11:29:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:netbox: ltidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond)
[11:30:22] <wikibugs>	 (03Merged) 10jenkins-bot: drmrs: Anycast tuning for Tata [homer/public] - 10https://gerrit.wikimedia.org/r/763696 (owner: 10Ayounsi)
[11:33:06] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 77): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33868/console" [puppet] - 10https://gerrit.wikimedia.org/r/762535 (owner: 10Jbond)
[11:39:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300381)', diff saved to https://phabricator.wikimedia.org/P21130 and previous config saved to /var/cache/conftool/dbconfig/20220221-113950-marostegui.json
[11:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:57] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[11:40:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[11:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:07] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21131 and previous config saved to /var/cache/conftool/dbconfig/20220221-114307-kormat.json
[11:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:14] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[11:44:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[11:44:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[11:44:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:10] <wikibugs>	 (03PS2) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330
[11:48:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[11:48:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond)
[11:48:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21132 and previous config saved to /var/cache/conftool/dbconfig/20220221-115455-marostegui.json
[11:54:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:10] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp6010 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:55:24] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:55:34] <icinga-wm>	 PROBLEM - traffic_server tls process restarted on cp6014 is CRITICAL: 27 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6014&var-layer=tls
[11:55:38] <icinga-wm>	 PROBLEM - traffic_server tls process restarted on cp6010 is CRITICAL: 25 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6010&var-layer=tls
[11:55:50] <icinga-wm>	 PROBLEM - traffic_server backend process restarted on cp6014 is CRITICAL: 62 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6014&var-layer=backend
[11:55:56] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:57:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129 T301848', diff saved to https://phabricator.wikimedia.org/P21133 and previous config saved to /var/cache/conftool/dbconfig/20220221-115750-marostegui.json
[11:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:57] <stashbot>	 T301848: Check for compressed templatelinks tables - https://phabricator.wikimedia.org/T301848
[11:58:11] <marostegui>	 !log Rebuild templatelinks table on db1129 (s2) T301848
[11:58:12] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P21134 and previous config saved to /var/cache/conftool/dbconfig/20220221-115811-kormat.json
[11:58:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:38] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879)
[12:06:08] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] C:cassandra: add optional java_package variable [puppet] - 10https://gerrit.wikimedia.org/r/722599 (https://phabricator.wikimedia.org/T261966) (owner: 10Jbond)
[12:06:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet
[12:06:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:06] <icinga-wm>	 PROBLEM - traffic_server tls process restarted on cp6016 is CRITICAL: 9 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6016&var-layer=tls
[12:10:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21135 and previous config saved to /var/cache/conftool/dbconfig/20220221-120959-marostegui.json
[12:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet
[12:12:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:17] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P21136 and previous config saved to /var/cache/conftool/dbconfig/20220221-121316-kormat.json
[12:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1017.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[12:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:30] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10Michael.Hay) >>! In T301782#7720584, @MMandere wrote: > Thank you @JBennett for the approval. @Michael.Hay please sign the [[ https://phabricator.wikimedia.org/L3 | L3...
[12:16:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff)
[12:19:26] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6016 is CRITICAL: cluster=cache_text instance=cp6016 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016
[12:21:02] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:21:22] <wikibugs>	 (03PS1) 10Hnowlan: maps: disable kartotherian on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/764353 (https://phabricator.wikimedia.org/T301664)
[12:23:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:25:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300381)', diff saved to https://phabricator.wikimedia.org/P21137 and previous config saved to /var/cache/conftool/dbconfig/20220221-122504-marostegui.json
[12:25:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:10] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[12:27:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110', diff saved to https://phabricator.wikimedia.org/P21138 and previous config saved to /var/cache/conftool/dbconfig/20220221-122727-marostegui.json
[12:27:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:23] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21139 and previous config saved to /var/cache/conftool/dbconfig/20220221-122821-kormat.json
[12:28:24] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[12:28:26] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[12:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:33] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[12:28:34] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 4.68e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[12:28:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:52] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[12:30:17] <Lucas_WMDE>	 !log Deployed patch for T302215
[12:30:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:00] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[12:31:16] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[12:31:18] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6016 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016
[12:31:42] <icinga-wm>	 PROBLEM - Check systemd state on cp6016 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_intel_microcode.service,systemd-journald-audit.socket,systemd-journald-dev-log.socket,systemd-journald.service,systemd-journald.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21140 and previous config saved to /var/cache/conftool/dbconfig/20220221-123335-root.json
[12:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:34:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:34:44] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:35:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:35:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:36:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1017.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[12:36:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:52] <marostegui>	 !log Rebuild templatelinks table on db2077 (s7) T301848
[12:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:57] <stashbot>	 T301848: Check for compressed templatelinks tables - https://phabricator.wikimedia.org/T301848
[12:40:26] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[12:40:28] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[12:40:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:10] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[12:42:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[12:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21141 and previous config saved to /var/cache/conftool/dbconfig/20220221-124215-marostegui.json
[12:42:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:23] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[12:42:50] <wikibugs>	 (03PS1) 10Cathal Mooney: Add per-subnet netboot conf files for new row E-F subnets in Eqiad [puppet] - 10https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758)
[12:45:42] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[12:45:43] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] Add per-subnet netboot conf files for new row E-F subnets in Eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney)
[12:48:04] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[12:48:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21142 and previous config saved to /var/cache/conftool/dbconfig/20220221-124839-root.json
[12:48:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:14] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6014 is CRITICAL: cluster=cache_text instance=cp6014 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[12:52:57] <wikibugs>	 (03PS2) 10Cathal Mooney: Add per-subnet netboot conf files for new row E-F subnets in Eqiad [puppet] - 10https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758)
[12:53:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21143 and previous config saved to /var/cache/conftool/dbconfig/20220221-125303-marostegui.json
[12:53:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:10] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[12:53:20] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[12:53:22] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[12:53:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:27] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21144 and previous config saved to /var/cache/conftool/dbconfig/20220221-125326-kormat.json
[12:53:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:33] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[12:55:06] <wikibugs>	 (PS3) Cathal Mooney: Add per-subnet netboot conf files for new row E-F subnets in Eqiad [puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758)
[12:56:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1009.eqiad.wmnet
[12:56:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1009.eqiad.wmnet
[13:00:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:35] <wikibugs>	 (PS6) Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956)
[13:02:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1009.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[13:02:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:47] <wikibugs>	 SRE-swift-storage, User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (fgiunchedi)
[13:03:28] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (MoritzMuehlenhoff)
[13:03:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21145 and previous config saved to /var/cache/conftool/dbconfig/20220221-130343-root.json
[13:03:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1009.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[13:04:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:56] <moritzm>	 !log rebalance ganeti row_C (add nodes reimaged in there) T296721
[13:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:01] <stashbot>	 T296721: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721
[13:08:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P21146 and previous config saved to /var/cache/conftool/dbconfig/20220221-130808-marostegui.json
[13:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:21] <wikibugs>	 (CR) Hnowlan: [C: +2] api-gateway: allow discovery services to set custom rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan)
[13:10:30] <wikibugs>	 (CR) Cathal Mooney: "Thanks Majavah fixed." [puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[13:11:50] <wikibugs>	 (Merged) jenkins-bot: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan)
[13:14:23] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21147 and previous config saved to /var/cache/conftool/dbconfig/20220221-131423-kormat.json
[13:14:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:29] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[13:15:14] <wikibugs>	 SRE, DC-Ops, serviceops: setup/install mc20[38-55] - https://phabricator.wikimedia.org/T302218 (akosiaris)
[13:16:37] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014
[13:18:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21148 and previous config saved to /var/cache/conftool/dbconfig/20220221-131846-root.json
[13:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:12] <wikibugs>	 (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33869/" [puppet] - https://gerrit.wikimedia.org/r/763748 (owner: Ayounsi)
[13:23:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P21149 and previous config saved to /var/cache/conftool/dbconfig/20220221-132313-marostegui.json
[13:23:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:55] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] k8s: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763821 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite)
[13:25:04] <wikibugs>	 (PS1) Krinkle: Increase logging of DBConnection messages from 'error' to 'warning' [mediawiki-config] - https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451)
[13:25:26] <wikibugs>	 SRE, Anti-Harassment, DBA: Error Unknown column  ipb_sitewide in field list on query - https://phabricator.wikimedia.org/T208462 (DonPaolo) I upgraded to 1.37 from 1.31, and I got the error of ipb_sitewide missing.  I had to manually run "ALTER TABLE  ipblocks   ADD ipb_sitewide bool NOT NULL default...
[13:29:28] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P21150 and previous config saved to /var/cache/conftool/dbconfig/20220221-132928-kormat.json
[13:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:21] <wikibugs>	 (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33870/alert1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/763750 (owner: Ayounsi)
[13:31:42] <wikibugs>	 (PS2) Ayounsi: Disable Junos alarms check by default [puppet] - https://gerrit.wikimedia.org/r/763750
[13:31:52] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[13:33:47] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add per-subnet netboot conf files for new row E-F subnets in Eqiad [puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[13:33:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21151 and previous config saved to /var/cache/conftool/dbconfig/20220221-133350-root.json
[13:33:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21152 and previous config saved to /var/cache/conftool/dbconfig/20220221-133818-marostegui.json
[13:38:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[13:38:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[13:38:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:24] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[13:38:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:42] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Hannah Okwelum - https://phabricator.wikimedia.org/T302212 (Atieno) Hello.  This is approved from my end.  Cheers.
[13:43:16] <wikibugs>	 SRE, DC-Ops, cloud-services-team (Kanban): Supporting new hardware in older debian releases - https://phabricator.wikimedia.org/T301162 (MatthewVernon) p:Triage→Medium
[13:44:33] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P21153 and previous config saved to /var/cache/conftool/dbconfig/20220221-134433-kormat.json
[13:44:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:57] <wikibugs>	 (CR) JMeybohm: [C: +1] "This breaks down to an additional 6 CPUs (in limits) for cp-jobqueue (just FTR) - should (still ;-)) be fine" [deployment-charts] - https://gerrit.wikimedia.org/r/762418 (https://phabricator.wikimedia.org/T300914) (owner: Hnowlan)
[13:45:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[13:45:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[13:45:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T300381)', diff saved to https://phabricator.wikimedia.org/P21154 and previous config saved to /var/cache/conftool/dbconfig/20220221-134542-marostegui.json
[13:45:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:50] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[13:48:26] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/763792 (https://phabricator.wikimedia.org/T227782) (owner: Bearloga)
[13:49:05] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1093.eqiad.wmnet with OS bullseye
[13:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye
[13:52:22] <wikibugs>	 (PS1) Muehlenhoff: Retire ganeti216 option [puppet] - https://gerrit.wikimedia.org/r/764363
[13:53:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] Retire ganeti216 option [puppet] - https://gerrit.wikimedia.org/r/764363 (owner: Muehlenhoff)
[13:54:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300381)', diff saved to https://phabricator.wikimedia.org/P21156 and previous config saved to /var/cache/conftool/dbconfig/20220221-135417-marostegui.json
[13:54:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:24] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[13:58:57] <wikibugs>	 (PS1) Ayounsi: Icinga/netops re-organize devices [puppet] - https://gerrit.wikimedia.org/r/764367
[13:59:38] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21158 and previous config saved to /var/cache/conftool/dbconfig/20220221-135937-kormat.json
[13:59:39] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[13:59:41] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[13:59:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:44] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[13:59:45] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21159 and previous config saved to /var/cache/conftool/dbconfig/20220221-135945-kormat.json
[13:59:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:07] <wikibugs>	 (PS2) Muehlenhoff: ganeti: Retire ganeti216 option [puppet] - https://gerrit.wikimedia.org/r/764363
[14:00:34] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to gerritadmin for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T301579 (MatthewVernon) Open→Resolved a:MatthewVernon Hi, I've done the LDAP change; it'll take an hour for the cache on gerrit to clear [I'm not the right flavour of admin...
[14:00:58] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage
[14:01:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:03] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[14:05:32] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage
[14:05:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:20] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Traffic, HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on February 16, 2022. - https://phabricator.wikimedia.org/T301995 (MatthewVernon) Open→Resolved a:MatthewVernon [I think this task can be closed, since the issue was resolve...
[14:06:21] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 7074 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[14:06:28] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (MatthewVernon)
[14:08:06] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/764363 (owner: Muehlenhoff)
[14:08:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21160 and previous config saved to /var/cache/conftool/dbconfig/20220221-140831-root.json
[14:08:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P21161 and previous config saved to /var/cache/conftool/dbconfig/20220221-140922-marostegui.json
[14:09:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:09:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:09:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:55] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[14:09:59] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[14:10:10] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (MatthewVernon) Does this still need #WMF-NDA-Requests tagging in it? It means it appears in the Clinic Duty dashboard, which is probably not what w...
[14:11:02] <wikibugs>	 (CR) Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: Eigyan)
[14:16:19] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[14:17:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[14:18:14] <wikibugs>	 (CR) JMeybohm: [C: -1] "I did try to parse the templates manually based on the data in https://phabricator.wikimedia.org/P21048 and came up with the same the same" [puppet] - https://gerrit.wikimedia.org/r/763557 (owner: Giuseppe Lavagetto)
[14:19:14] <godog>	 the thanos rule alert is me
[14:19:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21162 and previous config saved to /var/cache/conftool/dbconfig/20220221-141931-root.json
[14:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:52] <godog>	 hah, the icinga configuration is busted but not sure exactly why
[14:21:53] <godog>	 Error: 'lsw1-e3-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'elastic1093' (file '/etc/icinga/objects/puppet_hosts.cfg', line 21621)!
[14:22:01] <godog>	 cc XioNoX ^ perhaps ?
[14:22:19] <XioNoX>	 I think I know
[14:22:27] <XioNoX>	 cc topranks 
[14:22:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[14:22:29] <moritzm>	 !log installing twisted security updates
[14:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:53] <godog>	 ah yeah that'd make sense, thanks
[14:22:54] <wikibugs>	 (PS3) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[14:23:08] <XioNoX>	 there is some automation to define icinga parents automatically based on LLDP
[14:23:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[14:23:35] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33871/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[14:23:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21163 and previous config saved to /var/cache/conftool/dbconfig/20220221-142337-root.json
[14:23:40] <topranks>	 godog:  thanks yep that does make sense.
[14:23:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:13] <topranks>	 I was testing imaging that host, but yes its "parent" switch isn't in monitoring, causing this
[14:24:22] <topranks>	 sorry hadn't anticipated the issue, let me try to sort it out.
[14:24:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P21164 and previous config saved to /var/cache/conftool/dbconfig/20220221-142426-marostegui.json
[14:24:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:41] <godog>	 sure no worries, LMK if I can help topranks 
[14:24:48] <XioNoX>	 topranks: https://github.com/wikimedia/puppet/blob/production/hieradata/common/monitoring.yaml and https://github.com/wikimedia/puppet/blob/production/modules/netops/manifests/monitoring.pp
[14:25:27] <topranks>	 cool thanks XioNoX, yeah getting close to the time to add them there.
[14:25:40] <XioNoX>	 topranks: and I sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/764367 to re-organize the monitoring file
[14:25:56] <XioNoX>	 it should make it easier for you to add your devices
[14:26:05] <wikibugs>	 (PS4) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[14:26:29] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/764367 (owner: Ayounsi)
[14:26:39] <topranks>	 Ok yeah looks like it will thanks.
[14:26:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[14:26:52] <wikibugs>	 (PS2) Gehel: cirrus: Alert when the rate of pages fixed by Saneitizer is too high [puppet] - https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) (owner: Ebernhardson)
[14:28:05] <wikibugs>	 (PS3) Muehlenhoff: ganeti: Retire ganeti216 option [puppet] - https://gerrit.wikimedia.org/r/764363
[14:29:32] <wikibugs>	 (PS5) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[14:30:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[14:30:32] <godog>	 topranks: just for expectations' sake, do you have an approximate ETA for the fix? I'm asking because unfortunately an invalid icinga config blocks all other changes to its config too :(
[14:31:37] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/764363 (owner: Muehlenhoff)
[14:32:15] <XioNoX>	 godog: can that host be forced out of icinga for now?
[14:32:57] <topranks>	 I wasn't 100% sure what to do with it.  The reimage failed anyway, or looks like it will.
[14:33:09] <topranks>	 [54/120, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Unable to get uptime for elastic1093.eqiad.wmnet
[14:33:19] <topranks>	 ^^ current state.
[14:33:39] <topranks>	 So maybe I can just cancel and run decommission and then re-try again once switches have been added to mgmt?
[14:33:55] <volans>	 topranks: have you checked the console? with install_console
[14:34:15] <topranks>	 no I can certainly try that
[14:34:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21165 and previous config saved to /var/cache/conftool/dbconfig/20220221-143435-root.json
[14:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:49] <volans>	 that's the reboot *after* the first puppet run
[14:34:53] <volans>	 so the host should come back
[14:35:01] <wikibugs>	 (CR) Ayounsi: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33873/" [puppet] - https://gerrit.wikimedia.org/r/764367 (owner: Ayounsi)
[14:35:23] <godog>	 XioNoX: mmhh not as selectively let's say, but if it is e.g. deactivated from puppetdb it won't show up in icinga
[14:35:36] <jinxer-wm>	 (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[14:36:40] <topranks>	 volans: host did reboot alright, console is sitting at login prompt.
[14:36:58] <topranks>	 when I try to run "install_console" I'm getting prompted for a pw though
[14:37:02] <topranks>	 https://www.irccloud.com/pastebin/JAeatIDm/
[14:37:08] <topranks>	 Is that normal?
[14:37:27] <volans>	 depends, if puppet ran successfully yes
[14:37:32] <volans>	 the key gets removed
[14:37:50] <topranks>	 ah ok yeah, root pw worked fine yep.
[14:37:54] <volans>	 is valid only during the first installation
[14:38:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21166 and previous config saved to /var/cache/conftool/dbconfig/20220221-143841-root.json
[14:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:21] <volans>	 topranks: cumin can't reach it
[14:39:30] <topranks>	 yeah it's trying the v6 address and failing
[14:39:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300381)', diff saved to https://phabricator.wikimedia.org/P21167 and previous config saved to /var/cache/conftool/dbconfig/20220221-143931-marostegui.json
[14:39:33] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[14:39:34] <volans>	 or actually is very slow
[14:39:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[14:39:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:37] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[14:39:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:46] <topranks>	 https://www.irccloud.com/pastebin/KaX4i6Tu/
[14:40:13] <volans>	 why is v6 not working?
[14:40:26] <topranks>	 The v6 address is not configured on the host, why I do not know.
[14:40:28] <volans>	 my understanding is that the check for uptime is timing out
[14:41:14] <topranks>	 yeah from the error message that's what it looks like
[14:41:23] <topranks>	 I assume 'cause of this v6 thing.
[14:41:42] <topranks>	 Device's v6 IP is not defined in /etc/network/interfaces
[14:41:49] <wikibugs>	 (PS6) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[14:42:01] <topranks>	 It is configured to add a link local: 	up ip addr add fe80::10:64:132:2/64 dev enp59s0f0np0
[14:42:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[14:42:45] <volans>	 it should have https://netbox.wikimedia.org/ipam/ip-addresses/10174/ too
[14:43:27] <topranks>	 Ok.  So the debian installer didn't create the "interfaces" file right for some reason.
[14:44:10] <volans>	 apparently so
[14:45:22] <volans>	 the timeout is set to 10s, which for cat /proc/uptime seems an eternity
[14:45:28] <volans>	 yet it manages to trigger it
[14:45:40] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: default Host header to SNI [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946)
[14:45:47] <volans>	 on one hand that's actually good, since we did notice the issue
[14:46:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] prometheus: default Host header to SNI [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[14:46:43] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 4.865e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[14:46:45] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[14:47:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[14:47:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[14:47:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T300381)', diff saved to https://phabricator.wikimedia.org/P21168 and previous config saved to /var/cache/conftool/dbconfig/20220221-144707-marostegui.json
[14:47:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:15] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[14:47:28] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: default Host header to SNI [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946)
[14:47:37] <topranks>	 Ah I may know what's up.
[14:48:17] <topranks>	 Our IPv6 allocation on servers is dependent on the device already having gotten an address using SLAAC?
[14:49:09] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[14:49:13] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[14:49:20] <topranks>	 Which we then extract the prefix from and set the v6.
[14:49:32] <topranks>	 Right that's an issue - I didn't have these new switches set up to do SLAAC.
[14:49:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21169 and previous config saved to /var/cache/conftool/dbconfig/20220221-144938-root.json
[14:49:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:24] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33878/console" [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[14:52:02] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1093.eqiad.wmnet with OS bullseye
[14:52:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye executed...
[14:53:38] <volans>	 topranks: ack
[14:53:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21170 and previous config saved to /var/cache/conftool/dbconfig/20220221-145345-root.json
[14:53:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:01] <volans>	 if you need to run the reimage you can just re-run it removing the --new option, no need for decom
[14:54:26] <topranks>	 ah ok good tip thanks.
[14:55:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300381)', diff saved to https://phabricator.wikimedia.org/P21171 and previous config saved to /var/cache/conftool/dbconfig/20220221-145556-marostegui.json
[14:56:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:02] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[15:00:05] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21172 and previous config saved to /var/cache/conftool/dbconfig/20220221-150004-kormat.json
[15:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:10] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[15:00:59] <wikibugs>	 (PS7) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[15:01:16] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Hannah Okwelum - https://phabricator.wikimedia.org/T302212 (MatthewVernon) Open→Resolved a:MatthewVernon Hi, I've done this. Regards, Matthew
[15:01:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[15:03:01] <wikibugs>	 (CR) Hnowlan: [C: +2] changeprop-jobqueue: more replicas, less CPU [deployment-charts] - https://gerrit.wikimedia.org/r/762418 (https://phabricator.wikimedia.org/T300914) (owner: Hnowlan)
[15:03:39] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 4.678e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[15:03:43] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[15:04:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21173 and previous config saved to /var/cache/conftool/dbconfig/20220221-150442-root.json
[15:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:05] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[15:06:09] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[15:06:27] <wikibugs>	 SRE, serviceops: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (Joe) a:Joe
[15:06:54] <wikibugs>	 (Merged) jenkins-bot: changeprop-jobqueue: more replicas, less CPU [deployment-charts] - https://gerrit.wikimedia.org/r/762418 (https://phabricator.wikimedia.org/T300914) (owner: Hnowlan)
[15:07:23] <wikibugs>	 SRE, SRE-Access-Requests, SRE-OnFire, WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (MatthewVernon) @Zabe can I confirm you've been in touch with legal directly?
[15:08:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21174 and previous config saved to /var/cache/conftool/dbconfig/20220221-150848-root.json
[15:08:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:10] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[15:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[15:09:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:36] <wikibugs>	 (PS1) Elukey: Add new k8s partman recipe to ml-serve[12]00[1-4] nodes [puppet] - https://gerrit.wikimedia.org/r/764374
[15:10:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
[15:10:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P21175 and previous config saved to /var/cache/conftool/dbconfig/20220221-151101-marostegui.json
[15:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
[15:11:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:17] <wikibugs>	 SRE, serviceops: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (Joe) I think this is the old etcd certificate we used to use for etcd in codfw; since we've moved to etcd v3 we're using a new cert created with cergen:  ` $ openssl s_client -host conf2004.codfw.wmnet -p...
[15:14:27] <wikibugs>	 (CR) Elukey: [C: +2] Add new k8s partman recipe to ml-serve[12]00[1-4] nodes [puppet] - https://gerrit.wikimedia.org/r/764374 (owner: Elukey)
[15:15:09] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21176 and previous config saved to /var/cache/conftool/dbconfig/20220221-151509-kormat.json
[15:15:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:51] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: default Host header to SNI [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946)
[15:19:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21177 and previous config saved to /var/cache/conftool/dbconfig/20220221-151945-root.json
[15:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:39] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[15:21:29] <wikibugs>	 SRE: Options for creating internal (NDA-requiring) dashboards based on data from Google and Bing search consoles - https://phabricator.wikimedia.org/T298991 (AndyRussG) >>! In T298991#7725178, @MatthewVernon wrote: > @AndyRussG I'm making this "Stalled", and "Low" priority for now, since I think really you a...
[15:21:39] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33882/console" [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[15:23:05] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[15:24:12] <wikibugs>	 (PS4) Filippo Giunchedi: prometheus: default Host header to SNI [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946)
[15:24:43] <wikibugs>	 (PS1) Elukey: Add overlayfs settings for ml-serve2001 [puppet] - https://gerrit.wikimedia.org/r/764376
[15:24:45] <wikibugs>	 (PS1) Elukey: Add overlayfs settings for ml-serve2002 [puppet] - https://gerrit.wikimedia.org/r/764377
[15:24:47] <wikibugs>	 (PS1) Elukey: Add overlayfs settings for ml-serve2003 [puppet] - https://gerrit.wikimedia.org/r/764378
[15:24:49] <wikibugs>	 (PS1) Elukey: Add overlayfs settings to ml-serve2004 [puppet] - https://gerrit.wikimedia.org/r/764379
[15:25:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] prometheus: default Host header to SNI [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[15:25:36] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[15:26:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P21178 and previous config saved to /var/cache/conftool/dbconfig/20220221-152606-marostegui.json
[15:26:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:19] <wikibugs>	 (CR) Giuseppe Lavagetto: Add *.k8s-staging.discovery.wmnet (1 comment) [dns] - https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm)
[15:26:58] <wikibugs>	 (CR) Elukey: [C: +2] Add overlayfs settings for ml-serve2001 [puppet] - https://gerrit.wikimedia.org/r/764376 (owner: Elukey)
[15:28:30] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Add *.k8s-staging.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm)
[15:28:41] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2001.codfw.wmnet with OS bullseye
[15:28:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:44] <wikibugs>	 (PS2) Vgutierrez: prometheus: Aggregation rules for HAProxy TTFB [puppet] - https://gerrit.wikimedia.org/r/763566 (https://phabricator.wikimedia.org/T290005)
[15:30:14] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21179 and previous config saved to /var/cache/conftool/dbconfig/20220221-153013-kormat.json
[15:30:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:50] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@ed5c9f9]: Deploy Aqs Hourly for Airflow [analytics/refinery@ed5c9f9]
[15:30:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:16] <wikibugs>	 (CR) JMeybohm: "@Brandon: Would you mind taking a look if that's something you think is okay to do?" [dns] - https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm)
[15:34:37] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:35:09] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[15:35:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[15:35:51] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:36:06] <wikibugs>	 SRE, SRE-Access-Requests, SRE-OnFire, WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (Zabe) >>! In T302163#7725826, @MatthewVernon wrote: > @Zabe can I confirm you've been in touch with legal directly?  Yes, I have sent an email to leg...
[15:39:57] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[15:41:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300381)', diff saved to https://phabricator.wikimedia.org/P21180 and previous config saved to /var/cache/conftool/dbconfig/20220221-154110-marostegui.json
[15:41:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[15:41:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[15:41:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:18] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[15:41:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T300381)', diff saved to https://phabricator.wikimedia.org/P21181 and previous config saved to /var/cache/conftool/dbconfig/20220221-154118-marostegui.json
[15:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: Aggregation rules for HAProxy TTFB [puppet] - https://gerrit.wikimedia.org/r/763566 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[15:41:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:47] <wikibugs>	 (CR) Vgutierrez: [C: +2] prometheus: Aggregation rules for HAProxy TTFB [puppet] - https://gerrit.wikimedia.org/r/763566 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[15:44:02] <wikibugs>	 SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (TheresNoTime)
[15:45:19] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21182 and previous config saved to /var/cache/conftool/dbconfig/20220221-154518-kormat.json
[15:45:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:24] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[15:45:24] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[15:45:25] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2001.codfw.wmnet with reason: host reimage
[15:45:25] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[15:45:26] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance
[15:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:34] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance
[15:45:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:50] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2001.codfw.wmnet with reason: host reimage
[15:47:51] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (Lucas_Werkmeister_WMDE) I support this access request, and will be happy to provide assistance to @TheresNoTime if needed. 👍
[15:47:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:34] <wikibugs>	 (PS5) Filippo Giunchedi: prometheus: default Host header to SNI [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946)
[15:50:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300381)', diff saved to https://phabricator.wikimedia.org/P21183 and previous config saved to /var/cache/conftool/dbconfig/20220221-155034-marostegui.json
[15:50:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:40] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[15:51:55] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33886/console" [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[15:52:05] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[15:52:13] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@ed5c9f9]: Deploy Aqs Hourly for Airflow [analytics/refinery@ed5c9f9] (duration: 21m 23s)
[15:52:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:30] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[15:58:45] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 92, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:59:19] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[15:59:20] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[15:59:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:25] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T300774)', diff saved to https://phabricator.wikimedia.org/P21184 and previous config saved to /var/cache/conftool/dbconfig/20220221-155924-kormat.json
[15:59:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:34] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[15:59:49] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 125, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:00:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[16:01:09] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33887/console" [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[16:01:17] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2001.codfw.wmnet with OS bullseye
[16:01:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:51] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: default Host header to SNI [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[16:01:53] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1093.eqiad.wmnet with OS bullseye
[16:01:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye
[16:03:15] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=ml-serve,service=kubesvc
[16:03:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:28] <elukey>	 mmmm
[16:04:21] <elukey>	 ah ml_serve uff
[16:04:35] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=ml_serve,service=kubesvc
[16:04:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:14] <godog>	 ah the hostname is ml-serve but the cluster is ml_serve ? /o\
[16:05:34] <godog>	 can we fix it or is it already too late?
[16:05:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P21185 and previous config saved to /var/cache/conftool/dbconfig/20220221-160538-marostegui.json
[16:05:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:45] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-serve200[5-8].codfw.wmnet
[16:05:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:28] <elukey>	 godog: I don't recall why it was done in that way
[16:07:10] <godog>	 yeah IIRC there's nothing wrong with dashes in cluster name
[16:07:39] <godog>	 maybe I'm misremembering though
[16:08:08] <godog>	 mmhh no should be fine, we have wdqs-test for example
[16:08:38] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300774)', diff saved to https://phabricator.wikimedia.org/P21186 and previous config saved to /var/cache/conftool/dbconfig/20220221-160838-kormat.json
[16:08:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:45] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[16:09:23] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[16:09:42] <wikibugs>	 (Abandoned) Hashar: Remove bot humors for deployers [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/734964 (owner: Hashar)
[16:10:04] <elukey>	 godog: I think that it was named after the cluster in wikimedia_clusters
[16:10:09] <elukey>	 that is named ml_serve
[16:10:10] <elukey>	 mmmm
[16:10:20] <elukey>	 so we should rename the conftool config right?
[16:10:27] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[16:10:47] <icinga-wm>	 PROBLEM - ganeti-mond running on ganeti1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[16:11:20] <godog>	 elukey: I believe so, the conftool bits and also the "cluster" variable in wikimedia_clusters I think would be nice if they matched
[16:11:52] <godog>	 not the end of the world all things considered but one of those friction points that compounds
[16:12:22] <elukey>	 I see it can be possible to use - and _
[16:12:39] <elukey>	 so they are currently matching though, ml_serve
[16:12:48] <elukey>	 the first conftool action I think was a no-op 
[16:12:50] <elukey>	 (my bad)
[16:14:02] <godog>	 yeah they are consistent between each other but not with the hostnames ml-serve
[16:14:13] <elukey>	 okok
[16:14:36] <elukey>	 I can try to work on it, hope that it will not break too many things
[16:15:30] <godog>	 if you can/want I think it'll pay off, if not that's fine too
[16:15:42] <godog>	 I don't want to jinx but I think most/all things should DTRT
[16:15:59] <elukey>	 okok I'll try, the scary part is pybal but hopefully it should work
[16:16:17] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:18] <wikibugs>	 (CR) Elukey: [C: +2] Add overlayfs settings for ml-serve2002 [puppet] - https://gerrit.wikimedia.org/r/764377 (owner: Elukey)
[16:16:21] <wikibugs>	 (PS1) Hnowlan: api-gateway: bump chart [deployment-charts] - https://gerrit.wikimedia.org/r/764409 (https://phabricator.wikimedia.org/T295956)
[16:17:03] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage
[16:17:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:03] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] beta: Allow opening the alpha NewLexeme special page on beta-wikidatawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/763703 (https://phabricator.wikimedia.org/T301234) (owner: Michael Große)
[16:18:07] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2002.codfw.wmnet with OS bullseye
[16:18:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P21187 and previous config saved to /var/cache/conftool/dbconfig/20220221-162043-marostegui.json
[16:20:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:30] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage
[16:21:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:43] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21188 and previous config saved to /var/cache/conftool/dbconfig/20220221-162342-kormat.json
[16:23:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:15] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:24:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[16:25:27] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:27:03] <icinga-wm>	 RECOVERY - ganeti-confd running on ganeti1005 is OK: PROCS OK: 1 process with UID = 111 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[16:27:27] <icinga-wm>	 RECOVERY - ganeti-mond running on ganeti1005 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[16:28:17] <icinga-wm>	 RECOVERY - ganeti-noded running on ganeti1005 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[16:30:49] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1093.eqiad.wmnet with OS bullseye
[16:30:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye complete...
[16:31:06] <volans>	 topranks: yay it worked this time :D
[16:31:31] <topranks>	 haha... still looking at the logs afraid to say that :)
[16:31:37] <topranks>	 But yes appears to have worked fine :)
[16:31:39] <topranks>	 woohoo!
[16:32:02] <volans>	 :)
[16:34:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[16:34:44] <godog>	 very nice!
[16:34:51] <wikibugs>	 (CR) Hnowlan: [C: +2] api-gateway: bump chart [deployment-charts] - https://gerrit.wikimedia.org/r/764409 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan)
[16:34:56] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[16:35:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: host reimage
[16:35:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300381)', diff saved to https://phabricator.wikimedia.org/P21189 and previous config saved to /var/cache/conftool/dbconfig/20220221-163548-marostegui.json
[16:35:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[16:35:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[16:35:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:54] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[16:35:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T300381)', diff saved to https://phabricator.wikimedia.org/P21190 and previous config saved to /var/cache/conftool/dbconfig/20220221-163555-marostegui.json
[16:35:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:47] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 4.592e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[16:36:51] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[16:36:56] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@ed5c9f9] (thin): Deploy Aqs Hourly for Airflow THIN [analytics/refinery@ed5c9f9]
[16:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:03] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@ed5c9f9] (thin): Deploy Aqs Hourly for Airflow THIN [analytics/refinery@ed5c9f9] (duration: 00m 07s)
[16:37:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:37] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@ed5c9f9] (hadoop-test): Deploy Aqs Hourly for Airflow THIN [analytics/refinery@ed5c9f9]
[16:37:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:13] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: host reimage
[16:38:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:38] <wikibugs>	 (Merged) jenkins-bot: api-gateway: bump chart [deployment-charts] - https://gerrit.wikimedia.org/r/764409 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan)
[16:38:48] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21191 and previous config saved to /var/cache/conftool/dbconfig/20220221-163847-kormat.json
[16:38:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:15] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[16:39:16] <wikibugs>	 (CR) Klausman: "Does this supersede the other change? It only edits the 2003 yaml file which is deleted here." [puppet] - https://gerrit.wikimedia.org/r/764379 (owner: Elukey)
[16:39:19] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[16:39:56] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[16:40:11] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[16:41:41] <wikibugs>	 (CR) Elukey: Add overlayfs settings to ml-serve2004 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/764379 (owner: Elukey)
[16:43:02] <wikibugs>	 (CR) Klausman: [C: +1] Add overlayfs settings to ml-serve2004 [puppet] - https://gerrit.wikimedia.org/r/764379 (owner: Elukey)
[16:43:19] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33888/console" [puppet] - https://gerrit.wikimedia.org/r/764379 (owner: Elukey)
[16:43:58] <wikibugs>	 SRE: Issue installing ca-certificates-java openjdk 11 - https://phabricator.wikimedia.org/T300300 (colewhite)
[16:44:34] <wikibugs>	 (PS8) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[16:44:49] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@ed5c9f9] (hadoop-test): Deploy Aqs Hourly for Airflow THIN [analytics/refinery@ed5c9f9] (duration: 07m 12s)
[16:44:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[16:46:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300381)', diff saved to https://phabricator.wikimedia.org/P21192 and previous config saved to /var/cache/conftool/dbconfig/20220221-164608-marostegui.json
[16:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:15] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[16:47:58] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync
[16:48:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:17] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync
[16:48:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:49] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 125, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:50:11] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[16:50:42] <wikibugs>	 (PS9) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[16:50:59] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 92, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:51:42] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2002.codfw.wmnet with OS bullseye
[16:51:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[16:53:52] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300774)', diff saved to https://phabricator.wikimedia.org/P21193 and previous config saved to /var/cache/conftool/dbconfig/20220221-165352-kormat.json
[16:53:53] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[16:53:54] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[16:53:56] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[16:53:57] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[16:54:01] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[16:54:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:02] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[16:54:05] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T300774)', diff saved to https://phabricator.wikimedia.org/P21194 and previous config saved to /var/cache/conftool/dbconfig/20220221-165405-kormat.json
[16:54:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:17] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300774)', diff saved to https://phabricator.wikimedia.org/P21195 and previous config saved to /var/cache/conftool/dbconfig/20220221-165616-kormat.json
[16:56:21] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[16:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:51] <wikibugs>	 (CR) Elukey: [C: +2] Add overlayfs settings for ml-serve2003 [puppet] - https://gerrit.wikimedia.org/r/764378 (owner: Elukey)
[16:59:49] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2003.codfw.wmnet with OS bullseye
[16:59:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P21196 and previous config saved to /var/cache/conftool/dbconfig/20220221-170113-marostegui.json
[17:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:30] <wikibugs>	 (PS2) Andrew Bogott: nfs add_server: disable nfs mounts for new nfs servers [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/763955
[17:02:40] <wikibugs>	 (PS2) Elukey: Add overlayfs settings to ml-serve2004 [puppet] - https://gerrit.wikimedia.org/r/764379
[17:02:42] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (JMeybohm)
[17:02:47] <wikibugs>	 (PS10) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[17:03:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[17:03:50] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33892/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[17:05:33] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:06:25] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@f1244e0]: Migrate aqs/hourly from Oozie|Hive to Airflow|Spark
[17:06:26] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[17:06:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:33] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@f1244e0]: Migrate aqs/hourly from Oozie|Hive to Airflow|Spark (duration: 00m 07s)
[17:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:47] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:07:08] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Move service name openstack.eqiad1.wikimediacloud.org to cloudcontrol1005 [dns] - https://gerrit.wikimedia.org/r/763805 (https://phabricator.wikimedia.org/T281276) (owner: Andrew Bogott)
[17:07:19] <wikibugs>	 (PS2) Andrew Bogott: Move service name openstack.eqiad1.wikimediacloud.org to cloudcontrol1005 [dns] - https://gerrit.wikimedia.org/r/763805 (https://phabricator.wikimedia.org/T281276)
[17:08:09] <wikibugs>	 (PS11) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[17:09:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[17:09:11] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33893/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[17:10:17] <wikibugs>	 SRE, Product-Infrastructure-Team-Backlog, serviceops, Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (akosiaris) >>! In T274388#7722815, @MSantos wrote: > @akosiaris and @jijiki how can we move forward with this? >  > For context:  > - [[...
[17:10:29] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] run_ci_locally.sh: add podman support [puppet] - https://gerrit.wikimedia.org/r/763807 (owner: JHathaway)
[17:11:22] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P21197 and previous config saved to /var/cache/conftool/dbconfig/20220221-171121-kormat.json
[17:11:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:41] <wikibugs>	 (PS12) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[17:14:49] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33894/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[17:14:53] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[17:16:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P21198 and previous config saved to /var/cache/conftool/dbconfig/20220221-171618-marostegui.json
[17:16:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:24] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (Ladsgroup) I also support this request, TNT had production access before and trusted and has been instrumental in lot of work in incidents and any area possible. So much <3 for her.
[17:16:26] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[17:16:49] <elukey>	 this is me reimaging --^
[17:16:53] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2003.codfw.wmnet with reason: host reimage
[17:16:56] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[17:16:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:58] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2003.codfw.wmnet with reason: host reimage
[17:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:08] <wikibugs>	 (PS1) Andrew Bogott: Revert "Move service name openstack.eqiad1.wikimediacloud.org to cloudcontrol1005" [dns] - https://gerrit.wikimedia.org/r/764421
[17:20:35] <wikibugs>	 (PS10) Cathal Mooney: Base config additions and updated templates to configure EVPN ASW [homer/public] - https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758)
[17:20:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] Base config additions and updated templates to configure EVPN ASW [homer/public] - https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[17:21:56] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[17:22:56] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[17:26:08] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1003.wikimedia.org with OS bullseye
[17:26:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:26] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P21199 and previous config saved to /var/cache/conftool/dbconfig/20220221-172626-kormat.json
[17:26:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:22] <wikibugs>	 (PS13) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[17:30:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[17:30:41] <wikibugs>	 (PS14) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[17:31:09] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 125, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:31:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300381)', diff saved to https://phabricator.wikimedia.org/P21200 and previous config saved to /var/cache/conftool/dbconfig/20220221-173122-marostegui.json
[17:31:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[17:31:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[17:31:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:28] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[17:31:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21201 and previous config saved to /var/cache/conftool/dbconfig/20220221-173130-marostegui.json
[17:31:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[17:31:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:52] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33895/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[17:32:21] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 92, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:32:45] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@c2fdce7]: fix aqs hourly DAGs start date
[17:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:52] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@c2fdce7]: fix aqs hourly DAGs start date (duration: 00m 07s)
[17:32:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:56] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[17:33:34] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2003.codfw.wmnet with OS bullseye
[17:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:16] <wikibugs>	 (CR) Elukey: [C: +2] Add overlayfs settings to ml-serve2004 [puppet] - https://gerrit.wikimedia.org/r/764379 (owner: Elukey)
[17:38:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2004.codfw.wmnet with OS bullseye
[17:38:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:31] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300774)', diff saved to https://phabricator.wikimedia.org/P21202 and previous config saved to /var/cache/conftool/dbconfig/20220221-174130-kormat.json
[17:41:32] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[17:41:34] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[17:41:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:37] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[17:41:38] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T300774)', diff saved to https://phabricator.wikimedia.org/P21203 and previous config saved to /var/cache/conftool/dbconfig/20220221-174138-kormat.json
[17:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21204 and previous config saved to /var/cache/conftool/dbconfig/20220221-174335-marostegui.json
[17:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:41] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[17:44:17] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:44:56] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[17:45:27] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:46:16] <wikibugs>	 (CR) Jbond: "lgtm but see nits" [puppet] - https://gerrit.wikimedia.org/r/763807 (owner: JHathaway)
[17:46:47] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/763856 (owner: JHathaway)
[17:47:45] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:47:51] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300774)', diff saved to https://phabricator.wikimedia.org/P21205 and previous config saved to /var/cache/conftool/dbconfig/20220221-174750-kormat.json
[17:47:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:57] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[17:50:13] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@17a70a0]: fix missing extra_query_parameters
[17:50:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:20] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@17a70a0]: fix missing extra_query_parameters (duration: 00m 07s)
[17:50:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:32] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: host reimage
[17:55:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:22] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: host reimage
[17:58:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P21206 and previous config saved to /var/cache/conftool/dbconfig/20220221-175839-marostegui.json
[17:58:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:11] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[18:01:11] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[18:01:18] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ml-serve2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.11. Check system logs on 10.192.48.11 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T302240 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[18:01:22] <wikibugs>	 SRE, ops-codfw: Degraded RAID on ml-serve2004 - https://phabricator.wikimedia.org/T302240 (ops-monitoring-bot)
[18:02:00] <elukey>	 what
[18:02:14] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync
[18:02:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:27] <elukey>	 mmm ok now it is a cornercase
[18:02:27] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync
[18:02:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:56] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21207 and previous config saved to /var/cache/conftool/dbconfig/20220221-180255-kormat.json
[18:02:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:47] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1003.wikimedia.org with reason: host reimage
[18:04:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:11] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[18:07:11] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[18:07:34] <wikibugs>	 (PS15) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[18:07:38] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1003.wikimedia.org with reason: host reimage
[18:07:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[18:08:36] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33896/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[18:09:50] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 92, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:09:56] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org
[18:11:09] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 125, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:11:56] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2004.codfw.wmnet with OS bullseye
[18:12:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P21208 and previous config saved to /var/cache/conftool/dbconfig/20220221-181344-marostegui.json
[18:13:45] <wikibugs>	 (PS16) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[18:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[18:14:38] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33897/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[18:17:28] <wikibugs>	 SRE, ops-codfw: Degraded RAID on ml-serve2004 - https://phabricator.wikimedia.org/T302240 (elukey) Open→Invalid Node being reimaged.
[18:18:00] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21209 and previous config saved to /var/cache/conftool/dbconfig/20220221-181800-kormat.json
[18:18:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:56] <wikibugs>	 (PS17) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[18:21:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[18:22:02] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33898/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[18:22:54] <wikibugs>	 (PS1) Andrew Bogott: Set profile::openstack::XXX::keystone::wsgi_server to 'keystone' everywhere [puppet] - https://gerrit.wikimedia.org/r/764430 (https://phabricator.wikimedia.org/T281276)
[18:24:49] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Set profile::openstack::XXX::keystone::wsgi_server to 'keystone' everywhere [puppet] - https://gerrit.wikimedia.org/r/764430 (https://phabricator.wikimedia.org/T281276) (owner: Andrew Bogott)
[18:25:32] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33900/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[18:28:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21210 and previous config saved to /var/cache/conftool/dbconfig/20220221-182849-marostegui.json
[18:28:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[18:28:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[18:28:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:56] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[18:28:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T300381)', diff saved to https://phabricator.wikimedia.org/P21211 and previous config saved to /var/cache/conftool/dbconfig/20220221-182856-marostegui.json
[18:28:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:49] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[18:32:13] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[18:33:04] <urbanecm>	 !log Password reset for Jrnka ka@SUL per Ticket#2022022010002692
[18:33:05] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300774)', diff saved to https://phabricator.wikimedia.org/P21212 and previous config saved to /var/cache/conftool/dbconfig/20220221-183304-kormat.json
[18:33:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:14] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[18:35:36] <jinxer-wm>	 (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[18:36:31] <wikibugs>	 (PS18) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[18:37:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[18:37:46] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33901/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[18:37:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300381)', diff saved to https://phabricator.wikimedia.org/P21213 and previous config saved to /var/cache/conftool/dbconfig/20220221-183751-marostegui.json
[18:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:57] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[18:39:15] <wikibugs>	 (CR) Urbanecm: "code looks good, but I'd appreciate Daimona's opinion here, as they're one of the AF experts" [mediawiki-config] - https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: Huji)
[18:40:26] <wikibugs>	 (CR) Daimona Eaytoy: [C: +1] "Seems fine" [mediawiki-config] - https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: Huji)
[18:52:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P21214 and previous config saved to /var/cache/conftool/dbconfig/20220221-185256-marostegui.json
[18:53:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:05] <wikibugs>	 (PS19) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[18:55:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[18:59:10] <wikibugs>	 (PS20) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[18:59:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:00:02] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33903/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:03:42] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1003.wikimedia.org with OS bullseye
[19:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:16] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33904/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:08:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P21215 and previous config saved to /var/cache/conftool/dbconfig/20220221-190801-marostegui.json
[19:08:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:16] <wikibugs>	 (PS21) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[19:10:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:10:32] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33905/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:13:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:14:13] <wikibugs>	 (PS22) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[19:15:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:15:33] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33906/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:16:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:23:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300381)', diff saved to https://phabricator.wikimedia.org/P21216 and previous config saved to /var/cache/conftool/dbconfig/20220221-192309-marostegui.json
[19:23:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:23:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:19] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[19:23:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:33] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[19:25:36] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[19:27:55] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[19:28:16] <wikibugs>	 (PS23) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[19:28:18] <wikibugs>	 (PS1) Jbond: O:netbox::standalone: remove netboxdb2001 as replica [puppet] - https://gerrit.wikimedia.org/r/764438
[19:28:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:30:26] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33907/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:30:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[19:30:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[19:30:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance
[19:30:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance
[19:31:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:38] <wikibugs>	 (PS24) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330
[19:34:45] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33908/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:38:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[19:38:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[19:38:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[19:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[19:38:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T300381)', diff saved to https://phabricator.wikimedia.org/P21217 and previous config saved to /var/cache/conftool/dbconfig/20220221-193842-marostegui.json
[19:38:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:53] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[19:40:07] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[19:44:20] <wikibugs>	 (CR) Jbond: [V: +1] "most recent pcc is essentially a no-op" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:44:59] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[19:46:59] <wikibugs>	 (CR) Jbond: [V: +1] "Also need to rename hiera keys in the private repo similar to" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond)
[19:50:19] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:51:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300381)', diff saved to https://phabricator.wikimedia.org/P21218 and previous config saved to /var/cache/conftool/dbconfig/20220221-195147-marostegui.json
[19:51:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:53] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[19:56:43] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[19:59:00] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[20:06:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P21219 and previous config saved to /var/cache/conftool/dbconfig/20220221-200651-marostegui.json
[20:06:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P21220 and previous config saved to /var/cache/conftool/dbconfig/20220221-202156-marostegui.json
[20:22:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300381)', diff saved to https://phabricator.wikimedia.org/P21221 and previous config saved to /var/cache/conftool/dbconfig/20220221-203701-marostegui.json
[20:37:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[20:37:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[20:37:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:08] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[20:37:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T300381)', diff saved to https://phabricator.wikimedia.org/P21222 and previous config saved to /var/cache/conftool/dbconfig/20220221-203708-marostegui.json
[20:37:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:54] <wikibugs>	 (PS1) Bartosz Dziewoński: Don't suppress teardown prompt when pressing escape [extensions/VisualEditor] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/764396 (https://phabricator.wikimedia.org/T302096)
[20:48:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300381)', diff saved to https://phabricator.wikimedia.org/P21223 and previous config saved to /var/cache/conftool/dbconfig/20220221-204849-marostegui.json
[20:48:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:56] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[20:59:38] <RhinosF1>	 jouncebot: next
[20:59:39] <jouncebot>	 In 11 hour(s) and 0 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T0800)
[21:03:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P21224 and previous config saved to /var/cache/conftool/dbconfig/20220221-210354-marostegui.json
[21:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P21225 and previous config saved to /var/cache/conftool/dbconfig/20220221-211859-marostegui.json
[21:19:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:50] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:32:25] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:34:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300381)', diff saved to https://phabricator.wikimedia.org/P21226 and previous config saved to /var/cache/conftool/dbconfig/20220221-213403-marostegui.json
[21:34:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[21:34:07] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[21:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:11] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[21:34:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T300381)', diff saved to https://phabricator.wikimedia.org/P21227 and previous config saved to /var/cache/conftool/dbconfig/20220221-213411-marostegui.json
[21:34:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:17] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 142 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:41:40] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:41:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:42:41] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on cloudstore1008 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:45:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300381)', diff saved to https://phabricator.wikimedia.org/P21228 and previous config saved to /var/cache/conftool/dbconfig/20220221-214500-marostegui.json
[21:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:07] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[21:49:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:51:40] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:52:21] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:54:09] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:58:13] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:59:21] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on cloudstore1008 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:00:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P21229 and previous config saved to /var/cache/conftool/dbconfig/20220221-220005-marostegui.json
[22:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P21230 and previous config saved to /var/cache/conftool/dbconfig/20220221-221510-marostegui.json
[22:15:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:37] <icinga-wm>	 PROBLEM - Check systemd state on durum6001 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service,ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:03] <icinga-wm>	 PROBLEM - Check systemd state on doh6001 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:37] <icinga-wm>	 PROBLEM - Check systemd state on durum6002 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:18:19] <icinga-wm>	 PROBLEM - Check systemd state on doh6002 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:28:32] <wikibugs>	 (CR) MSantos: [C: +1] maps: disable kartotherian on maps masters [puppet] - https://gerrit.wikimedia.org/r/764353 (https://phabricator.wikimedia.org/T301664) (owner: Hnowlan)
[22:29:15] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[22:30:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300381)', diff saved to https://phabricator.wikimedia.org/P21231 and previous config saved to /var/cache/conftool/dbconfig/20220221-223015-marostegui.json
[22:30:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:22] <stashbot>	 T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[22:34:11] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[22:35:36] <jinxer-wm>	 (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85%   - https://alerts.wikimedia.org
[22:48:57] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[22:51:23] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[23:06:09] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[23:11:03] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010
[23:25:36] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[23:47:34] <wikibugs>	 (CR) Huji: "Thanks @Daimona." [mediawiki-config] - https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: Huji)
[23:55:53] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook