[00:03:36] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01028 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:16:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [00:16:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [00:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21053 and previous config saved to /var/cache/conftool/dbconfig/20220221-001641-ladsgroup.json [00:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:49] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [00:36:32] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:41:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21054 and previous config saved to /var/cache/conftool/dbconfig/20220221-004128-ladsgroup.json [00:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:34] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [00:56:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P21055 and previous config saved to 
/var/cache/conftool/dbconfig/20220221-005632-ladsgroup.json [00:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P21056 and previous config saved to /var/cache/conftool/dbconfig/20220221-011137-ladsgroup.json [01:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:48] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [01:20:04] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:26:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21057 and previous config saved to /var/cache/conftool/dbconfig/20220221-012642-ladsgroup.json [01:26:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [01:26:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [01:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:49] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [01:26:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298554)', diff saved to https://phabricator.wikimedia.org/P21058 and previous config saved to /var/cache/conftool/dbconfig/20220221-012649-ladsgroup.json [01:26:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298554)', diff saved to https://phabricator.wikimedia.org/P21059 and previous config saved to /var/cache/conftool/dbconfig/20220221-013429-ladsgroup.json [01:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:35] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [01:38:00] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:38:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [01:38:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [01:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T302185)', diff saved to https://phabricator.wikimedia.org/P21060 and previous config saved to /var/cache/conftool/dbconfig/20220221-013811-ladsgroup.json [01:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:18] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [01:39:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2152.codfw.wmnet with OS bullseye [01:39:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:23] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:46:36] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [01:49:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P21061 and previous config saved to /var/cache/conftool/dbconfig/20220221-014934-ladsgroup.json [01:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:54:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2152.codfw.wmnet with reason: host reimage [01:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2152.codfw.wmnet with reason: host reimage [01:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:04:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P21062 and previous config saved to /var/cache/conftool/dbconfig/20220221-020438-ladsgroup.json [02:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:13:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2152.codfw.wmnet with OS bullseye [02:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:32] (PS1) Ladsgroup: Add add_linter_template_T300402.py [software/schema-changes] - 
https://gerrit.wikimedia.org/r/763986 (https://phabricator.wikimedia.org/T300402) [02:16:17] (PS2) Ladsgroup: Add add_linter_template_T300992.py [software/schema-changes] - https://gerrit.wikimedia.org/r/763986 (https://phabricator.wikimedia.org/T300992) [02:19:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298554)', diff saved to https://phabricator.wikimedia.org/P21063 and previous config saved to /var/cache/conftool/dbconfig/20220221-021943-ladsgroup.json [02:19:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [02:19:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [02:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:52] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [02:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:02] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:22:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T302185)', diff saved to https://phabricator.wikimedia.org/P21064 and previous config saved to /var/cache/conftool/dbconfig/20220221-022259-ladsgroup.json [02:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:05] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [02:31:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2091.codfw.wmnet with reason: Maintenance [02:31:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2091.codfw.wmnet with reason: Maintenance [02:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2091 (T302185)', diff saved to https://phabricator.wikimedia.org/P21065 and previous config saved to /var/cache/conftool/dbconfig/20220221-023158-ladsgroup.json [02:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:04] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [02:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [02:33:37] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:34:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2091.codfw.wmnet with OS bullseye [02:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [02:38:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [02:38:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [02:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [02:38:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298554)', diff saved to https://phabricator.wikimedia.org/P21066 and previous config saved to /var/cache/conftool/dbconfig/20220221-023852-ladsgroup.json [02:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:01] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [02:49:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2091.codfw.wmnet with reason: host reimage [02:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:53:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2091.codfw.wmnet with reason: host reimage [02:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298554)', diff saved to https://phabricator.wikimedia.org/P21067 and previous config saved to /var/cache/conftool/dbconfig/20220221-025534-ladsgroup.json [02:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:40] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [03:08:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2091.codfw.wmnet with OS bullseye [03:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P21068 and previous config saved to 
/var/cache/conftool/dbconfig/20220221-031039-ladsgroup.json [03:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:16:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2091 (T302185)', diff saved to https://phabricator.wikimedia.org/P21069 and previous config saved to /var/cache/conftool/dbconfig/20220221-031602-ladsgroup.json [03:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:16:09] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [03:25:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2084.codfw.wmnet with reason: Maintenance [03:25:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2084.codfw.wmnet with reason: Maintenance [03:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P21070 and previous config saved to /var/cache/conftool/dbconfig/20220221-032548-ladsgroup.json [03:25:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2084 (T302185)', diff saved to https://phabricator.wikimedia.org/P21071 and previous config saved to /var/cache/conftool/dbconfig/20220221-032548-ladsgroup.json [03:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:00] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [03:28:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2084.codfw.wmnet with OS bullseye [03:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:55] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2084.codfw.wmnet with reason: host reimage [03:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:40:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298554)', diff saved to https://phabricator.wikimedia.org/P21072 and previous config saved to /var/cache/conftool/dbconfig/20220221-034052-ladsgroup.json [03:40:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [03:40:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [03:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:40:58] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [03:41:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298554)', diff saved to https://phabricator.wikimedia.org/P21073 and previous config saved to /var/cache/conftool/dbconfig/20220221-034100-ladsgroup.json [03:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2084.codfw.wmnet with reason: host reimage [03:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:46:01] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the 
unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:47:09] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:48:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298554)', diff saved to https://phabricator.wikimedia.org/P21074 and previous config saved to /var/cache/conftool/dbconfig/20220221-034836-ladsgroup.json [03:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:48:43] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [03:56:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2084.codfw.wmnet with OS bullseye [03:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P21075 and previous config saved to /var/cache/conftool/dbconfig/20220221-040341-ladsgroup.json [04:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:11:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2084 (T302185)', diff saved to https://phabricator.wikimedia.org/P21076 and previous config saved to /var/cache/conftool/dbconfig/20220221-041123-ladsgroup.json [04:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:11:30] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [04:15:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2083.codfw.wmnet with reason: Maintenance [04:15:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2083.codfw.wmnet with reason: Maintenance 
[04:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2083 (T302185)', diff saved to https://phabricator.wikimedia.org/P21077 and previous config saved to /var/cache/conftool/dbconfig/20220221-041529-ladsgroup.json [04:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2083.codfw.wmnet with OS bullseye [04:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P21078 and previous config saved to /var/cache/conftool/dbconfig/20220221-041846-ladsgroup.json [04:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2083.codfw.wmnet with reason: host reimage [04:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298554)', diff saved to https://phabricator.wikimedia.org/P21079 and previous config saved to /var/cache/conftool/dbconfig/20220221-043350-ladsgroup.json [04:33:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [04:33:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [04:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:56] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [04:33:59] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21080 and previous config saved to /var/cache/conftool/dbconfig/20220221-043358-ladsgroup.json [04:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2083.codfw.wmnet with reason: host reimage [04:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:26] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:39:32] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:48:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2083.codfw.wmnet with OS bullseye [04:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2083 (T302185)', diff saved to https://phabricator.wikimedia.org/P21081 and previous config saved to /var/cache/conftool/dbconfig/20220221-045516-ladsgroup.json [04:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:23] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [05:00:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21082 and previous config saved to /var/cache/conftool/dbconfig/20220221-050050-ladsgroup.json [05:00:55] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:56] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [05:15:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P21083 and previous config saved to /var/cache/conftool/dbconfig/20220221-051555-ladsgroup.json [05:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P21084 and previous config saved to /var/cache/conftool/dbconfig/20220221-053059-ladsgroup.json [05:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:46:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298554)', diff saved to https://phabricator.wikimedia.org/P21085 and previous config saved to /var/cache/conftool/dbconfig/20220221-054604-ladsgroup.json [05:46:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [05:46:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [05:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:11] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [05:46:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298554)', diff saved to 
https://phabricator.wikimedia.org/P21086 and previous config saved to /var/cache/conftool/dbconfig/20220221-054612-ladsgroup.json [05:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:13] SRE, ops-eqiad, DBA: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T302169 (Marostegui) p:Triage→Medium Disk #3 is gone: ` # megacli -PDList -aALL | grep Slot Slot Number: 0 Slot Number: 1 Slot Number: 2 Slot Number: 4 Slot Number: 5 Slot Number: 6 Slot Number: 7 Slot Number: 8... [06:07:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298554)', diff saved to https://phabricator.wikimedia.org/P21087 and previous config saved to /var/cache/conftool/dbconfig/20220221-060701-ladsgroup.json [06:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:08] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [06:07:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [06:08:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [06:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T300775)', diff saved to https://phabricator.wikimedia.org/P21088 and previous config saved to /var/cache/conftool/dbconfig/20220221-060804-marostegui.json [06:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:11] T300775: Add tl_target_id 
column to templatelinks - https://phabricator.wikimedia.org/T300775 [06:11:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:12:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T300381)', diff saved to https://phabricator.wikimedia.org/P21089 and previous config saved to /var/cache/conftool/dbconfig/20220221-061205-marostegui.json [06:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:12] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [06:13:35] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:14:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:17:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300381)', diff saved to https://phabricator.wikimedia.org/P21090 and previous config saved to /var/cache/conftool/dbconfig/20220221-061719-marostegui.json [06:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:26] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [06:18:43] (PS1) Marostegui: db1107: Move from m3 to m5 [puppet] - https://gerrit.wikimedia.org/r/764111 
(https://phabricator.wikimedia.org/T301654) [06:20:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1107.eqiad.wmnet with OS bullseye [06:20:17] (CR) Marostegui: [C: +2] db1107: Move from m3 to m5 [puppet] - https://gerrit.wikimedia.org/r/764111 (https://phabricator.wikimedia.org/T301654) (owner: Marostegui) [06:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P21091 and previous config saved to /var/cache/conftool/dbconfig/20220221-062206-ladsgroup.json [06:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1107.eqiad.wmnet with reason: host reimage [06:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1107.eqiad.wmnet with reason: host reimage [06:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21092 and previous config saved to /var/cache/conftool/dbconfig/20220221-063223-marostegui.json [06:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:53] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [06:37:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P21093 and previous config saved to /var/cache/conftool/dbconfig/20220221-063713-ladsgroup.json [06:37:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:34] (03CR) 10Marostegui: [C: 03+1] Add add_linter_template_T300992.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/763986 (https://phabricator.wikimedia.org/T300992) (owner: 10Ladsgroup) [06:41:46] !log Stop mysql on db1117:3325 to clone db1107 - T301654 [06:41:46] T301654: Upgrade m5 to Bullseye - https://phabricator.wikimedia.org/T301654 [06:46:03] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:46:25] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:46:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1107.eqiad.wmnet with OS bullseye [06:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:24] haproxy alerts are expected [06:47:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P21095 and previous config saved to /var/cache/conftool/dbconfig/20220221-064728-marostegui.json [06:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:49] (03CR) 10Ladsgroup: [C: 03+2] Add add_linter_template_T300992.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/763986 (https://phabricator.wikimedia.org/T300992) (owner: 10Ladsgroup) [06:48:05] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T302169 (10wiki_willy) a:03Cmjohnson [06:48:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:48:30] 10SRE, 10ops-eqiad, 10DC-Ops: cloudvirt1017.mgmt/SSH - https://phabricator.wikimedia.org/T302016 
(wiki_willy) a:Cmjohnson [06:48:44] (Merged) jenkins-bot: Add add_linter_template_T300992.py [software/schema-changes] - https://gerrit.wikimedia.org/r/763986 (https://phabricator.wikimedia.org/T300992) (owner: Ladsgroup) [06:49:09] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:49:33] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:50:25] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:50:34] SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (wiki_willy) Hi @ayounsi - I'm not sure if you're copied on the Interxion ticket, so just forwarding the info along that they completed th... 
[06:52:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298554)', diff saved to https://phabricator.wikimedia.org/P21096 and previous config saved to /var/cache/conftool/dbconfig/20220221-065220-ladsgroup.json [06:52:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:52:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:26] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [06:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance [06:53:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance [06:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:27] (03CR) 10Elukey: install_server: set new partman recipe for kubestage1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm) [07:01:46] (03CR) 10Elukey: [C: 03+1] Add overlayfs settings for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763757 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm) [07:02:12] (03PS2) 10Elukey: install_server: set new partman recipe for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm) [07:02:32] (03CR) 
10jerkins-bot: [V: 04-1] install_server: set new partman recipe for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm) [07:02:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T300381)', diff saved to https://phabricator.wikimedia.org/P21097 and previous config saved to /var/cache/conftool/dbconfig/20220221-070233-marostegui.json [07:02:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:02:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:39] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [07:02:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21098 and previous config saved to /var/cache/conftool/dbconfig/20220221-070240-marostegui.json [07:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:41] (03PS3) 10Elukey: install_server: set new partman recipe for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm) [07:04:43] (03PS2) 10Elukey: Add overlayfs settings for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763757 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm) [07:07:34] (03CR) 10Elukey: [C: 03+2] ml-services: add etwiki and fawiki editquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/763773 
(https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [07:08:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21099 and previous config saved to /var/cache/conftool/dbconfig/20220221-070822-marostegui.json [07:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:29] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [07:08:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:08:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:09:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [07:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[07:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [07:11:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [07:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:15:59] (03CR) 10Elukey: [C: 03+2] install_server: set new partman recipe for kubestage1003 [puppet] - 10https://gerrit.wikimedia.org/r/763756 (https://phabricator.wikimedia.org/T300744) (owner: 10JMeybohm) [07:18:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:20:45] (03PS1) 10Elukey: install_server: move kubestage[12]* nodes to overlayfs partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/764291 (https://phabricator.wikimedia.org/T300744) [07:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21100 and previous config saved to /var/cache/conftool/dbconfig/20220221-072326-marostegui.json [07:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [07:30:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [07:30:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [07:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [07:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:11] RECOVERY - Host asw1-b13-drmrs.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 85.47 ms [07:34:11] RECOVERY - Host ncredir6002 is UP: PING OK - Packet loss = 0%, RTA = 85.63 ms [07:34:11] RECOVERY - Host netflow6001 is UP: PING OK - Packet loss = 0%, RTA = 92.12 ms [07:34:29] RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.drmrs.wikimedia.org, port=443): Read timed out. 
(read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [07:34:43] RECOVERY - Recursive DNS on 2a02:ec80:600:2:185:15:58:37 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [07:34:47] RECOVERY - Host prometheus6001 is UP: PING OK - Packet loss = 0%, RTA = 86.60 ms [07:34:55] (03PS1) 10Marostegui: dbstore_multiinstance.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/764292 (https://phabricator.wikimedia.org/T268869) [07:35:09] (03CR) 10Elukey: [C: 03+2] install_server: move kubestage[12]* nodes to overlayfs partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/764291 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [07:35:23] (JobUnavailable) firing: (22) Reduced availability for job bird in drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [07:35:27] RECOVERY - Maps edge drmrs on upload-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:35:46] (ThanosSidecarBucketOperationsFailed) firing: Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [07:36:53] (03CR) 10Marostegui: [C: 03+2] dbstore_multiinstance.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/764292 (https://phabricator.wikimedia.org/T268869) (owner: 10Marostegui) [07:37:27] (03PS1) 10Elukey: Add overlayfs settings to kubestage2002's settings [puppet] - 10https://gerrit.wikimedia.org/r/764293 (https://phabricator.wikimedia.org/T300744) [07:38:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P21101 and previous config saved to 
/var/cache/conftool/dbconfig/20220221-073831-marostegui.json [07:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:39] (03PS1) 10Marostegui: misc.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/764294 (https://phabricator.wikimedia.org/T268869) [07:39:13] (03CR) 10Elukey: [C: 03+2] Add overlayfs settings to kubestage2002's settings [puppet] - 10https://gerrit.wikimedia.org/r/764293 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [07:39:54] (03CR) 10Marostegui: [C: 03+2] misc.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/764294 (https://phabricator.wikimedia.org/T268869) (owner: 10Marostegui) [07:40:07] elukey: ok to merge your change? [07:40:21] aww, no backport window today? :( [07:40:39] marostegui: <3 [07:40:44] done! [07:40:46] (ThanosSidecarBucketOperationsFailed) resolved: Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [07:43:51] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005895 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:48:38] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS bullseye [07:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:54] (03PS1) 10Kevin Bazira: ml-services: add fiwiki & frwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/764295 (https://phabricator.wikimedia.org/T301415) [07:53:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21102 and previous config saved to 
/var/cache/conftool/dbconfig/20220221-075336-marostegui.json [07:53:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:42] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [07:53:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [07:53:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [07:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [07:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki [07:57:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:57:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:00] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21103 and previous config saved to /var/cache/conftool/dbconfig/20220221-075800-marostegui.json [07:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [07:59:26] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220221T0800) [08:00:23] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [08:01:44] (03PS1) 10Marostegui: Revert "db1123: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/763971 [08:02:30] (03CR) 10Marostegui: [C: 03+2] Revert "db1123: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/763971 (owner: 10Marostegui) [08:02:47] (03PS1) 10Muehlenhoff: Remove LDAP access for Delphine [puppet] - 10https://gerrit.wikimedia.org/r/764297 [08:02:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21104 and previous config saved to /var/cache/conftool/dbconfig/20220221-080248-marostegui.json [08:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:53] T300381: Make page_props.pp_page unsigned on wmf wikis - 
https://phabricator.wikimedia.org/T300381 [08:05:26] (CR) Muehlenhoff: [C: +2] Remove LDAP access for Delphine [puppet] - https://gerrit.wikimedia.org/r/764297 (owner: Muehlenhoff) [08:05:30] (CR) Elukey: [C: +2] ml-services: add fiwiki & frwiki editquality isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/764295 (https://phabricator.wikimedia.org/T301415) (owner: Kevin Bazira) [08:05:32] (PS2) Muehlenhoff: Remove LDAP access for Delphine [puppet] - https://gerrit.wikimedia.org/r/764297 [08:07:19] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [08:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [08:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:23] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [08:10:33] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [08:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:45] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[08:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:03] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:13:09] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10ayounsi) I can confirm that (1), (2) and (4) are done. However cr2-drmrs is currently fully down (console is dead as well). My guess is... [08:14:03] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:14:26] (KubernetesCalicoDown) resolved: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [08:14:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [08:17:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21105 and previous config saved to /var/cache/conftool/dbconfig/20220221-081752-marostegui.json [08:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [08:21:11] (03PS1) 10Marostegui: control-mariadb-10.4: Bump version [software] - 10https://gerrit.wikimedia.org/r/764299 [08:21:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2002.codfw.wmnet with OS bullseye [08:21:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:10] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.4: Bump version [software] - 10https://gerrit.wikimedia.org/r/764299 (owner: 10Marostegui) [08:22:44] !log update karma to 0.99 on alert* hosts - T284213 [08:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:49] T284213: Improve AlertManager dashboard - https://phabricator.wikimedia.org/T284213 [08:23:16] (03Merged) 10jenkins-bot: control-mariadb-10.4: Bump version [software] - 10https://gerrit.wikimedia.org/r/764299 (owner: 10Marostegui) [08:30:47] (03CR) 10Giuseppe Lavagetto: k8s: add module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [08:32:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P21106 and previous config saved to /var/cache/conftool/dbconfig/20220221-083257-marostegui.json [08:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:05] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:38:11] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes-staging,service=kubesvc [08:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:05] (03PS1) 10Muehlenhoff: Remove access for rhuang-ctr [puppet] - 10https://gerrit.wikimedia.org/r/764302 [08:43:37] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for rhuang-ctr [puppet] - 10https://gerrit.wikimedia.org/r/764302 (owner: 10Muehlenhoff) [08:48:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21107 and previous config saved to 
/var/cache/conftool/dbconfig/20220221-084802-marostegui.json [08:48:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:48:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:08] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [08:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1009.eqiad.wmnet with OS buster [08:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:26] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1009.eqiad.wmnet with OS buster [08:51:40] (03PS1) 10Marostegui: change_fa_id_T298294.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/764303 (https://phabricator.wikimedia.org/T298294) [08:52:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:52:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [08:57:36] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [08:57:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T300381)', diff saved to https://phabricator.wikimedia.org/P21108 and previous config saved to /var/cache/conftool/dbconfig/20220221-085745-marostegui.json [08:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:55] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [08:58:43] (03PS1) 10Filippo Giunchedi: alertmanager: route source=icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/764304 (https://phabricator.wikimedia.org/T300951) [08:59:20] (03CR) 10Ladsgroup: [C: 03+1] change_fa_id_T298294.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/764303 (https://phabricator.wikimedia.org/T298294) (owner: 10Marostegui) [09:00:17] (03CR) 10Marostegui: [C: 03+2] change_fa_id_T298294.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/764303 (https://phabricator.wikimedia.org/T298294) (owner: 10Marostegui) [09:00:40] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: remove Icinga/ prefix and add 'source' label 
[debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/763459 (https://phabricator.wikimedia.org/T300951) (owner: Filippo Giunchedi) [09:00:57] (Merged) jenkins-bot: change_fa_id_T298294.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/764303 (https://phabricator.wikimedia.org/T298294) (owner: Marostegui) [09:01:26] (PS3) Elukey: Add overlayfs settings for kubestage1003 [puppet] - https://gerrit.wikimedia.org/r/763757 (https://phabricator.wikimedia.org/T300744) (owner: JMeybohm) [09:01:39] (CR) Filippo Giunchedi: [V: +1 C: +2] thanos: add relabels to rule [puppet] - https://gerrit.wikimedia.org/r/763697 (https://phabricator.wikimedia.org/T300951) (owner: Filippo Giunchedi) [09:01:54] (CR) Filippo Giunchedi: [C: +2] prometheus: inject 'source' label to alerts [puppet] - https://gerrit.wikimedia.org/r/763698 (https://phabricator.wikimedia.org/T300951) (owner: Filippo Giunchedi) [09:02:18] (CR) Filippo Giunchedi: [C: +2] alertmanager: route source=icinga alerts [puppet] - https://gerrit.wikimedia.org/r/764304 (https://phabricator.wikimedia.org/T300951) (owner: Filippo Giunchedi) [09:02:59] (CR) Elukey: [C: +2] Add overlayfs settings for kubestage1003 [puppet] - https://gerrit.wikimedia.org/r/763757 (https://phabricator.wikimedia.org/T300744) (owner: JMeybohm) [09:03:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300381)', diff saved to https://phabricator.wikimedia.org/P21109 and previous config saved to /var/cache/conftool/dbconfig/20220221-090305-marostegui.json [09:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:12] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [09:03:29] elukey: merged your change too [09:03:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1009.eqiad.wmnet with reason: host 
reimage [09:03:35] <3 [09:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:24] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage1003.eqiad.wmnet with OS bullseye [09:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1009.eqiad.wmnet with reason: host reimage [09:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:31] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:10:36] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:14:26] (KubernetesCalicoDown) firing: kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [09:14:36] this is me --^ [09:15:36] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:16:53] 10SRE, 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10ayounsi) I gave a call to Tarek: the power cord on cr2 was faulty, but he was able to find 2 spare ones which he will bill on the ticket.... 
[09:18:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21110 and previous config saved to /var/cache/conftool/dbconfig/20220221-091809-marostegui.json [09:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:53] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1003.eqiad.wmnet with reason: host reimage [09:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:00] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [09:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:05] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-staging2001.codfw.wmnet with OS bullseye [09:22:07] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-staging2001.codfw.wmnet with OS bullseye [09:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:26] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-staging2001.codfw.wmnet with OS bullseye [09:22:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21111 and previous config saved to /var/cache/conftool/dbconfig/20220221-092226-root.json [09:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:37] (03PS16) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [09:24:01] !log deploy prometheus-icinga-exporter 0.19 - T300951 [09:24:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:06] T300951: Add 'source' tag to icinga and prometheus/thanos alerts - https://phabricator.wikimedia.org/T300951 [09:24:16] (03CR) 10David Caro: [C: 03+2] r_lang::bioc: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:24:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1003.eqiad.wmnet with reason: host reimage [09:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:37] (03CR) 10David Caro: [C: 03+2] product_analytics: remove unused profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/751704 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:25:06] (03CR) 10David Caro: r_lang::bioc: remove unused module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:26:03] (03CR) 10Hashar: [C: 03+1] "I have rolled back to patchset 14, preallocating data does not seem to speed up disk writes and we use --snapshot so writes are done entir" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [09:27:01] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [09:27:05] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: pass extinfo-url to icinga-exporter [puppet] - 10https://gerrit.wikimedia.org/r/763457 (https://phabricator.wikimedia.org/T300859) (owner: 10Filippo Giunchedi) [09:27:17] (03CR) 10Majavah: [C: 04-1] "minor style nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [09:30:21] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - 
https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) Thanks for the swift turnarounds on these! [09:33:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P21112 and previous config saved to /var/cache/conftool/dbconfig/20220221-093314-marostegui.json [09:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1009.eqiad.wmnet with OS buster [09:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:47] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1009.eqiad.wmnet with OS buster completed: - ganeti1009 (**PASS**)... [09:34:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1003.eqiad.wmnet with OS bullseye [09:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:36] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:36:48] (03PS17) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [09:36:51] (03PS7) 10Giuseppe Lavagetto: k8s: add module [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) [09:36:53] (03CR) 10Hashar: [C: 03+1] ci: Qemu image and snapshot creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [09:37:30] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21113 and previous config saved to /var/cache/conftool/dbconfig/20220221-093729-root.json [09:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:11] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage [09:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:18] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:39:26] (KubernetesCalicoDown) resolved: kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [09:40:04] (03CR) 10David Caro: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/763792 (https://phabricator.wikimedia.org/T227782) (owner: 10Bearloga) [09:40:50] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [09:41:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage [09:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [09:43:19] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:50] (03CR) 10jerkins-bot: [V: 04-1] k8s: add module [software/spicerack] - 
10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [09:45:37] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes-staging,service=kubesvc [09:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [09:48:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T300381)', diff saved to https://phabricator.wikimedia.org/P21114 and previous config saved to /var/cache/conftool/dbconfig/20220221-094819-marostegui.json [09:48:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:48:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:25] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [09:48:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21115 and previous config saved to /var/cache/conftool/dbconfig/20220221-094826-marostegui.json [09:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:48] 10SRE, 10Machine-Learning-Team, 10Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (10fgiunchedi) +SRE 
for visibility [09:51:17] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:51:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21116 and previous config saved to /var/cache/conftool/dbconfig/20220221-095122-kormat.json [09:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:29] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [09:51:37] !log running schema change against s7 T300774 [09:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2001.codfw.wmnet with OS bullseye [09:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21117 and previous config saved to /var/cache/conftool/dbconfig/20220221-095233-root.json [09:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21118 and previous config saved to /var/cache/conftool/dbconfig/20220221-095410-marostegui.json [09:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:15] T300381: Make page_props.pp_page unsigned on wmf wikis - 
https://phabricator.wikimedia.org/T300381 [09:55:49] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:56:30] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [09:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:57] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:57:32] !log installing PHP 7.4 security updates (as packaged in Debian) [09:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:50] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:59:41] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.05469 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:01:01] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:01:01] !log Rebuild templatelinks table on s2 codfw master (db2104), lag to be expected on codfw T301848 [10:01:01] (03CR) 10JMeybohm: [C: 03+1] "AIUI the task manager should use no more then task_manager_mem (taskmanager.memory.process.size) memory, right?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/763754 (owner: 10DCausse) [10:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:06] T301848: Check for compressed templatelinks tables - https://phabricator.wikimedia.org/T301848 [10:02:20] dcaro: it seems that r_lang::bioc is still used (was just removed by f4efb35f63) and triggering the above Widespread puppet agent failures [10:02:33] see for example https://puppetboard.wikimedia.org/report/analytics1070.eqiad.wmnet/daabf3a68ee5e983656387462f5253ff22d565d9 [10:03:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21119 and previous config saved to /var/cache/conftool/dbconfig/20220221-100737-root.json [10:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:00] (03PS1) 10Ayounsi: Add drmrs interco v6 PTRs [dns] - 10https://gerrit.wikimedia.org/r/764314 [10:08:04] volans: ack, looking [10:08:25] thanks [10:09:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21120 and previous config saved to /var/cache/conftool/dbconfig/20220221-100914-marostegui.json [10:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:00] (03PS1) 10David Caro: Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977 [10:10:41] (03CR) 10jerkins-bot: [V: 04-1] Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977 (owner: 10David Caro) [10:11:24] I think that the issue is the biocLite.R file only, the rest is good to go [10:12:13] (03CR) 10JMeybohm: [C: 03+2] Update default
prometheus-statsd-exporter version to 0.0.10 [puppet] - 10https://gerrit.wikimedia.org/r/762463 (https://phabricator.wikimedia.org/T300629) (owner: 10JMeybohm) [10:14:11] (03PS2) 10David Caro: Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977 [10:14:48] (03CR) 10jerkins-bot: [V: 04-1] Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977 (owner: 10David Caro) [10:15:00] (03PS3) 10David Caro: Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977 [10:15:17] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [10:15:20] volans: feel free to do a quick review, only restored the offending file (that was removed in a last patch) [10:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:01] dcaro: ack, looking [10:16:15] although I have zero context on those modules :) [10:16:34] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [10:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:44] (03PS1) 10Filippo Giunchedi: o11y: temp relax of LogstashIndexingFailures [alerts] - 10https://gerrit.wikimedia.org/r/764316 (https://phabricator.wikimedia.org/T288549) [10:17:48] (03CR) 10DCausse: flink-session-cluster: increase task manager mem limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/763754 (owner: 10DCausse) [10:19:21] (03PS1) 10David Caro: r_lang: remove unused biocLite.R [puppet] - 10https://gerrit.wikimedia.org/r/764318 (https://phabricator.wikimedia.org/T272559) [10:20:06] (03CR) 10David Caro: "Not sure if this is correct, but seems like it, @mpopov will know better."
[puppet] - 10https://gerrit.wikimedia.org/r/764318 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:20:54] volans: it was a cleanup that got an updated patch in the last minute removing that file (thinking that it was not used anymore), sent a followup patch removing the file and the entry and adding mpopov as reviewer (the person with the context) [10:21:14] (03CR) 10Volans: [C: 03+1] "LGTM, noop on PCC" [puppet] - 10https://gerrit.wikimedia.org/r/763977 (owner: 10David Caro) [10:21:27] (03CR) 10David Caro: [C: 03+2] Revert "r_lang::bioc: remove unused module" [puppet] - 10https://gerrit.wikimedia.org/r/763977 (owner: 10David Caro) [10:21:57] merged, should stop the errors [10:22:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21121 and previous config saved to /var/cache/conftool/dbconfig/20220221-102241-root.json [10:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:56] ack, thanks for the fix [10:24:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P21122 and previous config saved to /var/cache/conftool/dbconfig/20220221-102419-marostegui.json [10:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:53] 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10fgiunchedi) [10:30:20] !log Deployed patch for T302192 [10:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:58] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-staging2002.codfw.wmnet with OS bullseye [10:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:15] 10SRE-tools, 10Infrastructure-Foundations: Long 
timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10Volans) That was by design, the parameters used are defined in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/debmonitor/+/refs/head... [10:34:29] 10SRE, 10Traffic: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (10MMandere) [10:34:30] (03CR) 10Elukey: [C: 03+1] o11y: temp relax of LogstashIndexingFailures [alerts] - 10https://gerrit.wikimedia.org/r/764316 (https://phabricator.wikimedia.org/T288549) (owner: 10Filippo Giunchedi) [10:35:36] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [10:35:52] (03PS1) 10Elukey: Add overlayfs settings for kubestage1004 [puppet] - 10https://gerrit.wikimedia.org/r/764322 [10:38:21] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: wikimediacz-l does not hold all posts for moderation - https://phabricator.wikimedia.org/T298729 (10MatthewVernon) p:05Triage→03Low a:03Ladsgroup [10:39:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T300381)', diff saved to https://phabricator.wikimedia.org/P21123 and previous config saved to /var/cache/conftool/dbconfig/20220221-103924-marostegui.json [10:39:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [10:39:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [10:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:30] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [10:39:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T300381)', diff saved to 
https://phabricator.wikimedia.org/P21124 and previous config saved to /var/cache/conftool/dbconfig/20220221-103931-marostegui.json [10:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:47] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21125 and previous config saved to /var/cache/conftool/dbconfig/20220221-104247-kormat.json [10:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:53] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [10:45:09] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/764314 (owner: 10Ayounsi) [10:46:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet [10:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:19] 10SRE, 10Wikimedia-Mailing-lists, 10serviceops, 10User-Ladsgroup: wikimediacz-l does not hold all posts for moderation - https://phabricator.wikimedia.org/T298729 (10MatthewVernon) [10:47:28] 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10fgiunchedi) IMHO the client should fail faster since while running it will block dpkg/apt in such cases [10:48:05] (03PS1) 10Marostegui: replica_set.py: Change message [software] - 10https://gerrit.wikimedia.org/r/764323 [10:48:44] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage [10:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host ganeti1022.eqiad.wmnet [10:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:39] (03CR) 10Elukey: [C: 03+2] Add overlayfs settings for kubestage1004 [puppet] - 10https://gerrit.wikimedia.org/r/764322 (owner: 10Elukey) [10:53:34] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage1004.eqiad.wmnet with OS bullseye [10:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [10:53:40] 10SRE: Options for creating internal (NDA-requiring) dashboards based on data from Google and Bing search consoles - https://phabricator.wikimedia.org/T298991 (10MatthewVernon) 05Open→03Stalled p:05Triage→03Low @AndyRussG I'm making this "Stalled", and "Low" priority for now, since I think really you are... [10:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage [10:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:59] (03CR) 10Ayounsi: [C: 03+2] Add drmrs interco v6 PTRs [dns] - 10https://gerrit.wikimedia.org/r/764314 (owner: 10Ayounsi) [10:56:22] (03CR) 10Ladsgroup: [C: 03+1] replica_set.py: Change message [software] - 10https://gerrit.wikimedia.org/r/764323 (owner: 10Marostegui) [10:57:31] (03CR) 10Marostegui: [C: 03+2] replica_set.py: Change message [software] - 10https://gerrit.wikimedia.org/r/764323 (owner: 10Marostegui) [10:57:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1022.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [10:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:52] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to 
https://phabricator.wikimedia.org/P21126 and previous config saved to /var/cache/conftool/dbconfig/20220221-105752-kormat.json [10:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:59] (03Merged) 10jenkins-bot: replica_set.py: Change message [software] - 10https://gerrit.wikimedia.org/r/764323 (owner: 10Marostegui) [10:59:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1022.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [10:59:03] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [10:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:39] (03CR) 10Jbond: conftool: add request-actions / request-patterns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [11:03:26] (KubernetesCalicoDown) firing: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:05:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2002.codfw.wmnet with OS bullseye [11:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:36] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [11:05:52] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:07:29] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: temp relax of LogstashIndexingFailures [alerts] - 10https://gerrit.wikimedia.org/r/764316 
(https://phabricator.wikimedia.org/T288549) (owner: 10Filippo Giunchedi) [11:08:30] 10SRE, 10Machine-Learning-Team, 10Observability-Logging, 10Patch-For-Review: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (10fgiunchedi) I've bandaided the issue for now, though we should go back to a short `for` clause once the root cause is fixed [11:08:48] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005362 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:09:04] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:09:46] (03CR) 10Jbond: varnish/frontend: consume etcd data for dynamic banning of requests. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto) [11:09:57] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1004.eqiad.wmnet with reason: host reimage [11:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1004.eqiad.wmnet with reason: host reimage [11:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:57] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P21127 and previous config saved to /var/cache/conftool/dbconfig/20220221-111256-kormat.json [11:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:26] (KubernetesCalicoDown) resolved: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org 
[11:13:56] (KubernetesCalicoDown) firing: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:17:31] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [11:18:56] (KubernetesCalicoDown) resolved: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:18:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet [11:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:56] (KubernetesCalicoDown) firing: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:22:31] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [11:23:16] (03PS6) 10Jbond: R:tlsproxy::localssl: Add cfssl support to tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/762535 [11:23:39] <_joe_> this calicodown is for staging, right elukey jayme ? 
[11:24:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1004.eqiad.wmnet with OS bullseye [11:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet [11:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:47] <_joe_> yeah if felix was really down, we'd see way more alerts firing [11:24:56] (KubernetesCalicoDown) resolved: kubestage1004.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:25:36] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [11:26:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1012.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [11:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:24] _joe_ yep it is me reimaging 1004 [11:27:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1012.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [11:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:02] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21128 and previous config saved to /var/cache/conftool/dbconfig/20220221-112801-kormat.json [11:28:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:28:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance 
[11:28:05] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [11:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:07] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:28:09] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21129 and previous config saved to /var/cache/conftool/dbconfig/20220221-112809-kormat.json [11:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:15] (03PS1) 10Jbond: P:netbox: ltidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [11:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:31] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes-staging,service=kubesvc [11:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:01] (03CR) 10Ayounsi: [C: 03+2] drmrs: Anycast tuning for Tata [homer/public] - 10https://gerrit.wikimedia.org/r/763696 (owner: 10Ayounsi) [11:29:23] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: ltidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [11:30:22] (03Merged) 10jenkins-bot: drmrs: Anycast tuning for Tata [homer/public] - 10https://gerrit.wikimedia.org/r/763696 (owner: 10Ayounsi) [11:33:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 77): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33868/console" [puppet] - 10https://gerrit.wikimedia.org/r/762535 (owner: 10Jbond) [11:39:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300381)', diff saved to https://phabricator.wikimedia.org/P21130 and 
previous config saved to /var/cache/conftool/dbconfig/20220221-113950-marostegui.json [11:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:57] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [11:40:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:07] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21131 and previous config saved to /var/cache/conftool/dbconfig/20220221-114307-kormat.json [11:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:14] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:44:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:44:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:10] (PS2) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 [11:48:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:48:46] (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [11:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21132 and previous config saved to /var/cache/conftool/dbconfig/20220221-115455-marostegui.json [11:54:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:10] PROBLEM - Ensure traffic_server is running for instance backend on cp6010 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:55:24] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:55:34] PROBLEM - traffic_server tls process restarted on cp6014 is CRITICAL: 27 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6014&var-layer=tls [11:55:38] PROBLEM - traffic_server tls process restarted on cp6010 is CRITICAL: 25 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6010&var-layer=tls [11:55:50] PROBLEM - traffic_server backend process restarted on cp6014 is CRITICAL: 62 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6014&var-layer=backend [11:55:56] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:57:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129 T301848', diff saved to https://phabricator.wikimedia.org/P21133 and previous config saved to /var/cache/conftool/dbconfig/20220221-115750-marostegui.json [11:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:57] T301848: Check for compressed templatelinks tables - 
https://phabricator.wikimedia.org/T301848 [11:58:11] !log Rebuild templatelinks table on db1129 (s2) T301848 [11:58:12] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P21134 and previous config saved to /var/cache/conftool/dbconfig/20220221-115811-kormat.json [11:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:38] (PS8) Giuseppe Lavagetto: k8s: add module [software/spicerack] - https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) [12:06:08] (CR) Hnowlan: [C: +2] C:cassandra: add optional java_package variable [puppet] - https://gerrit.wikimedia.org/r/722599 (https://phabricator.wikimedia.org/T261966) (owner: Jbond) [12:06:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet [12:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:06] PROBLEM - traffic_server tls process restarted on cp6016 is CRITICAL: 9 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=drmrs+prometheus/ops&var-instance=cp6016&var-layer=tls [12:10:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P21135 and previous config saved to /var/cache/conftool/dbconfig/20220221-120959-marostegui.json [12:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet [12:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff 
saved to https://phabricator.wikimedia.org/P21136 and previous config saved to /var/cache/conftool/dbconfig/20220221-121316-kormat.json [12:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1017.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [12:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:30] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (Michael.Hay) >>! In T301782#7720584, @MMandere wrote: > Thank you @JBennett for the approval. @Michael.Hay please sign the [[ https://phabricator.wikimedia.org/L3 | L3... [12:16:58] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (MoritzMuehlenhoff) [12:19:26] PROBLEM - Number of messages locally queued by purged for processing on cp6016 is CRITICAL: cluster=cache_text instance=cp6016 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [12:21:02] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:21:22] (PS1) Hnowlan: maps: disable kartotherian on maps masters [puppet] - https://gerrit.wikimedia.org/r/764353 (https://phabricator.wikimedia.org/T301664) [12:23:18] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:25:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T300381)', diff saved to 
https://phabricator.wikimedia.org/P21137 and previous config saved to /var/cache/conftool/dbconfig/20220221-122504-marostegui.json [12:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:10] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [12:27:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1110', diff saved to https://phabricator.wikimedia.org/P21138 and previous config saved to /var/cache/conftool/dbconfig/20220221-122727-marostegui.json [12:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21139 and previous config saved to /var/cache/conftool/dbconfig/20220221-122821-kormat.json [12:28:24] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:28:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:33] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:28:34] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 4.68e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [12:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:52] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [12:30:17] !log Deployed patch for T302215 [12:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:00] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [12:31:16] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [12:31:18] RECOVERY - Number of messages locally queued by purged for processing on cp6016 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [12:31:42] PROBLEM - Check systemd state on cp6016 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_intel_microcode.service,systemd-journald-audit.socket,systemd-journald-dev-log.socket,systemd-journald.service,systemd-journald.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21140 and previous config saved to /var/cache/conftool/dbconfig/20220221-123335-root.json [12:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:34:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:34:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:35:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1017.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [12:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:52] !log Rebuild templatelinks table on db2077 (s7) T301848 [12:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:57] T301848: Check for compressed templatelinks tables - https://phabricator.wikimedia.org/T301848 [12:40:26] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:40:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:10] !log marostegui@cumin1001 
START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:42:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21141 and previous config saved to /var/cache/conftool/dbconfig/20220221-124215-marostegui.json [12:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:23] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [12:42:50] (PS1) Cathal Mooney: Add per-subnet netboot conf files for new row E-F subnets in Eqiad [puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) [12:45:42] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [12:45:43] (CR) Majavah: [C: -1] Add per-subnet netboot conf files for new row E-F subnets in Eqiad (1 comment) [puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [12:48:04] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [12:48:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21142 and previous config saved to /var/cache/conftool/dbconfig/20220221-124839-root.json [12:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:14] PROBLEM - Number of messages locally queued by purged for processing on cp6014 is CRITICAL: cluster=cache_text instance=cp6014 job=purged layer=frontend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [12:52:57] (PS2) Cathal Mooney: Add per-subnet netboot conf files for new row E-F subnets in Eqiad [puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) [12:53:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21143 and previous config saved to /var/cache/conftool/dbconfig/20220221-125303-marostegui.json [12:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:10] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [12:53:20] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [12:53:22] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [12:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:27] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21144 and 
previous config saved to /var/cache/conftool/dbconfig/20220221-125326-kormat.json [12:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:33] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:55:06] (PS3) Cathal Mooney: Add per-subnet netboot conf files for new row E-F subnets in Eqiad [puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) [12:56:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1009.eqiad.wmnet [12:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1009.eqiad.wmnet [13:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:35] (PS6) Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) [13:02:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1009.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [13:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:47] SRE-swift-storage, User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (fgiunchedi) [13:03:28] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (MoritzMuehlenhoff) [13:03:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21145 and previous config saved to /var/cache/conftool/dbconfig/20220221-130343-root.json [13:03:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1009.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [13:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:56] !log rebalance ganeti row_C (add nodes reimaged in there) T296721 [13:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:01] T296721: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 [13:08:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P21146 and previous config saved to /var/cache/conftool/dbconfig/20220221-130808-marostegui.json [13:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:21] (CR) Hnowlan: [C: +2] api-gateway: allow discovery services to set custom rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan) [13:10:30] (CR) Cathal Mooney: "Thanks Majavah fixed." 
[puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [13:11:50] (Merged) jenkins-bot: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan) [13:14:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21147 and previous config saved to /var/cache/conftool/dbconfig/20220221-131423-kormat.json [13:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:29] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:15:14] SRE, DC-Ops, serviceops: setup/install mc20[38-55] - https://phabricator.wikimedia.org/T302218 (akosiaris) [13:16:37] RECOVERY - Number of messages locally queued by purged for processing on cp6014 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [13:18:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21148 and previous config saved to /var/cache/conftool/dbconfig/20220221-131846-root.json [13:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:12] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33869/" [puppet] - https://gerrit.wikimedia.org/r/763748 (owner: Ayounsi) [13:23:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P21149 and previous config saved to /var/cache/conftool/dbconfig/20220221-132313-marostegui.json [13:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:55] (CR) Alexandros Kosiaris: [C: +1] k8s: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763821 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite) [13:25:04] (PS1) Krinkle: Increase logging of DBConnection messages from 'error' to 'warning' [mediawiki-config] - https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451) [13:25:26] SRE, Anti-Harassment, DBA: Error Unknown column ipb_sitewide in field list on query - https://phabricator.wikimedia.org/T208462 (DonPaolo) I upgraded to 1.37 from 1.31, and I got the error of ipb_sitewide missing. I had to manually run "ALTER TABLE ipblocks ADD ipb_sitewide bool NOT NULL default... 
[13:29:28] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P21150 and previous config saved to /var/cache/conftool/dbconfig/20220221-132928-kormat.json [13:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:21] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33870/alert1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/763750 (owner: Ayounsi) [13:31:42] (PS2) Ayounsi: Disable Junos alarms check by default [puppet] - https://gerrit.wikimedia.org/r/763750 [13:31:52] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [13:33:47] (CR) Cathal Mooney: [C: +2] Add per-subnet netboot conf files for new row E-F subnets in Eqiad [puppet] - https://gerrit.wikimedia.org/r/764355 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [13:33:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21151 and previous config saved to /var/cache/conftool/dbconfig/20220221-133350-root.json [13:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21152 and previous config saved to /var/cache/conftool/dbconfig/20220221-133818-marostegui.json [13:38:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [13:38:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [13:38:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:24] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [13:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:42] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Hannah Okwelum - https://phabricator.wikimedia.org/T302212 (Atieno) Hello. This is approved from my end. Cheers. [13:43:16] SRE, DC-Ops, cloud-services-team (Kanban): Supporting new hardware in older debian releases - https://phabricator.wikimedia.org/T301162 (MatthewVernon) p:Triage→Medium [13:44:33] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P21153 and previous config saved to /var/cache/conftool/dbconfig/20220221-134433-kormat.json [13:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:57] (CR) JMeybohm: [C: +1] "This breaks down to an additional 6 CPUs (in limits) for cp-jobqueue (just FTR) - should (still ;-)) be fine" [deployment-charts] - https://gerrit.wikimedia.org/r/762418 (https://phabricator.wikimedia.org/T300914) (owner: Hnowlan) [13:45:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [13:45:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [13:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T300381)', diff saved to https://phabricator.wikimedia.org/P21154 and previous config saved to /var/cache/conftool/dbconfig/20220221-134542-marostegui.json [13:45:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:50] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [13:48:26] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/763792 (https://phabricator.wikimedia.org/T227782) (owner: Bearloga) [13:49:05] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1093.eqiad.wmnet with OS bullseye [13:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:11] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye [13:52:22] (PS1) Muehlenhoff: Retire ganeti216 option [puppet] - https://gerrit.wikimedia.org/r/764363 [13:53:01] (CR) jerkins-bot: [V: -1] Retire ganeti216 option [puppet] - https://gerrit.wikimedia.org/r/764363 (owner: Muehlenhoff) [13:54:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300381)', diff saved to https://phabricator.wikimedia.org/P21156 and previous config saved to /var/cache/conftool/dbconfig/20220221-135417-marostegui.json [13:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:24] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [13:58:57] (PS1) Ayounsi: Icinga/netops re-organize devices [puppet] - https://gerrit.wikimedia.org/r/764367 [13:59:38] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300774)', diff saved to https://phabricator.wikimedia.org/P21158 and previous config saved to /var/cache/conftool/dbconfig/20220221-135937-kormat.json 
[13:59:39] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [13:59:41] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [13:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:44] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:59:45] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21159 and previous config saved to /var/cache/conftool/dbconfig/20220221-135945-kormat.json [13:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:07] (PS2) Muehlenhoff: ganeti: Retire ganeti216 option [puppet] - https://gerrit.wikimedia.org/r/764363 [14:00:34] SRE, LDAP-Access-Requests: Grant Access to gerritadmin for Taavi Väänänen (Majavah) - https://phabricator.wikimedia.org/T301579 (MatthewVernon) Open→Resolved a:MatthewVernon Hi, I've done the LDAP change; it'll take an hour for the cache on gerrit to clear [I'm not the right flavour of admin... 
[14:00:58] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage [14:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:03] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [14:05:32] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage [14:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:20] SRE, Beta-Cluster-Infrastructure, Traffic, HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on February 16, 2022. - https://phabricator.wikimedia.org/T301995 (MatthewVernon) Open→Resolved a:MatthewVernon [I think this task can be closed, since the issue was resolve... 
[14:06:21] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 7074 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [14:06:28] SRE, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (MatthewVernon) [14:08:06] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/764363 (owner: Muehlenhoff) [14:08:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21160 and previous config saved to /var/cache/conftool/dbconfig/20220221-140831-root.json [14:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P21161 and previous config saved to /var/cache/conftool/dbconfig/20220221-140922-marostegui.json [14:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:09:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [14:09:59] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [14:10:10] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10MatthewVernon) Does this still need #WMF-NDA-Requests tagging in it? It means it appears in the Clinic Duty dashboard, which is probably not what w... [14:11:02] (03CR) 10Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [14:16:19] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [14:17:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [14:18:14] (03CR) 10JMeybohm: [C: 04-1] "I did try to parse the templates manually based on the data in https://phabricator.wikimedia.org/P21048 and came up with the same the same" [puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto) [14:19:14] the thanos rule alert is me [14:19:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21162 and previous config saved to /var/cache/conftool/dbconfig/20220221-141931-root.json [14:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:52] hah, the icinga configuration is busted but not sure exactly why [14:21:53] Error: 'lsw1-e3-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'elastic1093' (file '/etc/icinga/objects/puppet_hosts.cfg', line 21621)! [14:22:01] cc XioNoX ^ perhaps ? [14:22:19] I think I know [14:22:27] cc topranks [14:22:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [14:22:29] !log installing twisted security updates [14:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:53] ah yeah that'd make sense, thanks [14:22:54] (03PS3) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [14:23:08] there is some automation to define icinga parents automatically based on LLDP [14:23:33] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [14:23:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33871/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [14:23:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21163 and previous config saved to /var/cache/conftool/dbconfig/20220221-142337-root.json [14:23:40] godog: thanks yep that does make sense. [14:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:13] I was testing imaging that host, but yes it's "parent" switch isn't in monitoring, causing this [14:24:22] sorry hadn't anticipated the issue, let me try to sort it out. 
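For readers following the Icinga error above: configuration validation fails when a host names a parent that has no host object of its own, which is what happens when an LLDP-derived parent switch has not been added to monitoring yet. A minimal sketch of that kind of check, assuming a simplified in-memory model (the `find_invalid_parents` helper and the dict layout are hypothetical illustrations, not Icinga's or Puppet's actual code):

```python
from typing import Dict, List, Tuple


def find_invalid_parents(hosts: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (host, parent) pairs whose parent is not itself a defined host.

    `hosts` maps each monitored host name to the parent names assigned to it
    (for example, the switch discovered via LLDP).
    """
    defined = set(hosts)
    return [
        (host, parent)
        for host, parents in sorted(hosts.items())
        for parent in parents
        if parent not in defined
    ]


# Hypothetical data mirroring the error in the log: the new switch is
# referenced as a parent but has no host object in monitoring yet.
monitored = {
    "elastic1093": ["lsw1-e3-eqiad.mgmt.eqiad.wmnet"],
    "alert1001": [],
}

for host, parent in find_invalid_parents(monitored):
    print(f"'{parent}' is not a valid parent for host '{host}'")
```

Under this model the error clears either by adding the switch as a monitored host or by removing the host that references it, which matches the two options discussed in the channel.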
[14:24:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P21164 and previous config saved to /var/cache/conftool/dbconfig/20220221-142426-marostegui.json [14:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:41] sure no worries, LMK if I can help topranks [14:24:48] topranks: https://github.com/wikimedia/puppet/blob/production/hieradata/common/monitoring.yaml and https://github.com/wikimedia/puppet/blob/production/modules/netops/manifests/monitoring.pp [14:25:27] cool thanks XioNoX, yeah getting close to the time to add them there. [14:25:40] topranks: and I sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/764367 to re-organize the monitoring file [14:25:56] it should make it easier for you to add your devices [14:26:05] (03PS4) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [14:26:29] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/764367 (owner: 10Ayounsi) [14:26:39] Ok yeah looks like it will thanks. [14:26:45] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [14:26:52] (03PS2) 10Gehel: cirrus: Alert when the rate of pages fixed by Saneitizer is too high [puppet] - 10https://gerrit.wikimedia.org/r/763573 (https://phabricator.wikimedia.org/T295365) (owner: 10Ebernhardson) [14:28:05] (03PS3) 10Muehlenhoff: ganeti: Retire ganeti216 option [puppet] - 10https://gerrit.wikimedia.org/r/764363 [14:29:32] (03PS5) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [14:30:09] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [14:30:32] topranks: just for expectation' sake, do you have an approximate ETA for the fix ? 
I'm asking because unfortunately icinga config invalid blocks all other changes to its config too :( [14:31:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/764363 (owner: 10Muehlenhoff) [14:32:15] godog: can that host be forced out of icinga for now? [14:32:57] I wasn't 100% sure what to do with it. The reimage failed anyway, or looks like it will. [14:33:09] [54/120, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Unable to get uptime for elastic1093.eqiad.wmnet [14:33:19] ^^ current state. [14:33:39] So maybe I can just cancel and run decommission and then re-try again once switches have been added to mgmt? [14:33:55] topranks: have you checked the console? with install_console [14:34:15] no I can certainly try that [14:34:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21165 and previous config saved to /var/cache/conftool/dbconfig/20220221-143435-root.json [14:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:49] that's the reboot *after* the first puppet run [14:34:53] so the host should come back [14:35:01] (03CR) 10Ayounsi: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33873/" [puppet] - 10https://gerrit.wikimedia.org/r/764367 (owner: 10Ayounsi) [14:35:23] XioNoX: mmhh not as selectively let's say, but if it is e.g. deactivated from puppetdb it won't show up in icinga [14:35:36] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [14:36:40] volans: host did reboot alright, console is sitting at login prompt. [14:36:58] when I try to run "install_console" I'm getting prompted for a pw though [14:37:02] https://www.irccloud.com/pastebin/JAeatIDm/ [14:37:08] Is that normal? 
[14:37:27] depends, if puppet ran successfully, yes [14:37:32] the key gets removed [14:37:50] ah ok yeah, root pw worked fine yep. [14:37:54] is valid only during the first installation [14:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21166 and previous config saved to /var/cache/conftool/dbconfig/20220221-143841-root.json [14:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:21] topranks: cumin can't reach it [14:39:30] yeah it's trying the v6 address and failing [14:39:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300381)', diff saved to https://phabricator.wikimedia.org/P21167 and previous config saved to /var/cache/conftool/dbconfig/20220221-143931-marostegui.json [14:39:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [14:39:34] or actually is very slow [14:39:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [14:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:37] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [14:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:46] https://www.irccloud.com/pastebin/KaX4i6Tu/ [14:40:13] why is v6 not working? [14:40:26] The v6 address is not configured on the host, why I do not know. [14:40:28] my understanding is that the check for uptime is timing out [14:41:14] yeah from the error message that's what it looks like [14:41:23] I assume because of this v6 thing. 
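The "[54/120, retrying in 10.00s]" output quoted earlier is a fixed-delay retry loop around the uptime check. A rough sketch of that pattern, assuming nothing about spicerack's real implementation (the `retry` helper, its parameters, and the `get_uptime` stub are all hypothetical names for illustration):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(func: Callable[[], T], tries: int = 120, delay: float = 10.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func until it succeeds, waiting `delay` seconds between attempts."""
    if tries < 1:
        raise ValueError("tries must be at least 1")
    attempt = 1
    while True:
        try:
            return func()
        except Exception as exc:
            if attempt >= tries:
                raise  # out of attempts: surface the last error to the caller
            print(f"[{attempt}/{tries}, retrying in {delay:.2f}s] Attempt raised: {exc}")
            sleep(delay)
            attempt += 1


# Hypothetical stand-in for the remote uptime check: fails twice, then succeeds.
attempts = {"n": 0}


def get_uptime() -> float:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("Unable to get uptime for host")
    return 42.0


# The sleep function is injected so the example runs instantly.
uptime = retry(get_uptime, tries=5, delay=10.0, sleep=lambda _seconds: None)
```

With an unreachable address every attempt fails, so a loop like this only gives up after all tries are spent, which is consistent with the reimage appearing stuck for a long time before reporting failure.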
[14:41:42] Device's v6 IP is not defined in /etc/network/interfaces [14:41:49] (03PS6) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [14:42:01] It is configured to add a link local: up ip addr add fe80::10:64:132:2/64 dev enp59s0f0np0 [14:42:34] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [14:42:45] it should have https://netbox.wikimedia.org/ipam/ip-addresses/10174/ too [14:43:27] Ok. So the debian installer didn't create the "interfaces" file right for some reason. [14:44:10] apparently so [14:45:22] the timeout is set to 10s that for cat /proc/uptime seems an eternity [14:45:28] yet it manages to trigger it [14:45:40] (03PS1) 10Filippo Giunchedi: prometheus: default Host header to SNI [puppet] - 10https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) [14:45:47] from one side actually good so we did notice the issue [14:46:19] (03CR) 10jerkins-bot: [V: 04-1] prometheus: default Host header to SNI [puppet] - 10https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:46:43] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 4.865e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [14:46:45] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [14:47:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [14:47:03] !log marostegui@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [14:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T300381)', diff saved to https://phabricator.wikimedia.org/P21168 and previous config saved to /var/cache/conftool/dbconfig/20220221-144707-marostegui.json [14:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:15] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [14:47:28] (03PS2) 10Filippo Giunchedi: prometheus: default Host header to SNI [puppet] - 10https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) [14:47:37] Ah I may know what's up. [14:48:17] Our IPv6 allocation on servers is dependent on the device already having gotten an address using SLAAC? [14:49:09] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [14:49:13] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [14:49:20] Which we then extract the prefix from and set the v6. [14:49:32] Right that's an issue - I didn't have these new switches set up to do SLAAC. 
[14:49:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21169 and previous config saved to /var/cache/conftool/dbconfig/20220221-144938-root.json [14:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:24] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33878/console" [puppet] - 10https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:52:02] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1093.eqiad.wmnet with OS bullseye [14:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye executed... [14:53:38] topranks: ack [14:53:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21170 and previous config saved to /var/cache/conftool/dbconfig/20220221-145345-root.json [14:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:01] if you need to run the reimage you can just re-run it removing the --new option, no need for decom [14:54:26] ah ok good tip thanks. 
[14:55:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300381)', diff saved to https://phabricator.wikimedia.org/P21171 and previous config saved to /var/cache/conftool/dbconfig/20220221-145556-marostegui.json [14:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:02] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [15:00:05] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21172 and previous config saved to /var/cache/conftool/dbconfig/20220221-150004-kormat.json [15:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:10] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:00:59] (03PS7) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [15:01:16] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Hannah Okwelum - https://phabricator.wikimedia.org/T302212 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Hi, I've done this. 
Regards, Matthew [15:01:54] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [15:03:01] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: more replicas, less CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/762418 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [15:03:39] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 4.678e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [15:03:43] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [15:04:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21173 and previous config saved to /var/cache/conftool/dbconfig/20220221-150442-root.json [15:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:05] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6010 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [15:06:09] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [15:06:27] 10SRE, 10serviceops: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (10Joe) a:03Joe [15:06:54] (03Merged) 10jenkins-bot: changeprop-jobqueue: more replicas, less CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/762418 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [15:07:23] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10MatthewVernon) @Zabe can I confirm you've been in touch with legal directly? [15:08:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21174 and previous config saved to /var/cache/conftool/dbconfig/20220221-150848-root.json [15:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:10] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [15:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:50] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [15:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:36] (03PS1) 10Elukey: Add new k8s partman recipe to ml-serve[12]00[1-4] nodes [puppet] - 10https://gerrit.wikimedia.org/r/764374 [15:10:45] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [15:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P21175 and previous config saved to 
/var/cache/conftool/dbconfig/20220221-151101-marostegui.json [15:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:22] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [15:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:17] 10SRE, 10serviceops: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (10Joe) I think this is the old etcd certificate we used to use for etcd in codfw; since we've moved to etcd v3 we're using a new cert created with cergen: ` $ openssl s_client -host conf2004.codfw.wmnet -p... [15:14:27] (03CR) 10Elukey: [C: 03+2] Add new k8s partman recipe to ml-serve[12]00[1-4] nodes [puppet] - 10https://gerrit.wikimedia.org/r/764374 (owner: 10Elukey) [15:15:09] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21176 and previous config saved to /var/cache/conftool/dbconfig/20220221-151509-kormat.json [15:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:51] (03PS3) 10Filippo Giunchedi: prometheus: default Host header to SNI [puppet] - 10https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) [15:19:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P21177 and previous config saved to /var/cache/conftool/dbconfig/20220221-151945-root.json [15:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:39] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 
[15:21:29] 10SRE: Options for creating internal (NDA-requiring) dashboards based on data from Google and Bing search consoles - https://phabricator.wikimedia.org/T298991 (10AndyRussG) >>! In T298991#7725178, @MatthewVernon wrote: > @AndyRussG I'm making this "Stalled", and "Low" priority for now, since I think really you a... [15:21:39] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33882/console" [puppet] - 10https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:23:05] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [15:24:12] (03PS4) 10Filippo Giunchedi: prometheus: default Host header to SNI [puppet] - 10https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) [15:24:43] (03PS1) 10Elukey: Add overlayfs settings for ml-serve2001 [puppet] - 10https://gerrit.wikimedia.org/r/764376 [15:24:45] (03PS1) 10Elukey: Add overlayfs settings for ml-serve2002 [puppet] - 10https://gerrit.wikimedia.org/r/764377 [15:24:47] (03PS1) 10Elukey: Add overlayfs settings for ml-serve2003 [puppet] - 10https://gerrit.wikimedia.org/r/764378 [15:24:49] (03PS1) 10Elukey: Add overlayfs settings to ml-serve2004 [puppet] - 10https://gerrit.wikimedia.org/r/764379 [15:25:13] (03CR) 10jerkins-bot: [V: 04-1] prometheus: default Host header to SNI [puppet] - 10https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:25:36] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org [15:26:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P21178 and previous config saved to /var/cache/conftool/dbconfig/20220221-152606-marostegui.json [15:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:19] (03CR) 10Giuseppe Lavagetto: Add *.k8s-staging.discovery.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [15:26:58] (03CR) 10Elukey: [C: 03+2] Add overlayfs settings for ml-serve2001 [puppet] - 10https://gerrit.wikimedia.org/r/764376 (owner: 10Elukey) [15:28:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add *.k8s-staging.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [15:28:41] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2001.codfw.wmnet with OS bullseye [15:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:44] (03PS2) 10Vgutierrez: prometheus: Aggregation rules for HAProxy TTFB [puppet] - 10https://gerrit.wikimedia.org/r/763566 (https://phabricator.wikimedia.org/T290005) [15:30:14] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21179 and previous config saved to /var/cache/conftool/dbconfig/20220221-153013-kormat.json [15:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:50] !log mforns@deploy1002 Started deploy [analytics/refinery@ed5c9f9]: Deploy Aqs Hourly for Airflow [analytics/refinery@ed5c9f9] [15:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:16] (03CR) 10JMeybohm: "@Brandon: Would you mind taking a look if that's something you think is okay to do?" 
[dns] - 10https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [15:34:37] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:35:09] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [15:35:26] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:35:51] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:06] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10Zabe) >>! In T302163#7725826, @MatthewVernon wrote: > @Zabe can I confirm you've been in touch with legal directly? Yes, I have sent an email to leg... [15:39:57] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [15:41:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300381)', diff saved to https://phabricator.wikimedia.org/P21180 and previous config saved to /var/cache/conftool/dbconfig/20220221-154110-marostegui.json [15:41:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [15:41:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [15:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:18] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [15:41:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T300381)', diff saved to https://phabricator.wikimedia.org/P21181 and previous config saved to /var/cache/conftool/dbconfig/20220221-154118-marostegui.json [15:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:25] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Aggregation rules for HAProxy TTFB [puppet] - 10https://gerrit.wikimedia.org/r/763566 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:47] (03CR) 10Vgutierrez: [C: 03+2] prometheus: Aggregation rules for HAProxy TTFB [puppet] - 10https://gerrit.wikimedia.org/r/763566 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:44:02] 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10TheresNoTime) [15:45:19] !log kormat@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1127 (T300774)', diff saved to https://phabricator.wikimedia.org/P21182 and previous config saved to /var/cache/conftool/dbconfig/20220221-154518-kormat.json [15:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:24] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [15:45:24] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:45:25] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2001.codfw.wmnet with reason: host reimage [15:45:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [15:45:26] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [15:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:34] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [15:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2001.codfw.wmnet with reason: host reimage [15:47:51] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10Lucas_Werkmeister_WMDE) I support this access request, and will be happy to provide assistance to @TheresNoTime if needed. 
👍 [15:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:34] (03PS5) 10Filippo Giunchedi: prometheus: default Host header to SNI [puppet] - 10https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) [15:50:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300381)', diff saved to https://phabricator.wikimedia.org/P21183 and previous config saved to /var/cache/conftool/dbconfig/20220221-155034-marostegui.json [15:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:40] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [15:51:55] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33886/console" [puppet] - 10https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:52:05] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [15:52:13] !log mforns@deploy1002 Finished deploy [analytics/refinery@ed5c9f9]: Deploy Aqs Hourly for Airflow [analytics/refinery@ed5c9f9] (duration: 21m 23s) [15:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:30] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [15:58:45] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 92, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:59:19] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:59:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:25] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T300774)', diff saved to https://phabricator.wikimedia.org/P21184 and previous config saved to /var/cache/conftool/dbconfig/20220221-155924-kormat.json [15:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:34] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:59:49] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 125, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:00:26] (KubernetesCalicoDown) resolved: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:01:09] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33887/console" [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [16:01:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2001.codfw.wmnet with OS bullseye [16:01:20] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:51] (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: default Host header to SNI [puppet] - https://gerrit.wikimedia.org/r/764371 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [16:01:53] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1093.eqiad.wmnet with OS bullseye [16:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:59] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye [16:03:15] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=ml-serve,service=kubesvc [16:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:28] mmmm [16:04:21] ah ml_serve uff [16:04:35] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=ml_serve,service=kubesvc [16:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:14] ah the hostname is ml-serve but the cluster is ml_serve ? /o\ [16:05:34] can we fix it or it is already too late?
[16:05:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P21185 and previous config saved to /var/cache/conftool/dbconfig/20220221-160538-marostegui.json [16:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:45] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-serve200[5-8].codfw.wmnet [16:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:28] godog: I don't recall why it was done in that way [16:07:10] yeah IIRC there's nothing wrong with dashes in cluster name [16:07:39] maybe I'm misremembering though [16:08:08] mmhh no should be fine, we have wqds-test for example [16:08:38] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300774)', diff saved to https://phabricator.wikimedia.org/P21186 and previous config saved to /var/cache/conftool/dbconfig/20220221-160838-kormat.json [16:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:45] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [16:09:23] PROBLEM - ganeti-noded running on ganeti1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:09:42] (Abandoned) Hashar: Remove bot humors for deployers [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/734964 (owner: Hashar) [16:10:04] godog: I think that it was named after the cluster in wikimedia_clusters [16:10:09] that is named ml_serve [16:10:10] mmmm [16:10:20] so we should rename the conftool config right?
[16:10:27] PROBLEM - ganeti-confd running on ganeti1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [16:10:47] PROBLEM - ganeti-mond running on ganeti1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [16:11:20] elukey: I believe so, the conftool bits and also the "cluster" variable in wikimedia_clusters I think would be nice if they matched [16:11:52] not the end of the world all things considered but one of those friction points that compounds [16:12:22] I see it can be possible to use - and _ [16:12:39] so they are currently matching though, ml_serve [16:12:48] the first conftool action I think was a no-op [16:12:50] (my bad) [16:14:02] yeah they are consistent between each other but not with the hostnames ml-serve [16:14:13] okok [16:14:36] I can try to work on it, hope that it will not break too many things [16:15:30] if you can/want I think it'll pay off, if not that's fine too [16:15:42] I don't want to jinx but I think most/all things should DTRT [16:15:59] okok I'll try, the scary part is pybal but hopefully it should work [16:16:17] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:18] (CR) Elukey: [C: +2] Add overlayfs settings for ml-serve2002 [puppet] - https://gerrit.wikimedia.org/r/764377 (owner: Elukey) [16:16:21] (PS1) Hnowlan: api-gateway: bump chart [deployment-charts] - https://gerrit.wikimedia.org/r/764409 (https://phabricator.wikimedia.org/T295956) [16:17:03] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage [16:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:03] (CR) Lucas Werkmeister (WMDE): [C: +1] beta: Allow
opening the alpha NewLexeme special page on beta-wikidatawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/763703 (https://phabricator.wikimedia.org/T301234) (owner: Michael Große) [16:18:07] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2002.codfw.wmnet with OS bullseye [16:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P21187 and previous config saved to /var/cache/conftool/dbconfig/20220221-162043-marostegui.json [16:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:30] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1093.eqiad.wmnet with reason: host reimage [16:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:43] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21188 and previous config saved to /var/cache/conftool/dbconfig/20220221-162342-kormat.json [16:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:15] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:24:26] (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:25:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:27:03] RECOVERY - ganeti-confd running on
ganeti1005 is OK: PROCS OK: 1 process with UID = 111 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [16:27:27] RECOVERY - ganeti-mond running on ganeti1005 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [16:28:17] RECOVERY - ganeti-noded running on ganeti1005 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [16:30:49] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1093.eqiad.wmnet with OS bullseye [16:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:54] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host elastic1093.eqiad.wmnet with OS bullseye complete... [16:31:06] topranks: yay it worked this time :D [16:31:31] haha... still looking at the logs afraid to say that :) [16:31:37] But yes appears to have worked fine :) [16:31:39] woohoo! [16:32:02] :) [16:34:26] (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:34:44] very nice!
[16:34:51] (CR) Hnowlan: [C: +2] api-gateway: bump chart [deployment-charts] - https://gerrit.wikimedia.org/r/764409 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan) [16:34:56] (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:35:17] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: host reimage [16:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300381)', diff saved to https://phabricator.wikimedia.org/P21189 and previous config saved to /var/cache/conftool/dbconfig/20220221-163548-marostegui.json [16:35:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [16:35:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [16:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:54] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [16:35:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T300381)', diff saved to https://phabricator.wikimedia.org/P21190 and previous config saved to /var/cache/conftool/dbconfig/20220221-163555-marostegui.json [16:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:47] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 4.592e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [16:36:51] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [16:36:56] !log mforns@deploy1002 Started deploy [analytics/refinery@ed5c9f9] (thin): Deploy Aqs Hourly for Airflow THIN [analytics/refinery@ed5c9f9] [16:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:03] !log mforns@deploy1002 Finished deploy [analytics/refinery@ed5c9f9] (thin): Deploy Aqs Hourly for Airflow THIN [analytics/refinery@ed5c9f9] (duration: 00m 07s) [16:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:37] !log mforns@deploy1002 Started deploy [analytics/refinery@ed5c9f9] (hadoop-test): Deploy Aqs Hourly for Airflow THIN [analytics/refinery@ed5c9f9] [16:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: host reimage [16:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:38] (Merged) jenkins-bot: api-gateway: bump chart [deployment-charts] - https://gerrit.wikimedia.org/r/764409 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan) [16:38:48] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21191 and previous config saved to /var/cache/conftool/dbconfig/20220221-163847-kormat.json [16:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:15] RECOVERY - Time elapsed since the last kafka event processed by
purged on cp6010 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [16:39:16] (CR) Klausman: "Does this supersede the other change? It only edits the 2003 yaml file which is deleted here." [puppet] - https://gerrit.wikimedia.org/r/764379 (owner: Elukey) [16:39:19] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [16:39:56] (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:40:11] (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:41:41] (CR) Elukey: Add overlayfs settings to ml-serve2004 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/764379 (owner: Elukey) [16:43:02] (CR) Klausman: [C: +1] Add overlayfs settings to ml-serve2004 [puppet] - https://gerrit.wikimedia.org/r/764379 (owner: Elukey) [16:43:19] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33888/console" [puppet] - https://gerrit.wikimedia.org/r/764379 (owner: Elukey) [16:43:58] SRE: Issue installing ca-certificates-java openjdk 11 - https://phabricator.wikimedia.org/T300300 (colewhite) [16:44:34] (PS8) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 [16:44:49] !log mforns@deploy1002 Finished deploy [analytics/refinery@ed5c9f9] (hadoop-test): Deploy Aqs Hourly for Airflow THIN [analytics/refinery@ed5c9f9]
(duration: 07m 12s) [16:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:16] (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [16:46:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300381)', diff saved to https://phabricator.wikimedia.org/P21192 and previous config saved to /var/cache/conftool/dbconfig/20220221-164608-marostegui.json [16:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:15] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [16:47:58] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [16:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:17] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [16:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:49] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 125, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:50:11] (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [16:50:42] (PS9) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 [16:50:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 92, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:51:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2002.codfw.wmnet with OS bullseye [16:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:23] (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] -
https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [16:53:52] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300774)', diff saved to https://phabricator.wikimedia.org/P21193 and previous config saved to /var/cache/conftool/dbconfig/20220221-165352-kormat.json [16:53:53] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [16:53:54] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:53:56] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:53:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:54:01] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:02] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [16:54:05] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T300774)', diff saved to https://phabricator.wikimedia.org/P21194 and previous config saved to /var/cache/conftool/dbconfig/20220221-165405-kormat.json [16:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:18] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300774)', diff saved to https://phabricator.wikimedia.org/P21195 and previous config saved to /var/cache/conftool/dbconfig/20220221-165616-kormat.json [16:56:21] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [16:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:51] (CR) Elukey: [C: +2] Add overlayfs settings for ml-serve2003 [puppet] - https://gerrit.wikimedia.org/r/764378 (owner: Elukey) [16:59:49] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2003.codfw.wmnet with OS bullseye [16:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P21196 and previous config saved to /var/cache/conftool/dbconfig/20220221-170113-marostegui.json [17:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:30] (PS2) Andrew Bogott: nfs add_server: disable nfs mounts for new nfs servers [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/763955 [17:02:40] (PS2) Elukey: Add overlayfs settings to ml-serve2004 [puppet] - https://gerrit.wikimedia.org/r/764379 [17:02:42] SRE, ops-codfw, DC-Ops, serviceops, Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (JMeybohm) [17:02:47] (PS10) Jbond: P:netbox: tidy up netbox profile [puppet] -
https://gerrit.wikimedia.org/r/764330 [17:03:48] (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [17:03:50] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33892/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [17:05:33] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:06:25] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@f1244e0]: Migrate aqs/hourly from Oozie|Hive to Airflow|Spark [17:06:26] (KubernetesCalicoDown) firing: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [17:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:33] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@f1244e0]: Migrate aqs/hourly from Oozie|Hive to Airflow|Spark (duration: 00m 07s) [17:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:07:08] (CR) Andrew Bogott: [C: +2] Move service name openstack.eqiad1.wikimediacloud.org to cloudcontrol1005 [dns] - https://gerrit.wikimedia.org/r/763805 (https://phabricator.wikimedia.org/T281276) (owner: Andrew Bogott) [17:07:19] (PS2) Andrew Bogott: Move service name openstack.eqiad1.wikimediacloud.org to cloudcontrol1005 [dns] - https://gerrit.wikimedia.org/r/763805 (https://phabricator.wikimedia.org/T281276) [17:08:09] (PS11) Jbond: P:netbox: tidy up netbox
profile [puppet] - https://gerrit.wikimedia.org/r/764330 [17:09:08] (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [17:09:11] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33893/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [17:10:17] SRE, Product-Infrastructure-Team-Backlog, serviceops, Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (akosiaris) >>! In T274388#7722815, @MSantos wrote: > @akosiaris and @jijiki how can we move forward with this? > > For context: > - [[... [17:10:29] (CR) Alexandros Kosiaris: [C: +1] run_ci_locally.sh: add podman support [puppet] - https://gerrit.wikimedia.org/r/763807 (owner: JHathaway) [17:11:22] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P21197 and previous config saved to /var/cache/conftool/dbconfig/20220221-171121-kormat.json [17:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:41] (PS12) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 [17:14:49] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33894/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [17:14:53] (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [17:16:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P21198 and previous config saved to /var/cache/conftool/dbconfig/20220221-171618-marostegui.json [17:16:22] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:24] SRE, SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (Ladsgroup) I also support this request, TNT had production access before and trusted and has been instrumental in lot of work in incidents and any area possible. So much <3 for her. [17:16:26] (KubernetesCalicoDown) resolved: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [17:16:49] this is me reimaging --^ [17:16:53] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2003.codfw.wmnet with reason: host reimage [17:16:56] (KubernetesCalicoDown) firing: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [17:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2003.codfw.wmnet with reason: host reimage [17:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:08] (PS1) Andrew Bogott: Revert "Move service name openstack.eqiad1.wikimediacloud.org to cloudcontrol1005" [dns] - https://gerrit.wikimedia.org/r/764421 [17:20:35] (PS10) Cathal Mooney: Base config additions and updated templates to configure EVPN ASW [homer/public] - https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) [17:20:47] (CR) jerkins-bot: [V: -1] Base config additions and updated templates to configure EVPN ASW [homer/public] - https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [17:21:56] (KubernetesCalicoDown) resolved: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod -
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [17:22:56] (KubernetesCalicoDown) firing: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [17:26:08] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1003.wikimedia.org with OS bullseye [17:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:26] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P21199 and previous config saved to /var/cache/conftool/dbconfig/20220221-172626-kormat.json [17:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:22] (PS13) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 [17:30:13] (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [17:30:41] (PS14) Jbond: P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 [17:31:09] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 125, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:31:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300381)', diff saved to https://phabricator.wikimedia.org/P21200 and previous config saved to /var/cache/conftool/dbconfig/20220221-173122-marostegui.json [17:31:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [17:31:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [17:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:28] T300381: Make
page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [17:31:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21201 and previous config saved to /var/cache/conftool/dbconfig/20220221-173130-marostegui.json [17:31:31] (CR) jerkins-bot: [V: -1] P:netbox: tidy up netbox profile [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [17:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:52] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33895/console" [puppet] - https://gerrit.wikimedia.org/r/764330 (owner: Jbond) [17:32:21] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 92, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:32:45] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@c2fdce7]: fix aqs hourly DAGs start date [17:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:52] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@c2fdce7]: fix aqs hourly DAGs start date (duration: 00m 07s) [17:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:56] (KubernetesCalicoDown) resolved: ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [17:33:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2003.codfw.wmnet with OS bullseye [17:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:16] (CR) Elukey: [C: +2] Add overlayfs settings to
ml-serve2004 [puppet] - 10https://gerrit.wikimedia.org/r/764379 (owner: 10Elukey) [17:38:28] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2004.codfw.wmnet with OS bullseye [17:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:31] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300774)', diff saved to https://phabricator.wikimedia.org/P21202 and previous config saved to /var/cache/conftool/dbconfig/20220221-174130-kormat.json [17:41:32] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [17:41:34] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [17:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:37] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [17:41:38] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T300774)', diff saved to https://phabricator.wikimedia.org/P21203 and previous config saved to /var/cache/conftool/dbconfig/20220221-174138-kormat.json [17:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21204 and previous config saved to /var/cache/conftool/dbconfig/20220221-174335-marostegui.json [17:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:41] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [17:44:17] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - 
AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:44:56] (KubernetesCalicoDown) firing: (2) ml-serve2003.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [17:45:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:46:16] (03CR) 10Jbond: "lgtm but see nits" [puppet] - 10https://gerrit.wikimedia.org/r/763807 (owner: 10JHathaway) [17:46:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/763856 (owner: 10JHathaway) [17:47:45] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:47:51] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300774)', diff saved to https://phabricator.wikimedia.org/P21205 and previous config saved to /var/cache/conftool/dbconfig/20220221-174750-kormat.json [17:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:57] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [17:50:13] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@17a70a0]: fix missing extra_query_parameters [17:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:20] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@17a70a0]: fix missing extra_query_parameters (duration: 00m 07s) [17:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:32] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: host reimage [17:55:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: host reimage [17:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P21206 and previous config saved to /var/cache/conftool/dbconfig/20220221-175839-marostegui.json [17:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:11] (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [18:01:11] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [18:01:18] ACKNOWLEDGEMENT - MD RAID on ml-serve2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.11. 
Check system logs on 10.192.48.11 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T302240 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:01:22] 10SRE, 10ops-codfw: Degraded RAID on ml-serve2004 - https://phabricator.wikimedia.org/T302240 (10ops-monitoring-bot) [18:02:00] what [18:02:14] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [18:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:27] mmm ok now it is a cornercase [18:02:27] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [18:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:56] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21207 and previous config saved to /var/cache/conftool/dbconfig/20220221-180255-kormat.json [18:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:47] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1003.wikimedia.org with reason: host reimage [18:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:11] (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [18:07:11] (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [18:07:34] (03PS15) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [18:07:38] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1003.wikimedia.org with reason: host reimage [18:07:41] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:12] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [18:08:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33896/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [18:09:50] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 92, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:09:56] (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [18:11:09] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 125, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:11:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2004.codfw.wmnet with OS bullseye [18:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P21208 and previous config saved to /var/cache/conftool/dbconfig/20220221-181344-marostegui.json [18:13:45] (03PS16) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [18:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:28] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [18:14:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33897/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [18:17:28] 
10SRE, 10ops-codfw: Degraded RAID on ml-serve2004 - https://phabricator.wikimedia.org/T302240 (10elukey) 05Open→03Invalid Node being reimaged. [18:18:00] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21209 and previous config saved to /var/cache/conftool/dbconfig/20220221-181800-kormat.json [18:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:56] (03PS17) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [18:21:38] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [18:22:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33898/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [18:22:54] (03PS1) 10Andrew Bogott: Set profile::openstack::XXX::keystone::wsgi_server to 'keystone' everywhere [puppet] - 10https://gerrit.wikimedia.org/r/764430 (https://phabricator.wikimedia.org/T281276) [18:24:49] (03CR) 10Andrew Bogott: [C: 03+2] Set profile::openstack::XXX::keystone::wsgi_server to 'keystone' everywhere [puppet] - 10https://gerrit.wikimedia.org/r/764430 (https://phabricator.wikimedia.org/T281276) (owner: 10Andrew Bogott) [18:25:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33900/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [18:28:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300381)', diff saved to https://phabricator.wikimedia.org/P21210 and previous config saved to /var/cache/conftool/dbconfig/20220221-182849-marostegui.json [18:28:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance 
[18:28:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:56] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [18:28:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T300381)', diff saved to https://phabricator.wikimedia.org/P21211 and previous config saved to /var/cache/conftool/dbconfig/20220221-182856-marostegui.json [18:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:49] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [18:32:13] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [18:33:04] !log Password reset for Jrnka ka@SUL per Ticket#2022022010002692 [18:33:05] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300774)', diff saved to https://phabricator.wikimedia.org/P21212 and previous config saved to /var/cache/conftool/dbconfig/20220221-183304-kormat.json [18:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:14] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [18:35:36] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [18:36:31] (03PS18) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [18:37:17] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [18:37:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33901/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [18:37:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300381)', diff saved to https://phabricator.wikimedia.org/P21213 and previous config saved to /var/cache/conftool/dbconfig/20220221-183751-marostegui.json [18:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:57] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [18:39:15] (03CR) 10Urbanecm: "code looks good, but I'd appreciate Daimona's opinion here, as they're one of the AF experts" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: 10Huji) [18:40:26] (03CR) 10Daimona Eaytoy: [C: 03+1] "Seems fine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: 10Huji) [18:52:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P21214 and previous config saved to /var/cache/conftool/dbconfig/20220221-185256-marostegui.json [18:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:05] (03PS19) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [18:55:56] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [18:59:10] (03PS20) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [18:59:57] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:00:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33903/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:03:42] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1003.wikimedia.org with OS bullseye [19:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33904/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:08:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P21215 and previous config saved to 
/var/cache/conftool/dbconfig/20220221-190801-marostegui.json [19:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:16] (03PS21) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [19:10:19] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:10:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33905/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:13:53] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:14:13] (03PS22) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [19:15:12] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:15:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33906/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:16:13] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:23:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300381)', diff saved to https://phabricator.wikimedia.org/P21216 and previous config saved to /var/cache/conftool/dbconfig/20220221-192309-marostegui.json [19:23:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:23:14] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:19] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [19:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:33] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [19:25:36] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [19:27:55] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [19:28:16] (03PS23) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [19:28:18] (03PS1) 10Jbond: O:netbox::standalone: remove netboxdb2001 as replica [puppet] - 10https://gerrit.wikimedia.org/r/764438 [19:28:57] (03CR) 10jerkins-bot: [V: 04-1] P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:30:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33907/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:30:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [19:30:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [19:30:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [19:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [19:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:38] (03PS24) 10Jbond: P:netbox: tidy up netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/764330 [19:34:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33908/console" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) 
[19:38:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [19:38:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [19:38:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T300381)', diff saved to https://phabricator.wikimedia.org/P21217 and previous config saved to /var/cache/conftool/dbconfig/20220221-193842-marostegui.json [19:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:53] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [19:40:07] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [19:44:20] (03CR) 10Jbond: [V: 03+1] "most recent pcc is essentially a no-op" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:44:59] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All 
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [19:46:59] (03CR) 10Jbond: [V: 03+1] "Also need to rename hiera keys in the private repo similar to" [puppet] - 10https://gerrit.wikimedia.org/r/764330 (owner: 10Jbond) [19:50:19] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:51:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300381)', diff saved to https://phabricator.wikimedia.org/P21218 and previous config saved to /var/cache/conftool/dbconfig/20220221-195147-marostegui.json [19:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:53] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [19:56:43] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [19:59:00] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [20:06:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P21219 and previous config saved to /var/cache/conftool/dbconfig/20220221-200651-marostegui.json [20:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P21220 and previous config saved to /var/cache/conftool/dbconfig/20220221-202156-marostegui.json [20:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300381)', diff saved to https://phabricator.wikimedia.org/P21221 and previous config saved to /var/cache/conftool/dbconfig/20220221-203701-marostegui.json [20:37:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [20:37:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [20:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:08] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [20:37:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T300381)', diff saved to https://phabricator.wikimedia.org/P21222 and previous config saved to /var/cache/conftool/dbconfig/20220221-203708-marostegui.json [20:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:54] (03PS1) 10Bartosz Dziewoński: Don't suppress teardown prompt when pressing escape [extensions/VisualEditor] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/764396 (https://phabricator.wikimedia.org/T302096) [20:48:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300381)', diff saved to https://phabricator.wikimedia.org/P21223 and previous config saved to /var/cache/conftool/dbconfig/20220221-204849-marostegui.json [20:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:56] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [20:59:38] jouncebot: next [20:59:39] In 11 hour(s) and 0 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220222T0800) [21:03:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P21224 and previous config saved to /var/cache/conftool/dbconfig/20220221-210354-marostegui.json [21:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P21225 and previous config saved to /var/cache/conftool/dbconfig/20220221-211859-marostegui.json [21:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:50] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:32:25] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[21:34:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300381)', diff saved to https://phabricator.wikimedia.org/P21226 and previous config saved to /var/cache/conftool/dbconfig/20220221-213403-marostegui.json [21:34:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [21:34:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [21:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:11] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [21:34:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T300381)', diff saved to https://phabricator.wikimedia.org/P21227 and previous config saved to /var/cache/conftool/dbconfig/20220221-213411-marostegui.json [21:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:17] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 142 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:41:40] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:41:55] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:42:41] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on cloudstore1008 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:45:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300381)', diff saved to https://phabricator.wikimedia.org/P21228 and previous config saved to /var/cache/conftool/dbconfig/20220221-214500-marostegui.json [21:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:07] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [21:49:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:51:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:52:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:54:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:58:13] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:59:21] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on cloudstore1008 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:00:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P21229 and previous config saved to /var/cache/conftool/dbconfig/20220221-220005-marostegui.json [22:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P21230 and previous config saved to /var/cache/conftool/dbconfig/20220221-221510-marostegui.json [22:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:37] PROBLEM - Check systemd state on durum6001 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service,ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:03] PROBLEM - Check systemd state on doh6001 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:37] PROBLEM - Check systemd state on durum6002 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:19] PROBLEM - Check systemd state on doh6002 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:28:32] (CR) MSantos: [C: +1] maps: disable kartotherian on maps masters [puppet] - https://gerrit.wikimedia.org/r/764353 (https://phabricator.wikimedia.org/T301664) (owner: Hnowlan) [22:29:15] PROBLEM - Number of messages locally queued by purged for
processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [22:30:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300381)', diff saved to https://phabricator.wikimedia.org/P21231 and previous config saved to /var/cache/conftool/dbconfig/20220221-223015-marostegui.json [22:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:22] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [22:34:11] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [22:35:36] (Processor usage over 85%) firing: Alert for device scs-oe16-esams.mgmt.esams.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [22:48:57] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [22:51:23] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [23:06:09] PROBLEM - Number of messages locally queued by purged for processing on cp6010 is CRITICAL: cluster=cache_text instance=cp6010 job=purged layer=backend site=drmrs https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [23:11:03] RECOVERY - Number of messages locally queued by purged for processing on cp6010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [23:25:36] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [23:47:34] (CR) Huji: "Thanks @Daimona." [mediawiki-config] - https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: Huji) [23:55:53] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook