[00:03:30] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:04:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [00:04:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [00:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T303603)', diff saved to https://phabricator.wikimedia.org/P28010 and previous config saved to /var/cache/conftool/dbconfig/20220519-000423-ladsgroup.json [00:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:28] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [00:06:00] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:08] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:13:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T303603)', diff saved to https://phabricator.wikimedia.org/P28011 and previous config saved to /var/cache/conftool/dbconfig/20220519-001319-ladsgroup.json [00:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:26] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [00:21:10] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:24:14] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:28:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P28012 and previous config saved to /var/cache/conftool/dbconfig/20220519-002824-ladsgroup.json [00:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:26] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:58] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:35:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298555)', diff saved to https://phabricator.wikimedia.org/P28013 and previous config saved to /var/cache/conftool/dbconfig/20220519-003536-ladsgroup.json [00:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:42] T298555: Fix 
mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [00:43:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P28014 and previous config saved to /var/cache/conftool/dbconfig/20220519-004329-ladsgroup.json [00:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P28015 and previous config saved to /var/cache/conftool/dbconfig/20220519-005041-ladsgroup.json [00:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T303603)', diff saved to https://phabricator.wikimedia.org/P28016 and previous config saved to /var/cache/conftool/dbconfig/20220519-005834-ladsgroup.json [00:58:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [00:58:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [00:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:40] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [00:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [01:05:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [01:05:09] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [01:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [01:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P28017 and previous config saved to /var/cache/conftool/dbconfig/20220519-010546-ladsgroup.json [01:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:48] (PS1) Cathal Mooney: Update routing policies for cloudgw devices [homer/public] - https://gerrit.wikimedia.org/r/793134 (https://phabricator.wikimedia.org/T304989) [01:11:29] (CR) Cathal Mooney: [C: +2] Update routing policies for cloudgw devices [homer/public] - https://gerrit.wikimedia.org/r/793134 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney) [01:11:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [01:11:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [01:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T303603)', diff saved to https://phabricator.wikimedia.org/P28018 and previous config saved to /var/cache/conftool/dbconfig/20220519-011143-ladsgroup.json [01:11:48] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:49] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [01:12:11] (Merged) jenkins-bot: Update routing policies for cloudgw devices [homer/public] - https://gerrit.wikimedia.org/r/793134 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney) [01:20:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T303603)', diff saved to https://phabricator.wikimedia.org/P28019 and previous config saved to /var/cache/conftool/dbconfig/20220519-012015-ladsgroup.json [01:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:24] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [01:20:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298555)', diff saved to https://phabricator.wikimedia.org/P28020 and previous config saved to /var/cache/conftool/dbconfig/20220519-012051-ladsgroup.json [01:20:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [01:20:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [01:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:56] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [01:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:28] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:35:22] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P28021 and previous config saved to /var/cache/conftool/dbconfig/20220519-013521-ladsgroup.json [01:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P28022 and previous config saved to /var/cache/conftool/dbconfig/20220519-015026-ladsgroup.json [01:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T303603)', diff saved to https://phabricator.wikimedia.org/P28023 and previous config saved to /var/cache/conftool/dbconfig/20220519-020532-ladsgroup.json [02:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:38] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [02:10:42] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:24:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2129.codfw.wmnet with reason: Maintenance [02:24:19] !log ladsgroup@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2129.codfw.wmnet with reason: Maintenance [02:24:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 8 hosts with reason: Maintenance [02:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 8 hosts with reason: Maintenance [02:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:40] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:57:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 5%: Maint done', diff saved to https://phabricator.wikimedia.org/P28024 and previous config saved to /var/cache/conftool/dbconfig/20220519-025710-root.json [02:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [03:00:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [03:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:03:29] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 16:00:00 on db1122.eqiad.wmnet with reason: Maintenance [03:03:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1122.eqiad.wmnet with reason: Maintenance [03:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T298560)', diff saved to https://phabricator.wikimedia.org/P28025 and previous config saved to /var/cache/conftool/dbconfig/20220519-030335-ladsgroup.json [03:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:41] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [03:09:58] PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_navtiming.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:12:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P28026 and previous config saved to /var/cache/conftool/dbconfig/20220519-031214-root.json [03:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:12:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [03:12:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [03:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T303603)', diff saved to https://phabricator.wikimedia.org/P28027 and previous config saved to /var/cache/conftool/dbconfig/20220519-031303-ladsgroup.json 
[03:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:08] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [03:27:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P28028 and previous config saved to /var/cache/conftool/dbconfig/20220519-032718-root.json [03:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T303603)', diff saved to https://phabricator.wikimedia.org/P28029 and previous config saved to /var/cache/conftool/dbconfig/20220519-032855-ladsgroup.json [03:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:01] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [03:29:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance [03:29:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance [03:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:04] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:37:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [03:37:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [03:37:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P28030 and previous config saved to /var/cache/conftool/dbconfig/20220519-034222-root.json [03:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P28031 and previous config saved to /var/cache/conftool/dbconfig/20220519-034400-ladsgroup.json [03:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [03:49:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [03:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P28032 and previous config saved to /var/cache/conftool/dbconfig/20220519-035726-root.json [03:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 5%: Maint done', diff saved to https://phabricator.wikimedia.org/P28033 and previous config saved to /var/cache/conftool/dbconfig/20220519-035730-root.json [03:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet 
with reason: Maintenance [03:57:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [03:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T303603)', diff saved to https://phabricator.wikimedia.org/P28034 and previous config saved to /var/cache/conftool/dbconfig/20220519-035754-ladsgroup.json [03:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:00] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [03:58:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P28035 and previous config saved to /var/cache/conftool/dbconfig/20220519-035820-root.json [03:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P28036 and previous config saved to /var/cache/conftool/dbconfig/20220519-035905-ladsgroup.json [03:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [04:00:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [04:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet 
with reason: Maintenance [04:12:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [04:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T303603)', diff saved to https://phabricator.wikimedia.org/P28037 and previous config saved to /var/cache/conftool/dbconfig/20220519-041410-ladsgroup.json [04:14:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [04:14:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [04:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:16] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [04:14:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T303603)', diff saved to https://phabricator.wikimedia.org/P28038 and previous config saved to /var/cache/conftool/dbconfig/20220519-041418-ladsgroup.json [04:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [04:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [04:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T303603)', diff saved to https://phabricator.wikimedia.org/P28039 and 
previous config saved to /var/cache/conftool/dbconfig/20220519-041427-ladsgroup.json [04:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [04:25:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [04:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T303603)', diff saved to https://phabricator.wikimedia.org/P28040 and previous config saved to /var/cache/conftool/dbconfig/20220519-043057-ladsgroup.json [04:30:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [04:31:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [04:31:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:04] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [04:31:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:31:06] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, 
down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T303603)', diff saved to https://phabricator.wikimedia.org/P28041 and previous config saved to /var/cache/conftool/dbconfig/20220519-043110-ladsgroup.json [04:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:31:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T303603)', diff saved to https://phabricator.wikimedia.org/P28042 and previous config saved to /var/cache/conftool/dbconfig/20220519-043139-ladsgroup.json [04:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [04:37:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [04:37:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [04:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [04:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:20] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:38:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [04:38:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [04:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298555)', diff saved to https://phabricator.wikimedia.org/P28043 and previous config saved to /var/cache/conftool/dbconfig/20220519-043858-ladsgroup.json [04:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:04] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [04:40:30] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:40:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:46:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P28044 and previous config saved to 
/var/cache/conftool/dbconfig/20220519-044644-ladsgroup.json [04:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T303603)', diff saved to https://phabricator.wikimedia.org/P28045 and previous config saved to /var/cache/conftool/dbconfig/20220519-044805-ladsgroup.json [04:48:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [04:48:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [04:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:11] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [04:48:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T303603)', diff saved to https://phabricator.wikimedia.org/P28046 and previous config saved to /var/cache/conftool/dbconfig/20220519-044813-ladsgroup.json [04:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [04:54:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [04:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T303603)', diff saved to https://phabricator.wikimedia.org/P28047 and previous config saved to /var/cache/conftool/dbconfig/20220519-045412-ladsgroup.json [04:54:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:18] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [05:01:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P28048 and previous config saved to /var/cache/conftool/dbconfig/20220519-050149-ladsgroup.json [05:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:28] (CR) Marostegui: [C: +2] data.yaml: Add Marina Azevedo to ldap users [puppet] - https://gerrit.wikimedia.org/r/793061 (https://phabricator.wikimedia.org/T308603) (owner: Marostegui) [05:04:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T303603)', diff saved to https://phabricator.wikimedia.org/P28049 and previous config saved to /var/cache/conftool/dbconfig/20220519-050404-ladsgroup.json [05:04:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [05:04:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [05:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:11] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [05:04:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T303603)', diff saved to https://phabricator.wikimedia.org/P28050 and previous config saved to /var/cache/conftool/dbconfig/20220519-050412-ladsgroup.json [05:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[05:06:00] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf LDAP Group for Mazevedo - https://phabricator.wikimedia.org/T308603 (Marostegui) Open→Resolved a:Marostegui Added to LDAP wmf group Added to analytics_privatedata_users Added to wmf-nda Phabricator group Please give it aro... [05:07:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T303603)', diff saved to https://phabricator.wikimedia.org/P28051 and previous config saved to /var/cache/conftool/dbconfig/20220519-050738-ladsgroup.json [05:07:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [05:07:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [05:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T303603)', diff saved to https://phabricator.wikimedia.org/P28052 and previous config saved to /var/cache/conftool/dbconfig/20220519-050746-ladsgroup.json [05:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:46] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:15:36] (PS1) Marostegui: wmnet: Failover m1-master [dns] - https://gerrit.wikimedia.org/r/793142 (https://phabricator.wikimedia.org/T307673) [05:16:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T303603)', diff saved to https://phabricator.wikimedia.org/P28053 and previous config saved to 
/var/cache/conftool/dbconfig/20220519-051654-ladsgroup.json [05:16:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [05:16:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [05:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:01] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [05:17:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T303603)', diff saved to https://phabricator.wikimedia.org/P28054 and previous config saved to /var/cache/conftool/dbconfig/20220519-051702-ladsgroup.json [05:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T303603)', diff saved to https://phabricator.wikimedia.org/P28055 and previous config saved to /var/cache/conftool/dbconfig/20220519-052039-ladsgroup.json [05:20:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [05:20:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [05:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T303603)', diff saved to https://phabricator.wikimedia.org/P28056 and previous config saved to /var/cache/conftool/dbconfig/20220519-052047-ladsgroup.json [05:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28057 and previous config saved to /var/cache/conftool/dbconfig/20220519-052218-ladsgroup.json [05:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T303603)', diff saved to https://phabricator.wikimedia.org/P28058 and previous config saved to /var/cache/conftool/dbconfig/20220519-052303-ladsgroup.json [05:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:08] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [05:23:16] (03PS2) 10KartikMistry: Enable Section Translation in as, gu, kn, mk and, mr Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792559 (https://phabricator.wikimedia.org/T304828) [05:24:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s1 T301312 [05:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:26] T301312: Switchover s1 master (db1118 -> db1163) - https://phabricator.wikimedia.org/T301312 [05:24:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s1 T301312 [05:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1163 with weight 0 T301312', diff saved to https://phabricator.wikimedia.org/P28059 and previous config saved to /var/cache/conftool/dbconfig/20220519-052517-ladsgroup.json [05:25:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:34] (03CR) 10Ladsgroup: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/792707 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [05:32:39] (03PS2) 10Ladsgroup: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/792707 (https://phabricator.wikimedia.org/T301312) [05:32:46] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/792707 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [05:33:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T303603)', diff saved to https://phabricator.wikimedia.org/P28060 and previous config saved to /var/cache/conftool/dbconfig/20220519-053344-ladsgroup.json [05:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:50] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [05:48:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P28061 and previous config saved to /var/cache/conftool/dbconfig/20220519-054849-ladsgroup.json [05:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:55:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298555)', diff saved to https://phabricator.wikimedia.org/P28062 and previous config saved to /var/cache/conftool/dbconfig/20220519-055545-ladsgroup.json [05:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[05:55:51] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [06:00:04] kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T0600). [06:00:07] o/ [06:00:08] !log Starting s1 eqiad failover from db1118 to db1163 - T301312 [06:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:14] T301312: Switchover s1 master (db1118 -> db1163) - https://phabricator.wikimedia.org/T301312 [06:00:23] o/ [06:00:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T301312', diff saved to https://phabricator.wikimedia.org/P28063 and previous config saved to /var/cache/conftool/dbconfig/20220519-060023-ladsgroup.json [06:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:33] ro confirmed [06:01:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1163 to s1 primary and set section read-write T301312', diff saved to https://phabricator.wikimedia.org/P28064 and previous config saved to /var/cache/conftool/dbconfig/20220519-060119-ladsgroup.json [06:01:20] topology looking good [06:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:33] I can write again [06:01:45] done \o/ [06:01:59] I have cleaned up orchestrator [06:03:01] updated query killer [06:03:19] (03PS2) 10Ladsgroup: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/792709 (https://phabricator.wikimedia.org/T301312) [06:03:22] (03CR) 10Ladsgroup: [C: 03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/792709 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [06:03:53] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmnet: Update s1-master alias [dns] - 
10https://gerrit.wikimedia.org/r/792709 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [06:03:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P28065 and previous config saved to /var/cache/conftool/dbconfig/20220519-060354-ladsgroup.json [06:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1118 T301312', diff saved to https://phabricator.wikimedia.org/P28066 and previous config saved to /var/cache/conftool/dbconfig/20220519-060542-ladsgroup.json [06:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:47] T301312: Switchover s1 master (db1118 -> db1163) - https://phabricator.wikimedia.org/T301312 [06:05:55] marostegui: db1118 is depooled all yours [06:06:00] do you want to do the honors? [06:06:28] Amir1: Should I run all the schema changes I have pending? [06:06:43] let me fix its weight first [06:07:12] whatever you like, if you want me to do the ones assigned to me first, you want to do all, bullseye upgrade, etc. 
[06:07:34] Amir1: do yours first, as I am on clinic duty [06:07:43] ah yeah [06:07:52] my condolences [06:07:57] let me know once you are fully done with your schema changes [06:08:19] sure [06:10:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P28067 and previous config saved to /var/cache/conftool/dbconfig/20220519-061050-ladsgroup.json [06:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1118.eqiad.wmnet with reason: Maint [06:13:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1118.eqiad.wmnet with reason: Maint [06:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:58] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:18:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T303603)', diff saved to https://phabricator.wikimedia.org/P28068 and previous config saved to /var/cache/conftool/dbconfig/20220519-061859-ladsgroup.json [06:19:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [06:19:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [06:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:06] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [06:19:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T303603)', diff saved to 
https://phabricator.wikimedia.org/P28069 and previous config saved to /var/cache/conftool/dbconfig/20220519-061907-ladsgroup.json [06:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:53] (03PS2) 10Ladsgroup: db1118: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/792708 (https://phabricator.wikimedia.org/T301312) [06:19:56] (03CR) 10Ladsgroup: [C: 03+2] db1118: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/792708 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [06:19:58] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1118: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/792708 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [06:25:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P28070 and previous config saved to /var/cache/conftool/dbconfig/20220519-062555-ladsgroup.json [06:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T303603)', diff saved to https://phabricator.wikimedia.org/P28071 and previous config saved to /var/cache/conftool/dbconfig/20220519-063452-ladsgroup.json [06:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:58] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [06:41:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298555)', diff saved to https://phabricator.wikimedia.org/P28072 and previous config saved to /var/cache/conftool/dbconfig/20220519-064100-ladsgroup.json [06:41:02] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [06:41:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [06:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:07] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [06:41:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298555)', diff saved to https://phabricator.wikimedia.org/P28073 and previous config saved to /var/cache/conftool/dbconfig/20220519-064108-ladsgroup.json [06:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:16] !log dbmaint s6@eqiad T298557 [06:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:23] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [06:42:08] !log dbmaint s1@eqiad T298557 [06:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1150.eqiad.wmnet with reason: Maintenance [06:44:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1150.eqiad.wmnet with reason: Maintenance [06:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:48] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] netops: move network routers/devices definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 
(https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [06:49:00] <_joe_> jouncebot: nowandnext [06:49:00] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [06:49:00] In 0 hour(s) and 10 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T0700) [06:49:45] (03PS2) 10Marostegui: wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/793142 (https://phabricator.wikimedia.org/T307673) [06:49:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P28074 and previous config saved to /var/cache/conftool/dbconfig/20220519-064957-ladsgroup.json [06:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:22] (03PS14) 10Filippo Giunchedi: netops: ping core routers from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [06:54:59] (03CR) 10jenkins-bot: [V: 04-1] netops: ping core routers from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [06:58:05] _joe_: we've backport deployment. Not sure why jouncebot not showing it. [06:58:14] jouncebot: refresh [06:58:14] I refreshed my knowledge about deployments. [06:58:24] jouncebot: next [06:58:24] In 0 hour(s) and 1 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T0700) [06:58:26] <_joe_> kart_: it does, actually [06:58:30] <_joe_> it did [06:58:39] oh.
'now' :) [06:59:28] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35398/console" [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [07:00:04] Amir1 and apergos: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:06] morning. no trainees signed up today, only the one patch in the window, looks fine to me, kart_ are you going to self deploy? [07:00:14] (03CR) 10Muehlenhoff: [C: 03+2] Add SPDX headers for routinator/diffscan/bgpalerter/gobgpd/homer [puppet] - 10https://gerrit.wikimedia.org/r/792579 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:01:09] apergos: yes. self deploy. [07:01:18] okey dokey! 
go for it [07:01:50] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:03] (03CR) 10Filippo Giunchedi: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [07:02:24] (03CR) 10KartikMistry: [C: 03+2] Enable Section Translation in as, gu, kn, mk and, mr Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792559 (https://phabricator.wikimedia.org/T304828) (owner: 10KartikMistry) [07:03:09] (03Merged) 10jenkins-bot: Enable Section Translation in as, gu, kn, mk and, mr Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792559 (https://phabricator.wikimedia.org/T304828) (owner: 10KartikMistry) [07:04:44] (03CR) 10Zabe: vagrant: add shebang to alias-vagrant-profile-d.sh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792694 (owner: 10Zabe) [07:05:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P28075 and previous config saved to /var/cache/conftool/dbconfig/20220519-070502-ladsgroup.json [07:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:05:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:05:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298557)', diff saved to https://phabricator.wikimedia.org/P28076 and previous config saved to /var/cache/conftool/dbconfig/20220519-070533-marostegui.json [07:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:38] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [07:06:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:09] Looks good on mwdebug1001. [07:06:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:06:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:37] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:792559|Enable Section Translation in as, gu, kn, mk and, mr Wikipedias (T304828)]] (duration: 00m 53s) [07:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:41] T304828: Enable Section Translation in 13 wikis where Content Translation is already available as default - https://phabricator.wikimedia.org/T304828 [07:07:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:01] apergos: I'm done :) [07:08:10] zoooooom! 
[07:08:12] ok then [07:08:37] anyone else that wants to get a patch in, step up, otherwise I'm gonna wander off after about 10 minutes of waiting [07:08:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:09:43] (03Abandoned) 10Muehlenhoff: Remove wiki-mail-codfw [dns] - 10https://gerrit.wikimedia.org/r/723432 (owner: 10Muehlenhoff) [07:10:06] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] netops: ping core routers from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [07:10:28] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48107 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:10:36] (03CR) 10Slyngshede: [C: 03+2] Remove old cron calls. [puppet] - 10https://gerrit.wikimedia.org/r/793040 (https://phabricator.wikimedia.org/T790325) (owner: 10Slyngshede) [07:11:58] slyngs: I think we might have merged each other's changes! 
(totally fine of course) [07:12:00] PROBLEM - Check systemd state on webperf1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_navtiming.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:12:19] Yep, I was just about to write [07:12:24] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:12:31] Mine are just cleanups, so no problem [07:12:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:12:57] yeah mine is introducing new stuff, also no problem (in theory) [07:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:13:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:03] (03PS5) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) [07:18:28] !log dbmaint s1@eqiad T300381 [07:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:33] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [07:20:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T303603)', diff saved to https://phabricator.wikimedia.org/P28077 and previous config saved to /var/cache/conftool/dbconfig/20220519-072007-ladsgroup.json [07:20:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [07:20:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [07:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:13] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [07:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:16] (03CR) 10Muehlenhoff: "Looks good, a few nits inline. And let's summarise the changes in the commit message, since this changes does multiple things together (ty" [puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [07:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:17] (03PS6) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) [07:24:21] !log hashar@deploy1002 Started deploy [integration/docroot@8615678]: Fix links to non-existent Grafana graphs - T307405 [07:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:26] T307405: Broken dashboard links on Zuul Status page - https://phabricator.wikimedia.org/T307405 [07:24:30] !log hashar@deploy1002 Finished deploy [integration/docroot@8615678]: Fix links to non-existent Grafana graphs - T307405 (duration: 00m 09s) [07:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:57] (03CR) 10Muehlenhoff: [C: 03+2] vagrant: add shebang to alias-vagrant-profile-d.sh [puppet] - 10https://gerrit.wikimedia.org/r/792694 (owner: 10Zabe) [07:25:14] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:25:24] (03PS1) 10Gergő Tisza: GrowthExperiments: Enable Add Link frontend on tier 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793395 (https://phabricator.wikimedia.org/T304542) [07:26:39] 10SRE, 10vm-requests: eqiad/codfw: 1 of VMs requested for MX - https://phabricator.wikimedia.org/T286208 (10Marostegui) was this done? [07:29:02] 10SRE, 10vm-requests: eqiad/codfw: 1 of VMs requested for MX - https://phabricator.wikimedia.org/T286208 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Oh, yes. This is complete. These VMs have already been decommissioned in the meantime :-) (They were used for Bullseye update tests) [07:32:32] (03PS8) 10Slyngshede: Modernize aptrepo module.
[puppet] - 10https://gerrit.wikimedia.org/r/792975 [07:32:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:32:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:06] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:39:40] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:39:40] (03PS1) 10Muehlenhoff: striker: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793397 (https://phabricator.wikimedia.org/T308013) [07:39:42] (03PS1) 10Muehlenhoff: bird/fastnetmon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793398 (https://phabricator.wikimedia.org/T308013) [07:39:44] (03PS1) 10Muehlenhoff: gitlab/gitlab_runner: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793399 (https://phabricator.wikimedia.org/T308013) [07:39:46] (03PS1) 10Muehlenhoff: dnsdist: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793400 (https://phabricator.wikimedia.org/T308013) [07:39:48] (03PS1) 10Muehlenhoff: purged: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793401 [07:40:05] (03CR) 10Slyngshede: "Typos fixed and add a more detailed commit message." 
[puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [07:42:27] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [07:42:52] (03CR) 10Slyngshede: [C: 03+2] Move Hadoop eventlogs cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:43:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1007.eqiad.wmnet [07:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [07:45:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [07:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T303603)', diff saved to https://phabricator.wikimedia.org/P28078 and previous config saved to /var/cache/conftool/dbconfig/20220519-074538-ladsgroup.json [07:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:43] (03PS1) 10Mainframe98: Remove wgPriorityHints and wgPriorityHintsRatio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) [07:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:45] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [07:47:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298555)', diff saved to https://phabricator.wikimedia.org/P28079 and previous config saved to /var/cache/conftool/dbconfig/20220519-074748-ladsgroup.json [07:47:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:54] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [07:48:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1007.eqiad.wmnet [07:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298557)', diff saved to https://phabricator.wikimedia.org/P28080 and previous config saved to /var/cache/conftool/dbconfig/20220519-075046-marostegui.json [07:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:52] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [07:53:20] (03PS1) 10Muehlenhoff: Remove webperf1001/webperf2001 [puppet] - 10https://gerrit.wikimedia.org/r/793403 (https://phabricator.wikimedia.org/T305460) [07:57:11] (03PS2) 10Muehlenhoff: bird/fastnetmon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793398 (https://phabricator.wikimedia.org/T308013) [07:58:09] (03CR) 10Slyngshede: [C: 03+2] Modernize aptrepo module. [puppet] - 10https://gerrit.wikimedia.org/r/792975 (owner: 10Slyngshede) [08:00:05] jnuche and hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T0800). 
[08:00:41] hi, train is currently blocked on https://phabricator.wikimedia.org/T308691 [08:00:46] no deploy for the moment [08:02:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P28081 and previous config saved to /var/cache/conftool/dbconfig/20220519-080253-ladsgroup.json [08:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T303603)', diff saved to https://phabricator.wikimedia.org/P28082 and previous config saved to /var/cache/conftool/dbconfig/20220519-080427-ladsgroup.json [08:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:33] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [08:05:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P28083 and previous config saved to /var/cache/conftool/dbconfig/20220519-080551-marostegui.json [08:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:04] (03CR) 10MVernon: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793142 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [08:06:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2061.codfw.wmnet with OS bullseye [08:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:31] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2061.codfw.wmnet with OS bullseye [08:06:42] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/793142 
(https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [08:06:57] !log Failover m1 master T307673 [08:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:52] (03PS1) 10Marostegui: wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/793405 (https://phabricator.wikimedia.org/T307673) [08:09:01] (03CR) 10Marostegui: [C: 04-2] "not yet" [dns] - 10https://gerrit.wikimedia.org/r/793405 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [08:10:00] (03CR) 10Muehlenhoff: [C: 03+2] Remove webperf1001/webperf2001 [puppet] - 10https://gerrit.wikimedia.org/r/793403 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [08:12:37] Lucas_WMDE: Guten Tag, do you know whether Tiemo is around? Jnuche is running the train this week and we got a blocker with FileImporter https://phabricator.wikimedia.org/T308691 [08:12:49] 10SRE, 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10Marostegui) Any ETA? We've got the host depooled for now but I would like to repool it before the weekend if possible. 
[08:13:24] not sure, he should be in a meeting with me at the moment but isn’t there yet [08:14:17] otherwise, I wouldn’t expect any WMDE people after 12:00 UTC or so today, there’s a WMDE-wide event in the afternoon [08:14:22] but maybe he’ll be there until then [08:14:27] if I see him I’ll let him know [08:14:53] (03PS1) 10Marostegui: db1164: Disable notificatins [puppet] - 10https://gerrit.wikimedia.org/r/793407 [08:16:20] Lucas_WMDE: danke :) [08:16:30] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts webperf2001.codfw.wmnet [08:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P28084 and previous config saved to /var/cache/conftool/dbconfig/20220519-081758-ladsgroup.json [08:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:47] hashar: I’m being told some people are on it :) [08:18:54] (and Thiemo just updated the task a bit) [08:19:14] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P28085 and previous config saved to /var/cache/conftool/dbconfig/20220519-081932-ladsgroup.json [08:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:02] Lucas_WMDE: thx! 
[08:20:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P28086 and previous config saved to /var/cache/conftool/dbconfig/20220519-082056-marostegui.json [08:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:10] (03CR) 10Marostegui: [C: 03+2] db1164: Disable notificatins [puppet] - 10https://gerrit.wikimedia.org/r/793407 (owner: 10Marostegui) [08:22:12] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: fix gitlab-ce apt component on bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/793046 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [08:22:35] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:54] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for Tsevener - https://phabricator.wikimedia.org/T308616 (10Marostegui) @Ottomata does this user need access to analytics-private-user? [08:27:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:27:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts webperf2001.codfw.wmnet [08:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:28] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `webperf2001.codfw.wmnet` - webperf2001.codfw.wmnet (**PASS**) - Downtimed host on Icinga... 
[08:28:19] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts webperf1001.eqiad.wmnet [08:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:29:41] (03CR) 10Gehel: elastic: add reimage to rolling-operation (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [08:32:38] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (10Marostegui) @BBlack anything left here or can this be closed? [08:33:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298555)', diff saved to https://phabricator.wikimedia.org/P28087 and previous config saved to /var/cache/conftool/dbconfig/20220519-083303-ladsgroup.json [08:33:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance [08:33:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance [08:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:10] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [08:33:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298555)', diff saved to https://phabricator.wikimedia.org/P28088 and previous config saved to /var/cache/conftool/dbconfig/20220519-083311-ladsgroup.json [08:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:22] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:14] 10SRE, 10serviceops: Jenkins fails onCI puppet with: EnvironmentError: 404 Client Error: Not Found for url: https://pypi.org/simple/pkg-resources/ - https://phabricator.wikimedia.org/T279307 (10Marostegui) 05Open→03Resolved This was finally merged so I am considering this resolved. Reopen if needed [08:34:19] (03CR) 10MVernon: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793405 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [08:34:23] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1009.eqiad.wmnet [08:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:27] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/793405 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [08:34:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P28089 and previous config saved to /var/cache/conftool/dbconfig/20220519-083437-ladsgroup.json [08:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:47] !log Failover m2 master T307673 [08:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298557)', diff 
saved to https://phabricator.wikimedia.org/P28090 and previous config saved to /var/cache/conftool/dbconfig/20220519-083601-marostegui.json [08:36:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1110.eqiad.wmnet with reason: Maintenance [08:36:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1110.eqiad.wmnet with reason: Maintenance [08:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:07] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [08:36:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298557)', diff saved to https://phabricator.wikimedia.org/P28091 and previous config saved to /var/cache/conftool/dbconfig/20220519-083609-marostegui.json [08:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:51] 10SRE, 10Sustainability (Incident Followup): 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10Marostegui) 05Open→03Resolved a:03Marostegui Considering this fixed and the Incident Report is at https://wikitech.wikimedia.org/wiki/Incidents/2021-03-14_MediaWiki_API Reopen if... [08:37:16] (03CR) 10Gehel: "I have mixed feelings about this. As an operator, I want to know about things that require my attention (so WARN and higher). 
I'm suspicio" [puppet] - 10https://gerrit.wikimedia.org/r/792266 (https://phabricator.wikimedia.org/T306899) (owner: 10Ebernhardson) [08:37:47] 10SRE, 10SRE-OnFire, 10Product-Infrastructure-Team-Backlog, 10Maps (Kartotherian), 10Sustainability (Incident Followup): Kartotherian/Maps outage followups, 2020-10-29 - https://phabricator.wikimedia.org/T266807 (10Marostegui) [08:37:53] (03CR) 10Gehel: [C: 03+1] elastic: remove decommissioned hosts in beta [puppet] - 10https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [08:38:17] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2061.codfw.wmnet with OS bullseye [08:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:22] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2061.codfw.wmnet with OS bullseye executed with errors: - ms-be2061 (**FAIL**)... [08:38:36] 10SRE, 10SRE-OnFire, 10Product-Infrastructure-Team-Backlog, 10Maps (Kartotherian), 10Sustainability (Incident Followup): Kartotherian/Maps outage followups, 2020-10-29 - https://phabricator.wikimedia.org/T266807 (10Marostegui) @lmata what should we do with old task? [08:38:42] (03PS1) 10Jelto: idp: add gitlab-replica-new to idp [puppet] - 10https://gerrit.wikimedia.org/r/793409 (https://phabricator.wikimedia.org/T307142) [08:39:11] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1009.eqiad.wmnet [08:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:33] 10SRE: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10Marostegui) @MoritzMuehlenhoff good to close? 
[08:39:50] (03PS7) 10Muehlenhoff: Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:40:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:40:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts webperf1001.eqiad.wmnet [08:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:12] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `webperf1001.eqiad.wmnet` - webperf1001.eqiad.wmnet (**PASS**) - Downtimed host on Icinga... [08:42:09] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: ats-tls ran out of FDs on cp1089 - https://phabricator.wikimedia.org/T248736 (10Marostegui) @Vgutierrez is it worth keeping this task open? [08:42:23] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1010.eqiad.wmnet [08:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:50] (03CR) 10Jelto: "I'm not sure if we need a dedicated entry here for gitlab-replica-new, as this is temporary and for migration only. 
I thought maybe we can" [puppet] - 10https://gerrit.wikimedia.org/r/793409 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [08:43:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1008.eqiad.wmnet [08:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:44] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: ats-tls ran out of FDs on cp1089 - https://phabricator.wikimedia.org/T248736 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez I believe that we can safely close this one now as we moved away from ats-tls [08:46:57] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1010.eqiad.wmnet [08:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:00] (03PS1) 10Thiemo Kreuz (WMDE): Revert "Fix bogus user object creation in WikiRevisionFactory" [extensions/FileImporter] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793157 (https://phabricator.wikimedia.org/T308691) [08:48:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2061.codfw.wmnet with OS bullseye [08:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:11] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2061.codfw.wmnet with OS bullseye [08:48:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1008.eqiad.wmnet [08:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:51] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1011.eqiad.wmnet [08:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 
2:00:00 on ms-be2061.codfw.wmnet with reason: host reimage [08:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T303603)', diff saved to https://phabricator.wikimedia.org/P28092 and previous config saved to /var/cache/conftool/dbconfig/20220519-084942-ladsgroup.json [08:49:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [08:49:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [08:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:49:48] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [08:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T303603)', diff saved to https://phabricator.wikimedia.org/P28093 and previous config saved to /var/cache/conftool/dbconfig/20220519-084956-ladsgroup.json [08:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:37] (03CR) 10Jbond: idp: add gitlab-replica-new to idp (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/793409 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [08:51:03] 10SRE, 10Community-Tech, 10MediaWiki-extensions-PageAssessments, 10Performance Issue: Issues with purgeUnusedProjects.php cron job on mwmaint1002 (Fri Oct 26, 2018) - https://phabricator.wikimedia.org/T208231 (10Marostegui) 05Open→03Resolved I am going to close this as fixed, please reopen if needed. [08:53:31] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1011.eqiad.wmnet [08:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2061.codfw.wmnet with reason: host reimage [08:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:17] 10SRE, 10Icinga, 10Observability-Alerting, 10observability, and 2 others: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (10Marostegui) @lmata thoughts? [08:55:20] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1012.eqiad.wmnet [08:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:34] 10SRE, 10MediaWiki-General, 10Sustainability (Incident Followup): Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475 (10Marostegui) 05Open→03Declined This is impossible to investigate anymore - closing it. Reopen if needed. 
[08:56:19] 10SRE, 10Cloud-Services, 10DBA, 10Infrastructure-Foundations, and 2 others: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999 (10Marostegui) [08:57:15] (03CR) 10Gehel: [C: 03+1] Revert "elastic: increase recovery time" [cookbooks] - 10https://gerrit.wikimedia.org/r/784724 (https://phabricator.wikimedia.org/T305994) (owner: 10Bking) [08:58:51] (03PS2) 10Jelto: idp: add gitlab-replica-new to idp [puppet] - 10https://gerrit.wikimedia.org/r/793409 (https://phabricator.wikimedia.org/T307142) [08:59:15] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:59:41] (03CR) 10Jelto: idp: add gitlab-replica-new to idp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793409 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:00:15] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:00:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298557)', diff saved to https://phabricator.wikimedia.org/P28094 and previous config saved to /var/cache/conftool/dbconfig/20220519-090044-marostegui.json [09:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:50] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [09:01:14] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1012.eqiad.wmnet [09:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1009.eqiad.wmnet [09:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:40] (03CR) 10jerkins-bot: [V: 04-1] Revert "Fix bogus user object creation in WikiRevisionFactory" [extensions/FileImporter] 
(wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793157 (https://phabricator.wikimedia.org/T308691) (owner: 10Thiemo Kreuz (WMDE)) [09:03:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5002.eqsin.wmnet with OS bullseye [09:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:34] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5002.eqsin.wmnet with OS bullseye [09:03:39] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1013.eqiad.wmnet [09:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:45] 10SRE: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10MoritzMuehlenhoff) [09:05:16] 10SRE: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This has been completed for quite some time. 
[09:06:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1009.eqiad.wmnet [09:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:21] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:07:54] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/793414 (https://phabricator.wikimedia.org/T307673) [09:07:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T303603)', diff saved to https://phabricator.wikimedia.org/P28095 and previous config saved to /var/cache/conftool/dbconfig/20220519-090756-ladsgroup.json [09:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:02] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [09:08:24] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1013.eqiad.wmnet [09:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2061.codfw.wmnet with OS bullseye [09:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:06] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2061.codfw.wmnet with OS bullseye completed: - ms-be2061 (**PASS**) - Removed... 
[09:11:07] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1014.eqiad.wmnet [09:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:16] (03PS2) 10Thiemo Kreuz (WMDE): Revert "Fix bogus user object creation in WikiRevisionFactory" [extensions/FileImporter] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793157 (https://phabricator.wikimedia.org/T308691) [09:12:33] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH @RobH I've drained all primary instances away from ganeti4002. Before you swap the DIMM simply set downtime and power the server down. And when t... [09:15:03] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1014.eqiad.wmnet [09:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P28096 and previous config saved to /var/cache/conftool/dbconfig/20220519-091549-marostegui.json [09:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:56] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1015.eqiad.wmnet [09:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:18] (03PS1) 10Giuseppe Lavagetto: deployment_server: switch the deployment group to 'deployment' [puppet] - 10https://gerrit.wikimedia.org/r/793416 (https://phabricator.wikimedia.org/T303857) [09:17:20] (03PS1) 10Giuseppe Lavagetto: deployment_server: use helm_user_group everywhere for consistency [puppet] - 10https://gerrit.wikimedia.org/r/793417 [09:17:22] (03PS1) 10Giuseppe Lavagetto: mediawiki::system_users: add mwpresync [puppet] - 10https://gerrit.wikimedia.org/r/793418 
(https://phabricator.wikimedia.org/T303857) [09:19:25] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:20:57] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1015.eqiad.wmnet [09:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793409 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:23:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P28097 and previous config saved to /var/cache/conftool/dbconfig/20220519-092301-ladsgroup.json [09:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:39] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:26:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1010.eqiad.wmnet [09:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:11] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:30:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P28098 and previous config saved to /var/cache/conftool/dbconfig/20220519-093054-marostegui.json [09:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:03] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1010.eqiad.wmnet [09:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:05] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:33:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298555)', diff saved to https://phabricator.wikimedia.org/P28099 and previous config saved to /var/cache/conftool/dbconfig/20220519-093326-ladsgroup.json [09:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:32] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [09:35:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5002.eqsin.wmnet with reason: host reimage [09:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:01] (03CR) 10ArielGlenn: Add Clarkson university host to list of dumps mirrors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793085 (owner: 10Hokwelum) [09:38:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P28100 and previous config saved to /var/cache/conftool/dbconfig/20220519-093806-ladsgroup.json [09:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5002.eqsin.wmnet with reason: host reimage [09:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:32] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to 
accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:42:42] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:44:57] 10SRE, 10Infrastructure-Foundations, 10netops: 2021-08-26 Primary inbound port utilisation over 80% page for mr1-esams.wikimedia.org - https://phabricator.wikimedia.org/T289820 (10Marostegui) 05Open→03Resolved a:03ayounsi Fixed per the above comment [09:46:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298557)', diff saved to https://phabricator.wikimedia.org/P28101 and previous config saved to /var/cache/conftool/dbconfig/20220519-094559-marostegui.json [09:46:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:46:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:06] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [09:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298557)', diff saved to https://phabricator.wikimedia.org/P28102 and previous config saved to /var/cache/conftool/dbconfig/20220519-094607-marostegui.json [09:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1131', diff saved to https://phabricator.wikimedia.org/P28103 and previous config saved to /var/cache/conftool/dbconfig/20220519-094831-ladsgroup.json [09:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T303603)', diff saved to https://phabricator.wikimedia.org/P28104 and previous config saved to /var/cache/conftool/dbconfig/20220519-095311-ladsgroup.json [09:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [09:53:17] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [09:53:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [09:53:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [09:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [09:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T1000). 
[10:00:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793409 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [10:00:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5002.eqsin.wmnet with OS bullseye [10:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:42] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5002.eqsin.wmnet with OS bullseye completed: - ganeti5002 (**PASS**) - Downtimed on Icinga/Aler... [10:01:06] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [10:01:27] <_joe_> jouncebot: next [10:01:27] In 2 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T1300) [10:03:21] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793121 (https://phabricator.wikimedia.org/T308672) (owner: 10Dwisehaupt) [10:03:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P28105 and previous config saved to /var/cache/conftool/dbconfig/20220519-100336-ladsgroup.json [10:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:21] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35399/console" [puppet] - 10https://gerrit.wikimedia.org/r/793416 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto) [10:07:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:07:21] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28106 and previous config saved to /var/cache/conftool/dbconfig/20220519-100725-ladsgroup.json [10:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:32] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [10:07:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:08:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:11:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298557)', diff saved to https://phabricator.wikimedia.org/P28107 and previous config saved to /var/cache/conftool/dbconfig/20220519-101108-marostegui.json [10:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:14] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [10:16:01] (03CR) 10MVernon: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793414 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [10:18:40] !log Failover m3 master T307673 [10:18:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298555)', diff saved to https://phabricator.wikimedia.org/P28108 and previous config saved to 
/var/cache/conftool/dbconfig/20220519-101841-ladsgroup.json [10:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:47] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/793414 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [10:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:49] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [10:18:56] (03PS1) 10Majavah: P:wikidough: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793421 [10:19:19] (03PS2) 10Majavah: P:wikidough: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793421 (https://phabricator.wikimedia.org/T308601) [10:20:04] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35400/console" [puppet] - 10https://gerrit.wikimedia.org/r/793421 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [10:21:03] (03CR) 10Majavah: P:wikidough: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793421 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [10:22:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet [10:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:51] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/793099 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [10:24:14] (03PS1) 10Marostegui: wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/793422 (https://phabricator.wikimedia.org/T307673) [10:25:34] (03CR) 10Majavah: [V: 03+1] monitoring: use nrpe::plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793099 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [10:26:14] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P28109 and previous config saved to /var/cache/conftool/dbconfig/20220519-102613-marostegui.json [10:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet [10:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793099 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [10:28:51] (03PS1) 10Majavah: openstack: remove unused check_ssl_certfile [puppet] - 10https://gerrit.wikimedia.org/r/793423 [10:31:21] godog: ^ do you want someone else to review/merge that patch? (or just forgot I can't merge puppet things myself?) [10:36:58] 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10Marostegui) @BBlack anything else pending? 
[10:41:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P28110 and previous config saved to /var/cache/conftool/dbconfig/20220519-104119-marostegui.json [10:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:52] (03PS1) 10Jbond: labs - puppet_alert.py: Update script to output last log messages [puppet] - 10https://gerrit.wikimedia.org/r/793427 [10:54:08] (03CR) 10jerkins-bot: [V: 04-1] labs - puppet_alert.py: Update script to output last log messages [puppet] - 10https://gerrit.wikimedia.org/r/793427 (owner: 10Jbond) [10:54:52] (03PS1) 10Cathal Mooney: Adjust routing-options template for ASWs to enable ECMP always [homer/public] - 10https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) [10:56:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298557)', diff saved to https://phabricator.wikimedia.org/P28112 and previous config saved to /var/cache/conftool/dbconfig/20220519-105624-marostegui.json [10:56:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:56:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:56:30] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [10:56:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:56:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298557)', diff saved to https://phabricator.wikimedia.org/P28113 and previous config saved to /var/cache/conftool/dbconfig/20220519-105637-marostegui.json [10:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:43] (03PS2) 10Jbond: labs - puppet_alert.py: Update script to output last log messages [puppet] - 10https://gerrit.wikimedia.org/r/793427 [10:59:51] (03CR) 10Jbond: P:etcd::tlsproxy: move to cfssl pki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [11:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:04:00] (03PS1) 10Majavah: replace wmflib's ensuremounted with stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/793430 (https://phabricator.wikimedia.org/T308639) [11:05:16] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35401/console" [puppet] - 10https://gerrit.wikimedia.org/r/793430 (https://phabricator.wikimedia.org/T308639) (owner: 10Majavah) [11:06:20] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793057 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [11:07:25] (03CR) 10Jbond: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793096 (https://phabricator.wikimedia.org/T308601) 
(owner: 10Majavah) [11:07:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28114 and previous config saved to /var/cache/conftool/dbconfig/20220519-110740-ladsgroup.json [11:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:46] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:08:23] (03CR) 10Jbond: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793102 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [11:16:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:17:01] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:18:33] taavi: I'll merge, I got called to lunch :) [11:18:46] (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: use nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793099 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [11:19:41] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10jbond) > Not directly because of the datacenter-ops group but you get it from the LDAP ops group and John is in that group. So that should work For m... 
[11:20:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298557)', diff saved to https://phabricator.wikimedia.org/P28115 and previous config saved to /var/cache/conftool/dbconfig/20220519-112006-marostegui.json [11:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:13] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [11:20:24] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) Thanks @AlexisJazz for that suggestion. I think it might well help in terms of findi... [11:21:04] (03Abandoned) 10Btullis: Increase the connect_timeout for eventgate based services [deployment-charts] - 10https://gerrit.wikimedia.org/r/790289 (owner: 10Btullis) [11:21:34] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793399 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:21:49] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:22:37] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:22:44] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10jbond) >>! In T308013#7920581, @hashar wrote: > Before October 1st 2012, the code is my own and per my contract at the time: //"source code contributed as part of this con... 
[11:22:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P28116 and previous config saved to /var/cache/conftool/dbconfig/20220519-112245-ladsgroup.json [11:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:54] (03PS2) 10Slyngshede: WIP: Trial implementation of a private APT repo. [puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) [11:23:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet [11:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:20] Lucas_WMDE: hi, sorry to bother you again, but the train is still stuck on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FileImporter/+/793157 [11:24:24] do you know of anyone who can help move that patch forward? [11:28:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet [11:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:04] I think Thiemo, Adam Wight and Fisch were working on it, but it looks like they’re not in this channel at the moment [11:34:02] (03PS1) 10Jbond: CONTRIBUTORS: add delegated licence permision in repo [puppet] - 10https://gerrit.wikimedia.org/r/793436 [11:35:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P28117 and previous config saved to /var/cache/conftool/dbconfig/20220519-113511-marostegui.json [11:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:49] 10SRE, 10Icinga, 10Observability-Alerting, 10observability, and 3 others: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (10lmata) thanks for the follow-up, I agree with your assessment, and still an open 
risk, bumping scheduling. [11:36:04] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/793422 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [11:36:20] jbond: did you intend to +2 but not merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/793096? [11:36:33] (03CR) 10MVernon: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793422 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [11:36:39] (03PS1) 10Jbond: ublam_stats: update to use per module version [puppet] - 10https://gerrit.wikimedia.org/r/793438 [11:37:19] taavi: no will merge now :) [11:37:31] (03CR) 10jerkins-bot: [V: 04-1] ublam_stats: update to use per module version [puppet] - 10https://gerrit.wikimedia.org/r/793438 (owner: 10Jbond) [11:37:41] done [11:37:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P28118 and previous config saved to /var/cache/conftool/dbconfig/20220519-113750-ladsgroup.json [11:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:02] (03PS2) 10Jbond: blam_stats: update to use per module version [puppet] - 10https://gerrit.wikimedia.org/r/793438 [11:39:05] (03CR) 10jerkins-bot: [V: 04-1] blam_stats: update to use per module version [puppet] - 10https://gerrit.wikimedia.org/r/793438 (owner: 10Jbond) [11:39:10] (03CR) 10Jelto: [C: 03+2] idp: add gitlab-replica-new to idp [puppet] - 10https://gerrit.wikimedia.org/r/793409 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [11:39:52] (03PS3) 10Jbond: blam_stats: update to use per module version [puppet] - 10https://gerrit.wikimedia.org/r/793438 [11:42:38] (03PS1) 10KartikMistry: Enable ContentTranslation as default for cs, el, he, ko and tr WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793444 (https://phabricator.wikimedia.org/T298239) [11:47:04] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1122 (T298560)', diff saved to https://phabricator.wikimedia.org/P28119 and previous config saved to /var/cache/conftool/dbconfig/20220519-114703-ladsgroup.json [11:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:10] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [11:49:59] (03CR) 10Volans: [C: 03+1] "I agree with the idea that tracking this in the repo itself is much better for trackability and auditing compared to Phabricator." [puppet] - 10https://gerrit.wikimedia.org/r/793436 (owner: 10Jbond) [11:50:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P28120 and previous config saved to /var/cache/conftool/dbconfig/20220519-115016-marostegui.json [11:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:39] (03CR) 10Muehlenhoff: "One remaining comment inline, but looks good otherwise." 
[puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) (owner: 10Slyngshede) [11:52:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28121 and previous config saved to /var/cache/conftool/dbconfig/20220519-115255-ladsgroup.json [11:52:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [11:52:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [11:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:01] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:53:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28122 and previous config saved to /var/cache/conftool/dbconfig/20220519-115303-ladsgroup.json [11:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:07] (03PS1) 10Ammarpad: annualreport: update redirect to 2020-2021 report [puppet] - 10https://gerrit.wikimedia.org/r/793447 (https://phabricator.wikimedia.org/T308737) [11:54:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1013.eqiad.wmnet [11:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:46] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/793422 (https://phabricator.wikimedia.org/T307673) (owner: 10Marostegui) [11:59:14] !log Failover m5 master T307673 [11:59:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1013.eqiad.wmnet [12:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28123 and previous config saved to /var/cache/conftool/dbconfig/20220519-120209-ladsgroup.json [12:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:13] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:05:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298557)', diff saved to https://phabricator.wikimedia.org/P28124 and previous config saved to /var/cache/conftool/dbconfig/20220519-120521-marostegui.json [12:05:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2123.codfw.wmnet with reason: Maintenance [12:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2123.codfw.wmnet with reason: Maintenance [12:05:27] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:05:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 8 hosts with reason: Maintenance [12:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 8 hosts with reason: Maintenance [12:05:37] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:35] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [12:08:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1014.eqiad.wmnet [12:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28125 and previous config saved to /var/cache/conftool/dbconfig/20220519-120917-ladsgroup.json [12:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:22] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:09:45] hashar, jnuche: Adam suggested that disabling FileImporter/Exporter over the weekend might be an acceptable option to unblock the train [12:09:51] otherwise I don’t have much to offer, sorry [12:13:32] (03PS2) 10Jbond: CONTRIBUTORS: add delegated licence permision in repo [puppet] - 10https://gerrit.wikimedia.org/r/793436 [12:14:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1014.eqiad.wmnet [12:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:29] (03CR) 10Jbond: [C: 03+2] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/793398 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:15:41] (03PS2) 10Cathal Mooney: Adjust routing-options template for ASWs to enable ECMP always [homer/public] - 10https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) [12:15:46] Lucas_WMD: thanks for the feedback, the patch is simple revert, wouldn't it be simpler to have it merged? 
[12:15:59] (03CR) 10Jbond: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793397 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:16:15] (03CR) 10Jbond: [C: 03+1] bird/fastnetmon: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793398 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:16:21] (03CR) 10Jbond: [C: 03+1] striker: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793397 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:17:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28126 and previous config saved to /var/cache/conftool/dbconfig/20220519-121714-ladsgroup.json [12:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:44] (03CR) 10Jbond: [C: 03+1] "LGTm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/793417 (owner: 10Giuseppe Lavagetto) [12:18:03] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:19:23] (03CR) 10Muehlenhoff: "Looks good, some suggestions inline." [puppet] - 10https://gerrit.wikimedia.org/r/793436 (owner: 10Jbond) [12:20:32] (03PS3) 10Slyngshede: WIP: Trial implementation of a private APT repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) [12:21:54] (03PS1) 10Cathal Mooney: Adjust custom-vrf template to support devices with 32-bit ASNs [homer/public] - 10https://gerrit.wikimedia.org/r/793449 (https://phabricator.wikimedia.org/T304989) [12:22:55] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:23:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:23:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5002.eqsin.wmnet [12:23:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:29] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:03] (03CR) 10Cathal Mooney: [C: 03+2] Adjust custom-vrf template to support devices with 32-bit ASNs [homer/public] - 10https://gerrit.wikimedia.org/r/793449 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [12:24:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P28127 and previous config saved to /var/cache/conftool/dbconfig/20220519-122422-ladsgroup.json [12:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:48] (03Merged) 10jenkins-bot: Adjust custom-vrf template to support devices with 32-bit ASNs [homer/public] - 
10https://gerrit.wikimedia.org/r/793449 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [12:27:02] (03CR) 10Jaime Nuche: "Hi, thank you for the patch." [extensions/FileImporter] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793157 (https://phabricator.wikimedia.org/T308691) (owner: 10Thiemo Kreuz (WMDE)) [12:29:05] (03PS1) 10Jbond: systemd::sysuser: add ability to managehome of user [puppet] - 10https://gerrit.wikimedia.org/r/793450 [12:29:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35402/console" [puppet] - 10https://gerrit.wikimedia.org/r/793450 (owner: 10Jbond) [12:30:09] (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd::sysuser: add ability to managehome of user [puppet] - 10https://gerrit.wikimedia.org/r/793450 (owner: 10Jbond) [12:30:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:32:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298560)', diff saved to https://phabricator.wikimedia.org/P28128 and previous config saved to /var/cache/conftool/dbconfig/20220519-123219-ladsgroup.json [12:32:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1129.eqiad.wmnet with reason: Maintenance [12:32:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1129.eqiad.wmnet with reason: Maintenance [12:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:25] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [12:32:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298560)', diff saved to 
https://phabricator.wikimedia.org/P28129 and previous config saved to /var/cache/conftool/dbconfig/20220519-123227-ladsgroup.json [12:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:27] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:51] ^ gitlab alerts are expected [12:36:38] (03CR) 10Muehlenhoff: WIP: Trial implementation of a private APT repo. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) (owner: 10Slyngshede) [12:36:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5002.eqsin.wmnet [12:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:52] !log dbmaint s1@eqiad T300775 [12:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:57] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [12:39:10] !log root@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5002.eqsin.wmnet to ganeti01.svc.eqsin.wmnet [12:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:24] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Revert "Fix bogus user object creation in WikiRevisionFactory" (031 comment) [extensions/FileImporter] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793157 (https://phabricator.wikimedia.org/T308691) (owner: 10Thiemo Kreuz (WMDE)) [12:39:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P28130 and 
previous config saved to /var/cache/conftool/dbconfig/20220519-123927-ladsgroup.json [12:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:37] (03PS1) 10Cathal Mooney: Rename 'cr-loopback' policy definition file to 'common-loopback' [homer/public] - 10https://gerrit.wikimedia.org/r/793451 [12:40:30] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5002.eqsin.wmnet to ganeti01.svc.eqsin.wmnet [12:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:02] tgr|away: if you are around, I think we will backport Thiemo's patch for FileImporter [12:42:32] (03CR) 10Jbond: [C: 03+1] "LGTM but see comment" [puppet] - 10https://gerrit.wikimedia.org/r/793418 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto) [12:42:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1015.eqiad.wmnet [12:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:50] hashar: o/ [12:42:59] (03CR) 10Slyngshede: WIP: Trial implementation of a private APT repo. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) (owner: 10Slyngshede) [12:43:21] tgr: I don't know anything about MediaWiki user handling but Thiemo's patch sounds sane [12:43:45] (03PS4) 10Slyngshede: WIP: Trial implementation of a private APT repo. [puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) [12:43:48] LGTM [12:43:50] worst case is we break FileImporter [12:43:58] (03CR) 10Slyngshede: WIP: Trial implementation of a private APT repo. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) (owner: 10Slyngshede) [12:44:00] then given it is already broken, it is more or less a noop :] [12:44:08] jnuche: let's backport!
[12:44:48] ok [12:44:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:44:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298557)', diff saved to https://phabricator.wikimedia.org/P28131 and previous config saved to /var/cache/conftool/dbconfig/20220519-124456-marostegui.json [12:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:02] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:45:07] so in short CR+2 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FileImporter/+/793415 [12:45:17] and cherry pick it to 1.39.0-wmf.12 using Gerrit [12:45:23] then CR+2 that one [12:45:30] and hopefully CI will be merging both changes [12:45:52] then it is the usual extension backport deployment which you might not be familiar with but I am here to guide :] [12:46:06] (03CR) 10Cathal Mooney: [C: 03+2] Rename 'cr-loopback' policy definition file to 'common-loopback' [homer/public] - 10https://gerrit.wikimedia.org/r/793451 (owner: 10Cathal Mooney) [12:46:57] (03Merged) 10jenkins-bot: Rename 'cr-loopback' policy definition file to 'common-loopback' [homer/public] - 10https://gerrit.wikimedia.org/r/793451 (owner: 10Cathal Mooney) [12:47:50] hashar: there's actually two different patches, call?
[12:48:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1015.eqiad.wmnet [12:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:06] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for Tsevener - https://phabricator.wikimedia.org/T308616 (10Ottomata) Yes, in order to access datasets via Presto (which usually are in Hive on Hadoop), the user needs analytics-privatedata-user (no ssh or kerberos needed for access just via Superset dashboa... [12:49:38] (03PS3) 10Jbond: CONTRIBUTORS: add delegated licence permision in repo [puppet] - 10https://gerrit.wikimedia.org/r/793436 (https://phabricator.wikimedia.org/T308013) [12:49:40] (03CR) 10Jbond: "great updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/793436 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [12:50:11] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:50:46] hashar: if you want to be risk-averse, you can also just do the revert, which is also around somewhere [12:51:05] (03CR) 10Jbond: [C: 03+2] P:wikidough: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/793421 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [12:51:29] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FileImporter/+/793157 [12:52:23] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:52:45] (03CR) 10Jbond: [C: 03+1] "LGTM <3" [puppet] - 10https://gerrit.wikimedia.org/r/793430 (https://phabricator.wikimedia.org/T308639) (owner: 10Majavah) [12:52:49] (03CR) 10Jbond: [C: 03+2] replace wmflib's ensuremounted with stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/793430 (https://phabricator.wikimedia.org/T308639) (owner: 10Majavah) [12:53:37] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Remove legacy functions - https://phabricator.wikimedia.org/T308639 (10jbond) [12:54:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T303603)', diff saved to https://phabricator.wikimedia.org/P28133 and previous config saved to /var/cache/conftool/dbconfig/20220519-125434-ladsgroup.json [12:54:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [12:54:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [12:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:39] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Remove legacy functions - https://phabricator.wikimedia.org/T308639 (10jbond) [12:54:40] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:54:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T303603)', diff saved to https://phabricator.wikimedia.org/P28134 and previous config saved to /var/cache/conftool/dbconfig/20220519-125442-ladsgroup.json [12:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:12] (03CR) 10Jbond: [C: 03+2] blam_stats: update to use per module version [puppet] - 10https://gerrit.wikimedia.org/r/793438 (owner: 10Jbond) [12:55:43] tgr: yeah we will deploy the revert [12:55:44] (03PS1) 10Marostegui: data.yaml: Add tsev to analytics-private-users [puppet] - 10https://gerrit.wikimedia.org/r/793455 (https://phabricator.wikimedia.org/T308616) [12:57:06] I got confused [12:57:11] PROBLEM - Disk space on gitlab1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): /tmp 0 MB (0% inode=96%): /var/tmp 0 MB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1003&var-datasource=eqiad+prometheus/ops [12:57:11] and summarized on the task [12:57:17] (03CR) 10Jaime Nuche: [C: 03+2] Revert "Fix bogus user object creation in WikiRevisionFactory" [extensions/FileImporter] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793157 (https://phabricator.wikimedia.org/T308691) (owner: 10Thiemo Kreuz (WMDE)) [12:57:26] (03CR) 10Hashar: [C: 03+1] Revert "Fix bogus user object creation in WikiRevisionFactory" [extensions/FileImporter] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793157 (https://phabricator.wikimedia.org/T308691) (owner: 10Thiemo Kreuz (WMDE)) [12:57:34] (03PS3) 10JMeybohm: Add debian directory [debs/kubeconform] - 10https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) [12:58:35] (03CR) 10JMeybohm: Add debian directory (031 comment) [debs/kubeconform] - 10https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [12:58:37] (03CR) 10Muehlenhoff: [C: 03+1] "Two final typos, otherwise LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/793436 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [12:59:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35403/console" [puppet] - 10https://gerrit.wikimedia.org/r/793110 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T1300). [13:00:05] koi and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] (03CR) 10Jaime Nuche: [C: 03+2] Revert "Fix bogus user object creation in WikiRevisionFactory" (031 comment) [extensions/FileImporter] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793157 (https://phabricator.wikimedia.org/T308691) (owner: 10Thiemo Kreuz (WMDE)) [13:00:24] (03PS4) 10Jbond: CONTRIBUTORS: add delegated licence permision in repo [puppet] - 10https://gerrit.wikimedia.org/r/793436 (https://phabricator.wikimedia.org/T308013) [13:00:48] present [13:00:50] (03CR) 10Jbond: "Fixed thanks" [puppet] - 10https://gerrit.wikimedia.org/r/793436 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [13:01:14] (03CR) 10Jbond: CONTRIBUTORS: add delegated licence permision in repo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/793436 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [13:01:18] (03CR) 10Jbond: [C: 03+2] CONTRIBUTORS: add delegated licence permision in repo [puppet] - 10https://gerrit.wikimedia.org/r/793436 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [13:04:53] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:04:56] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) [13:08:25] (03PS1) 10Majavah: httpbb: fix 'Unknown variable' warning on beta [puppet] - 10https://gerrit.wikimedia.org/r/793456 [13:09:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/793455 (https://phabricator.wikimedia.org/T308616) (owner: 10Marostegui) [13:09:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1016.eqiad.wmnet [13:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:30] (03CR) 10Jbond: [V: 03+1 C: 03+2] nitcracker: remove :nutcracker_pools function as its unused [puppet] - 10https://gerrit.wikimedia.org/r/793110 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [13:09:58] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Remove legacy functions - https://phabricator.wikimedia.org/T308639 (10jbond) [13:11:05] (03CR) 10Jbond: [C: 03+2] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/793456 (owner: 10Majavah) [13:11:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298557)', diff saved to https://phabricator.wikimedia.org/P28135 and previous config saved to /var/cache/conftool/dbconfig/20220519-131108-marostegui.json [13:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:14] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [13:11:20] (03CR) 10Ssingh: [C: 03+1] "Thank you very much for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/793400 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:12:31] !deploy is https://deploy-commands.toolforge.org/bacc/$1 [13:12:31] Sorry, you are not authorized to perform this [13:12:39] oh no :-D [13:13:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1016.eqiad.wmnet [13:13:46] (03CR) 10Majavah: "Could we merge this so it doesn't sit as a local cherry-pick indefinitely?" [puppet] - 10https://gerrit.wikimedia.org/r/774518 (https://phabricator.wikimedia.org/T301638) (owner: 10Dave Pifke) [13:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:12] (03CR) 10Ssingh: [C: 03+2] dnsdist: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793400 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:15:13] hmm, so is there anyone could deploy in this window [13:15:57] (03Merged) 10jenkins-bot: Revert "Fix bogus user object creation in WikiRevisionFactory" [extensions/FileImporter] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793157 (https://phabricator.wikimedia.org/T308691) (owner: 10Thiemo Kreuz (WMDE)) [13:16:22] (03CR) 10Jbond: [C: 03+2] "merging" [puppet] - 10https://gerrit.wikimedia.org/r/774518 (https://phabricator.wikimedia.org/T301638) (owner: 10Dave Pifke) [13:16:28] (03PS3) 10Jbond: deployment-prep: re-point to new bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/774518 (https://phabricator.wikimedia.org/T301638) (owner: 10Dave Pifke) [13:16:34] (03CR) 10Jbond: [V: 03+2] deployment-prep: re-point to new bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/774518 (https://phabricator.wikimedia.org/T301638) (owner: 10Dave Pifke) [13:18:23] RECOVERY - Disk space on gitlab1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1003&var-datasource=eqiad+prometheus/ops 
[13:21:32] (03CR) 10Ssingh: [C: 03+1] dnsdist: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793400 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:21:51] !log jnuche@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/FileImporter/src/Services/WikiRevisionFactory.php: Backport: [[gerrit:793157|Revert "Fix bogus user object creation in WikiRevisionFactory" (T308691)]] (duration: 00m 53s) [13:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:57] T308691: Fatal exception of type "CannotCreateActorException" when trying to export file from zhwikibooks to commons - https://phabricator.wikimedia.org/T308691 [13:23:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:24:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:08] koi: hello jnuche deployed a fix for file import which you filed yesterday https://phabricator.wikimedia.org/T308691 [13:25:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:35] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:25:53] ack, thanks for your hard work! 
[13:26:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P28136 and previous config saved to /var/cache/conftool/dbconfig/20220519-132614-marostegui.json [13:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:18] koi: please let us know in case the issue still pops up :) [13:26:36] testing.. [13:28:29] jnuche and hashar, replied, seems still some issue [13:29:28] nice [13:29:33] we should check the server logs [13:31:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet [13:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:49] (03PS1) 10Majavah: P:monitoring: cleanup nrpe scripts [puppet] - 10https://gerrit.wikimedia.org/r/793464 [13:33:18] koi: fun thing, I don't see anything in the error logs :-\ [13:33:30] (03CR) 10Muehlenhoff: [C: 03+1] data.yaml: Add tsev to analytics-private-users [puppet] - 10https://gerrit.wikimedia.org/r/793455 (https://phabricator.wikimedia.org/T308616) (owner: 10Marostegui) [13:34:18] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35404/console" [puppet] - 10https://gerrit.wikimedia.org/r/793464 (owner: 10Majavah) [13:34:28] (03PS2) 10Giuseppe Lavagetto: deployment_server: switch the deployment group to 'deployment' [puppet] - 10https://gerrit.wikimedia.org/r/793416 (https://phabricator.wikimedia.org/T303857) [13:34:30] (03PS1) 10Giuseppe Lavagetto: deployment_server: better separate resources between scap 2 and 3 [puppet] - 10https://gerrit.wikimedia.org/r/793465 [13:35:02] (03CR) 10Muehlenhoff: WIP: Trial implementation of a private APT repo. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) (owner: 10Slyngshede) [13:35:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet [13:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:41] (03PS5) 10Slyngshede: WIP: Trial implementation of a private APT repo. [puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) [13:36:49] (03CR) 10Slyngshede: WIP: Trial implementation of a private APT repo. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) (owner: 10Slyngshede) [13:38:01] ah it says ImportException: File already on wiki [13:38:02] bah [13:39:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793036 (https://phabricator.wikimedia.org/T308027) (owner: 10Slyngshede) [13:39:48] lol, TNT imported that file yesterday [13:39:55] (03PS11) 10Hokwelum: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 [13:40:01] didn't notice that [13:40:12] aahaha [13:40:20] then to be fair the user reported error should be nicer [13:40:26] (03PS2) 10Giuseppe Lavagetto: deployment_server: better separate resources between scap 2 and 3 [puppet] - 10https://gerrit.wikimedia.org/r/793465 [13:40:28] (03PS3) 10Giuseppe Lavagetto: deployment_server: switch the deployment group to 'deployment' [puppet] - 10https://gerrit.wikimedia.org/r/793416 (https://phabricator.wikimedia.org/T303857) [13:41:09] yeah and I couldn't reproduce the error message in my screenshot anymore [13:41:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P28137 and previous config saved to /var/cache/conftool/dbconfig/20220519-134119-marostegui.json [13:41:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:49] (03CR) 10Marostegui: [C: 03+2] data.yaml: Add tsev to analytics-private-users [puppet] - 10https://gerrit.wikimedia.org/r/793455 (https://phabricator.wikimedia.org/T308616) (owner: 10Marostegui) [13:41:53] koi: and a ImportException: This page has been protected to prevent editing or other actions. [13:42:12] (03PS3) 10Volans: zone_validator: include Netbox data in the check [dns] - 10https://gerrit.wikimedia.org/r/793057 (https://phabricator.wikimedia.org/T155761) [13:42:14] (03PS1) 10Volans: zone_validator: fix reported line number [dns] - 10https://gerrit.wikimedia.org/r/793466 [13:42:16] (03PS1) 10Volans: zone_validator: fix asset tag matching [dns] - 10https://gerrit.wikimedia.org/r/793467 (https://phabricator.wikimedia.org/T155761) [13:42:18] (03PS1) 10Volans: zone_validator: add new zonefiles [dns] - 10https://gerrit.wikimedia.org/r/793468 (https://phabricator.wikimedia.org/T155761) [13:42:20] (03PS1) 10Volans: zone_validator: improve output of reported issues [dns] - 10https://gerrit.wikimedia.org/r/793469 (https://phabricator.wikimedia.org/T155761) [13:42:22] (03PS1) 10Volans: zone_validator: add support for @ records [dns] - 10https://gerrit.wikimedia.org/r/793470 (https://phabricator.wikimedia.org/T155761) [13:42:24] (03PS1) 10Volans: zone_validator: fix inline ignore errors logic [dns] - 10https://gerrit.wikimedia.org/r/793471 (https://phabricator.wikimedia.org/T155761) [13:42:26] (03PS1) 10Volans: zone_validator: convert format() and + to f-string [dns] - 10https://gerrit.wikimedia.org/r/793472 (https://phabricator.wikimedia.org/T155761) [13:42:28] (03PS1) 10Volans: zone_validator: simplify ignore of multiple issues [dns] - 10https://gerrit.wikimedia.org/r/793473 (https://phabricator.wikimedia.org/T155761) [13:42:34] you mean the file on zhwikibooks? 
I think that's not an issue for *import* [13:42:37] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to `wmf` for Tsevener - https://phabricator.wikimedia.org/T308616 (10Marostegui) Added to analytics-privatedata-user, please allow 30 minutes for puppet to run everywhere. [13:43:22] koi: I pasted the messages at https://phabricator.wikimedia.org/T308691#7941703 [13:44:37] I could not see the first message :( [13:44:46] tried several times [13:45:07] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35406/console" [puppet] - 10https://gerrit.wikimedia.org/r/793416 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto) [13:45:09] maybe they are only on the server side [13:45:56] anyway, is it still ok to deploy a patch (the votewiki one)? little bit urgent [13:50:08] (03PS1) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [13:50:11] (03PS1) 10Jaime Nuche: all wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793476 [13:50:13] (03CR) 10Jaime Nuche: [C: 03+2] all wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793476 (owner: 10Jaime Nuche) [13:50:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793464 (owner: 10Majavah) [13:50:28] (03CR) 10Jbond: [C: 03+2] P:monitoring: cleanup nrpe scripts [puppet] - 10https://gerrit.wikimedia.org/r/793464 (owner: 10Majavah) [13:50:52] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:51:12] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/793476 (owner: 10Jaime Nuche) [13:52:20] jouncebot: next [13:52:21] In 2 hour(s) and 7 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T1600) [13:52:24] jouncebot: now [13:52:25] For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T1300) [13:52:48] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.12 refs T305218 [13:52:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793056 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [13:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:54] T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218 [13:53:03] koi: sorry I missed your patch to the deployment window [13:53:10] haven't seen them :-\ [13:53:12] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/793427 (owner: 10Jbond) [13:53:14] 0 0 [13:53:29] jnuche is running the train right now [13:53:37] we should deploy [config] 791797 (deploy commands) votewiki: Change wgLanguageCode to zh for May 2022 zhwiki admin election task T308397 [13:53:37] T308397: Carry out an admin election of zhwiki on votewiki (May 2022) - https://phabricator.wikimedia.org/T308397 [13:53:40] is that correct? 
[13:53:46] yeah [13:53:53] will do that after the train [13:54:04] koi: sorry about that, I missed it too [13:54:16] train is done, you can go ahead with the backport [13:54:26] oh thanks [13:54:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T303603)', diff saved to https://phabricator.wikimedia.org/P28138 and previous config saved to /var/cache/conftool/dbconfig/20220519-135456-ladsgroup.json [13:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:04] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:55:13] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+1] "The changes of permissions look correct to me. I will still have to chown the content of the directories we're switching to this new model" [puppet] - 10https://gerrit.wikimedia.org/r/793416 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto) [13:55:15] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793466 (owner: 10Volans) [13:55:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet [13:55:27] doing it now [13:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:32] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793467 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [13:55:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:55:36] (03PS2) 10Hashar: votewiki: Change wgLanguageCode to zh for May 2022 zhwiki admin election [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791797 (https://phabricator.wikimedia.org/T308397) (owner: 10Stang) [13:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:55:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:52] (03CR) 10Hashar: [C: 03+2] "rebase to clear conflict" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791797 (https://phabricator.wikimedia.org/T308397) (owner: 10Stang) [13:55:54] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793468 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [13:56:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298557)', diff saved to https://phabricator.wikimedia.org/P28139 and previous config saved to /var/cache/conftool/dbconfig/20220519-135624-marostegui.json [13:56:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance [13:56:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance [13:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:30] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [13:56:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298557)', diff saved to https://phabricator.wikimedia.org/P28140 and previous config saved to /var/cache/conftool/dbconfig/20220519-135632-marostegui.json [13:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:40] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793469 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [13:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:43] (03Merged) 10jenkins-bot: votewiki: Change wgLanguageCode to zh for May 2022 zhwiki admin election [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791797 
(https://phabricator.wikimedia.org/T308397) (owner: 10Stang) [13:57:03] PROBLEM - Host kubestagetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:03] PROBLEM - Host kubetcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:22] (03PS2) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [13:57:24] ^ these are caused by the ganeti1018 reboot and have no user-visible impact [13:57:29] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01432 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:57:46] koi: I have deployed the patch to mwdebug1001 and confirmed https://vote.wikimedia.org/wiki/Main_Page switched to what is apparently Chinese characters [13:58:01] yeah, LGTM [13:58:29] (03CR) 10Jcrespo: "Removing reviewers for now- this code is not ready for review yet." [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:58:54] !log hashar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791797|votewiki: Change wgLanguageCode to zh for May 2022 zhwiki admin election (T308397)]] (duration: 00m 52s) [13:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:59] T308397: Carry out an admin election of zhwiki on votewiki (May 2022) - https://phabricator.wikimedia.org/T308397 [13:59:04] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [13:59:05] (03CR) 10Jbond: [C: 03+1] "LGTM nit in the commit msg" [dns] - 10https://gerrit.wikimedia.org/r/793470 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [13:59:10] (03CR) 10Andrew Bogott: [C: 03+2] striker: update
codfw1dev openstack endpoint name [puppet] - 10https://gerrit.wikimedia.org/r/791456 (owner: 10BryanDavis) [13:59:48] koi: then the site redirects to the main page https://vote.wikimedia.org/wiki/%E9%A6%96%E9%A1%B5 首页 which does not exist [13:59:56] jbond: the puppet failures could be related to the conversion of some plugins [14:00:07] not sure if transient/expected [14:00:11] PROBLEM - LVS mwdebug eqiad port 4444/tcp - mwdebug- mwdebug.svc.eqiad.wmnet IPv4 on mwdebug.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 940 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:00:24] that's ok, I believe nobody will actually view the Main Page :) [14:00:31] RECOVERY - Host kubetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [14:00:33] RECOVERY - Host kubestagetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [14:00:43] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:49] we just use it to vote (in Special:SecurePoll) [14:01:04] oh that is for SecurePoll!
[14:01:08] so yeah not an issue ;) [14:01:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet [14:01:20] (03PS3) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/793471 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [14:01:36] koi: for the logo files on zhwikiquote I am not familiar with those at all :-\ [14:02:00] hashar: that's fine, I will move them to next window [14:02:17] (03CR) 10David Caro: openstack,admin_script: ran black and isort (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: 10David Caro) [14:02:27] I think that will later tonight or else on monday [14:02:55] tgr: train done, should we do your GrowthExperiments backport? [config] 793395 (deploy commands) GrothExperiments: Enable Add Link frontend on tier 3 wikis [14:03:33] (03PS2) 10Volans: zone_validator: add support for @ records [dns] - 10https://gerrit.wikimedia.org/r/793470 (https://phabricator.wikimedia.org/T155761) [14:03:34] yeah, thanks. I can also do the deployments if you prefer. 
[14:03:48] please do :] [14:04:03] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [14:04:09] (03CR) 10Volans: "addressed comment" [dns] - 10https://gerrit.wikimedia.org/r/793470 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [14:04:13] PROBLEM - LVS mwdebug codfw port 4444/tcp - mwdebug- mwdebug.svc.codfw.wmnet IPv4 on mwdebug.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 18237 bytes in 1.283 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:05:59] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:06:04] (KubernetesRsyslogDown) firing: rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:06:08] (KubernetesRsyslogDown) firing: (3) rsyslog on kubernetes1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:06:27] RECOVERY - LVS mwdebug codfw port 4444/tcp - mwdebug- mwdebug.svc.codfw.wmnet IPv4 on mwdebug.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 17914 bytes in 1.254 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:06:56] <_joe_> uh what happened to mwdebug? [14:07:16] <_joe_> the kubernetes mw installation returned internal errors [14:07:38] koi: do you want the remaining patches deployed? 
[14:07:39] one of the etcd backends got restarted [14:07:55] but that *should* ofc have no effect [14:07:56] <_joe_> jayme: but one of the etcd for k8s [14:08:08] <_joe_> this was an app-level error [14:08:11] yeah [14:08:14] tgr, definitely if you have time :) [14:08:21] <_joe_> 500 Internal Server Error [14:08:25] <_joe_> let's go see in logstash [14:08:26] just mentioning because of weird coincidence [14:08:39] the votewiki one was deployed, right? [14:08:43] yes [14:08:47] yeah [14:09:03] votewiki had $wgLanguageCode changed from en to zh [14:09:23] and we have pushed wmf.12 to all wikis before that [14:09:32] (03CR) 10Jbond: [C: 03+1] "LGTM inline, nit/comment" [dns] - 10https://gerrit.wikimedia.org/r/793057 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [14:09:46] !log systemctl restart rsyslog on kubernetes1011,kubestage1003 [14:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P28141 and previous config saved to /var/cache/conftool/dbconfig/20220519-141001-ladsgroup.json [14:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:56] <_joe_> https://logstash.wikimedia.org/goto/dc6e1e7b3774f84752475c12251bb485 [14:10:58] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:10:58] (KubernetesRsyslogDown) firing: (3) rsyslog on kubernetes1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:11:01] <_joe_> Amir1: ^^ [14:11:05] koi: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/792748 is a no-op, right?
[14:11:22] <_joe_> ahhh nevermind [14:11:31] <_joe_> jayme: can you check the deploy to mwdebug is working? [14:12:02] tgr: yeah, just comment and yaml update (no actually effect) [14:12:19] (03CR) 10Gergő Tisza: [C: 03+2] zhwikiquote: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792748 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [14:12:24] _joe_: yeah [14:13:05] (03Merged) 10jenkins-bot: zhwikiquote: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792748 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [14:13:07] (03PS4) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:13:52] (03PS2) 10Gergő Tisza: zhwikiquote: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793119 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [14:14:06] (03CR) 10Gergő Tisza: [C: 03+2] zhwikiquote: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793119 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [14:14:20] <_joe_> hashar: sorry, what version should enwiki be on now? 
[14:14:50] (03Merged) 10jenkins-bot: zhwikiquote: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793119 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [14:14:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298557)', diff saved to https://phabricator.wikimedia.org/P28142 and previous config saved to /var/cache/conftool/dbconfig/20220519-141453-marostegui.json [14:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:59] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [14:15:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/793472 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [14:15:12] koi: do you want to test the logo changes? [14:15:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1019.eqiad.wmnet [14:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:20] yeah [14:15:39] _joe_: failed at 2022-05-19T13:55:48.648535 [14:15:59] <_joe_> jayme: uhm [14:16:03] _joe_: that should clear up on its own :/ [14:16:24] (03PS1) 10Volans: dns: convert format() to f-strings [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/793483 [14:16:25] (03PS1) 10Volans: dns: add a comment for skipped PTR [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/793484 (https://phabricator.wikimedia.org/T155761) [14:16:27] (03PS5) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:16:38] _joe_: 1.39.0-wmf.12 everywhere [14:16:39] _joe_: that's weird it's causing this many exceptions [14:16:40] <_joe_> Amir1: it won't, the deployment to k8s failed [14:17:01] <_joe_> let me try to deploy manually [14:17:29] (03CR) 10Jbond: [C: 03+1] zone_validator: 
simplify ignore of multiple issues [dns] - 10https://gerrit.wikimedia.org/r/793473 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [14:17:34] koi: it's on mwdebug1001 [14:17:40] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:17:41] looking [14:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:46] _joe_: I think it failed during diff [14:17:57] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=GET https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:17:59] probably because the apiserver closed connection [14:18:06] <_joe_> ahh yes [14:18:09] because of the etcd reboot (circle closed) [14:18:10] <_joe_> so in a weird way [14:18:17] <_joe_> yes it had to do with etcd lol [14:18:35] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:42] tgr: LGTM [14:20:06] !log tgr@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:793119|zhwikiquote: Optimize logo per commons files (T308620)]] (duration: 00m 50s) [14:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:12] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [14:20:25] RECOVERY - LVS mwdebug eqiad port 4444/tcp - mwdebug- mwdebug.svc.eqiad.wmnet IPv4 on mwdebug.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 17796 bytes in 1.141 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:21:06] _joe_: so it's all good? 
Or I broke things again [14:21:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1019.eqiad.wmnet [14:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:39] <_joe_> no you broke nothing [14:21:41] <_joe_> moritzm did [14:21:44] <_joe_> :P [14:21:59] phew [14:22:07] I love it when I'm not breaking stuff [14:22:27] koi: should be live [14:22:29] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:22:41] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:49] (03CR) 10Gergő Tisza: [C: 03+2] GrothExperiments: Enable Add Link frontend on tier 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793395 (https://phabricator.wikimedia.org/T304542) (owner: 10Gergő Tisza) [14:22:58] confirmed and thanks [14:23:29] and I'm going to break more! [14:23:37] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:02] <_joe_> moritzm: <3 [14:24:21] tgr: I'm not sure but no-op patch still need a sync? 
[14:24:23] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004604 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:24:24] <_joe_> but yeah, we can't lose etcd for k8s during mediawiki deployment it seems [14:24:40] <_joe_> jouncebot: next [14:24:40] In 1 hour(s) and 35 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T1600) [14:25:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P28143 and previous config saved to /var/cache/conftool/dbconfig/20220519-142507-ladsgroup.json [14:25:11] <_joe_> jayme: did you remove the error lock for mwdebug? [14:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:17] _joe_: nope [14:25:43] assumed you has because you ran manually [14:25:47] *you had [14:25:49] no sync means those comments will be missing from the version of the code deployed on the appservers. Doesn't really make a difference. 
[14:26:45] (03PS2) 10Gergő Tisza: GrothExperiments: Enable Add Link frontend on tier 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793395 (https://phabricator.wikimedia.org/T304542) [14:26:47] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:27:04] <_joe_> topranks, XioNoX ^^ [14:27:04] (03CR) 10Gergő Tisza: [C: 03+2] GrothExperiments: Enable Add Link frontend on tier 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793395 (https://phabricator.wikimedia.org/T304542) (owner: 10Gergő Tisza) [14:27:13] (03PS3) 10Jbond: labs - puppet_alert.py: Update script to output last log messages [puppet] - 10https://gerrit.wikimedia.org/r/793427 [14:27:33] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/793427 (owner: 10Jbond) [14:28:13] (03Merged) 10jenkins-bot: GrothExperiments: Enable Add Link frontend on tier 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793395 (https://phabricator.wikimedia.org/T304542) (owner: 10Gergő Tisza) [14:29:48] _joe_: should I remove the lock (as it's still there)? 
:) [14:30:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubernetes1019:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:31:02] (KubernetesRsyslogDown) resolved: rsyslog on kubestage1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:32:01] <_joe_> jayme: yes please :) [14:32:41] (03PS6) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:32:58] _joe_: ack [14:33:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/793483 (owner: 10Volans) [14:33:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:23] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/793484 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [14:34:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1020.eqiad.wmnet [14:34:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:34:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:49] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:54] (03PS7) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:36:48] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793395|GrothExperiments: Enable Add Link frontend on tier 3 wikis (T304542)]] (duration: 00m 50s) [14:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:53] T304542: Deploy "add a link" to third round of wikis - https://phabricator.wikimedia.org/T304542 [14:36:57] !log EU mid-day deploys done [14:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:06] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [14:37:46] (03CR) 10Hokwelum: Add Clarkson university host to list of dumps mirrors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793085 (owner: 10Hokwelum) [14:38:45] (03PS8) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [14:40:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T303603)', diff saved to https://phabricator.wikimedia.org/P28144 and previous config saved to /var/cache/conftool/dbconfig/20220519-144013-ladsgroup.json [14:40:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [14:40:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: 
Maintenance [14:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:20] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:40:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T303603)', diff saved to https://phabricator.wikimedia.org/P28145 and previous config saved to /var/cache/conftool/dbconfig/20220519-144021-ladsgroup.json [14:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:38] (03PS4) 10Volans: zone_validator: include Netbox data in the check [dns] - 10https://gerrit.wikimedia.org/r/793057 (https://phabricator.wikimedia.org/T155761) [14:40:40] (03PS2) 10Volans: zone_validator: convert format() and + to f-string [dns] - 10https://gerrit.wikimedia.org/r/793472 (https://phabricator.wikimedia.org/T155761) [14:40:42] (03PS2) 10Volans: zone_validator: simplify ignore of multiple issues [dns] - 10https://gerrit.wikimedia.org/r/793473 (https://phabricator.wikimedia.org/T155761) [14:41:26] _joe_: thanks, seems that BGP blipped to doh1002 and durum1001 [14:41:36] (03CR) 10Volans: "addressed comment" [dns] - 10https://gerrit.wikimedia.org/r/793057 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [14:41:54] came back a short time after, unfortunately as that check is typically in "warning" status, due to external peers who usually at least 1 is down, the recovery doesn't show here [14:42:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server: better separate resources between scap 2 and 3 [puppet] - 10https://gerrit.wikimedia.org/r/793465 (owner: 10Giuseppe Lavagetto) [14:42:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1020.eqiad.wmnet [14:42:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:36] sukhe: FYI (I still suspect BFD is possibly a little aggressive) [14:42:44] I checked they are not running on the same Ganeti host. [14:43:01] They are both in row D, so some common links for the traffic. [14:43:07] nothing major to worry about I think. [14:45:17] (03PS1) 10Cathal Mooney: Ammend cloudsw-loopback filter to allow BGP in VRF [homer/public] - 10https://gerrit.wikimedia.org/r/793489 (https://phabricator.wikimedia.org/T304989) [14:47:23] topranks: thanks for checking as always. I looked at it the other day but couldn't find anything specific to pin point the issue [14:47:54] Yeah, I had been suspecting the hypervisor scheduling, VMs being paused for brief times upsetting BFD. [14:48:02] I think that's probably not the case given these are on separate hosts. [14:48:25] It's probably still due to BFD missed packets, but more likely to be drops on the row uplinks from D to the CR [14:49:14] Which is an issue we know we have, and ultimately we are addressing with newer kit, faster links on the network side. [14:49:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:49:24] ah [14:49:27] lol [14:49:30] oh, is this us again? [14:49:49] yeah the same two [14:49:57] oh yeah, it's us [14:50:19] (03CR) 10David Caro: [C: 03+2] "Neat!"
[puppet] - 10https://gerrit.wikimedia.org/r/793427 (owner: 10Jbond) [14:51:38] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] deployment_server: switch the deployment group to 'deployment' [puppet] - 10https://gerrit.wikimedia.org/r/793416 (https://phabricator.wikimedia.org/T303857) (owner: 10Giuseppe Lavagetto) [14:51:55] have been some drops on the asw2-d switches going out to cr2-eqiad over the past few mins [14:52:02] but nothing more than what we generally see [14:52:24] <_joe_> dcaro: lmk when you're done with puppet-merge [14:52:39] _joe_: hey, I was going to ping you xd, can I merge your change? [14:52:46] <_joe_> yes [14:52:49] 👍 [14:52:56] (03PS1) 10Muehlenhoff: Enable component/ganeti3 for the esams cluster [puppet] - 10https://gerrit.wikimedia.org/r/793491 (https://phabricator.wikimedia.org/T308238) [14:53:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/793057 (https://phabricator.wikimedia.org/T155761) (owner: 10Volans) [14:53:44] sukhe: not seeing any great signs of packet loss testing to durum1001 [14:53:45] https://phabricator.wikimedia.org/P28146 [14:54:18] thanks topranks <3 [14:54:26] Let's see how it goes, if it keeps flapping we might want to reduce the BFD (keepalive) timers [14:54:47] If not I'll mention perhaps doing that anyway to Arzhel when he is back in a weeks time [14:55:09] I am sure if you have looked already but on doh1002 for example, is the error message from bird helpful in any way? like what is "recieved unknown error 6.9"!? [14:55:35] I suspect that's the cause, and we won't have a full fix till we move to peering from VM -> top-of-rack switch, with that new hardware helping generally with tail drops also. 
[14:55:55] Might be helpful if we looked it up, on the face of it it doesn't tell me much :) [14:56:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T303603)', diff saved to https://phabricator.wikimedia.org/P28147 and previous config saved to /var/cache/conftool/dbconfig/20220519-145608-ladsgroup.json [14:56:13] actually the Juniper side shed's light on the number [14:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:15] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:56:15] May 19 14:48:48 re0.cr2-eqiad rpd[13141]: bgp_bfd_callback:161: NOTIFICATION sent to 2620:0:861:4:208:80:155:112 (External AS 64605): code 6 (Cease) subcode 9 (Hard Reset), Reason: BFD Session Down [14:56:17] yeah [14:56:23] It's always been dropped BFD when this happens [14:56:37] So basically router tears down session cos it hasn't got a keepalive for 900ms [14:58:58] So packets lost somewhere, be that on the network or on the hypervisor/vm cos the CPU can't keep up. 
[14:59:10] Given these are on separate hosts it suggests the former [14:59:26] yeah in a way it's good to have at least made that distinction :) [14:59:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1021.eqiad.wmnet [14:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:19] (03CR) 10Cathal Mooney: [C: 03+2] Ammend cloudsw-loopback filter to allow BGP in VRF [homer/public] - 10https://gerrit.wikimedia.org/r/793489 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [15:00:38] !log powerdown gerrit2002 for relocation [15:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:42] (03CR) 10Ahmon Dancy: [C: 03+1] scap: do not restart jobrunners on deployment [puppet] - 10https://gerrit.wikimedia.org/r/792980 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [15:00:51] <_joe_> !log oblivian@deploy2002:/srv/mediawiki-staging $ sudo find . -group wikidev -exec chgrp wikidev "{}" \; [15:00:53] (03CR) 10Ahmon Dancy: [C: 03+1] scap: enable restarting php-fpm on deployment [puppet] - 10https://gerrit.wikimedia.org/r/792981 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [15:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:40] (03Merged) 10jenkins-bot: Ammend cloudsw-loopback filter to allow BGP in VRF [homer/public] - 10https://gerrit.wikimedia.org/r/793489 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [15:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:31] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:59] 
PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:04:20] (03PS4) 10JMeybohm: Add debian directory [debs/kubeconform] - 10https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) [15:04:26] (03CR) 10Ahmon Dancy: [C: 03+1] mediawiki_canaries: disable opcache revalidation [puppet] - 10https://gerrit.wikimedia.org/r/792983 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [15:04:55] (03CR) 10Ahmon Dancy: [C: 03+1] mediawiki: disable revalidation everywhere [puppet] - 10https://gerrit.wikimedia.org/r/792984 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [15:05:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1021.eqiad.wmnet [15:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:53] (03PS12) 10ArielGlenn: Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 (owner: 10Hokwelum) [15:07:28] (03CR) 10ArielGlenn: [C: 03+2] Add Clarkson university host to list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/793085 (owner: 10Hokwelum) [15:07:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:42] (03CR) 10JMeybohm: [C: 04-1] Add helmfile configuration for image-suggestion (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [15:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:50] (03PS5) 10Hnowlan: Add helmfile configuration for image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) [15:11:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P28148 and previous config saved to /var/cache/conftool/dbconfig/20220519-151113-ladsgroup.json [15:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:29] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:12:42] (03CR) 10JMeybohm: [C: 03+1] "Looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [15:14:18] ^^^ I’ll ack that alert for cloudsw1-c8, just setting this peering up atm [15:14:51] <_joe_> !log deploy1002:/srv/mediawiki-staging $ find . -group wikidev -print0 | sudo xargs -0 -n 100 chgrp -h deployment -- [15:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:59] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: OK: UP: 1 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:18:43] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:19:09] !log oblivian@deploy1002 Synchronized README: null sync-file to verify the switch to the deployment group (duration: 00m 50s) [15:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:29] (03PS1) 10Cathal Mooney: Change new cloudsw in e4/f4 to use commmon-loopback filter [homer/public] - 10https://gerrit.wikimedia.org/r/793493 (https://phabricator.wikimedia.org/T304989) [15:20:50] _joe_: let me know when it's safe to scap (dumps) again [15:20:55] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti5003.eqsin.wmnet with reason: Remove from cluster for firmware update and eventual reimage [15:20:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti5003.eqsin.wmnet with reason: Remove from cluster for firmware update and eventual reimage [15:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:21] <_joe_> apergos: are you merging a patch too? [15:21:25] <_joe_> if so that would be great [15:21:27] XioNoX: hello, i can no longer run the provision a server script in codfw without providing a cable ID; when i leave it blank i am getting "Cable ID already assigned in codfw." [15:21:29] <_joe_> you'll be our guinea pig [15:21:37] patch was already merged earlier, sorry [15:21:57] I've pulled it to the deployment host a few moments ago, just need to send it around [15:22:09] and this is the dumps repo, not mw [15:22:37] ok to do it? 
[15:23:25] (03CR) 10Cathal Mooney: [C: 03+2] Change new cloudsw in e4/f4 to use commmon-loopback filter [homer/public] - 10https://gerrit.wikimedia.org/r/793493 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [15:23:36] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:57] _joe_: ? [15:24:03] (03Merged) 10jenkins-bot: Change new cloudsw in e4/f4 to use commmon-loopback filter [homer/public] - 10https://gerrit.wikimedia.org/r/793493 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [15:24:03] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH ganeti5003 is removed from the cluster and needs the same firmware/NIC updates as ganeti4* to enable the reimage to Bullseye. [15:24:19] <_joe_> apergos: yes [15:24:25] !log ariel@deploy1002 Started deploy [dumps/dumps@cd30939]: use dbgroupdefault for most jobs [15:24:25] <_joe_> nothing changed for that [15:24:28] (03PS5) 10JMeybohm: Add a rake task to generate JSON schema for chart CRDs on the fly [deployment-charts] - 10https://gerrit.wikimedia.org/r/792577 (https://phabricator.wikimedia.org/T306165) [15:24:29] !log ariel@deploy1002 Finished deploy [dumps/dumps@cd30939]: use dbgroupdefault for most jobs (duration: 00m 04s) [15:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:35] {{done}} ty [15:24:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298560)', diff saved to https://phabricator.wikimedia.org/P28149 and previous config saved to /var/cache/conftool/dbconfig/20220519-152457-ladsgroup.json [15:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:03] T298560: Fix 
mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [15:26:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P28150 and previous config saved to /var/cache/conftool/dbconfig/20220519-152618-ladsgroup.json [15:26:20] (03CR) 10JMeybohm: Add debian directory (031 comment) [debs/kubeconform] - 10https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [15:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:30] (03PS1) 10Ssingh: aptrepo: add a component for dnsdist/pdns-recursor for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/793495 (https://phabricator.wikimedia.org/T305589) [15:28:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:44] today is going to be a good day, I managed to type moritzm's name correctly in the reviewers [15:31:52] (03PS9) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [15:33:01] 10SRE, 10ops-codfw, 10CirrusSearch, 10DC-Ops, 10Discovery-Search: elastic2054 is having H/W issues - https://phabricator.wikimedia.org/T308647 (10Papaul) @Marostegui I don't see anything on my end as well. maybe just a temporary memory issue. Showing all 128G RAM on the server. We can close the task if w... [15:35:33] (03PS2) 10Jforrester: Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [15:35:51] 10SRE, 10ops-codfw, 10CirrusSearch, 10DC-Ops, 10Discovery-Search: elastic2054 is having H/W issues - https://phabricator.wikimedia.org/T308647 (10Marostegui) Thanks @Papaul! 
@dcausse up to you :-) [15:36:42] (03PS1) 10Jcrespo: dbbackups: Add django database password and secret key for pampinus [labs/private] - 10https://gerrit.wikimedia.org/r/793498 (https://phabricator.wikimedia.org/T283017) [15:36:59] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] dbbackups: Add django database password and secret key for pampinus [labs/private] - 10https://gerrit.wikimedia.org/r/793498 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [15:37:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS bullseye [15:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:35] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye [15:39:40] (03PS10) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [15:40:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28151 and previous config saved to /var/cache/conftool/dbconfig/20220519-154003-ladsgroup.json [15:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:12] (03PS3) 10Jforrester: Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: 10Tchanders) [15:41:14] (03PS1) 10Jforrester: Add SimilarEditors extension – II: Add to InitialiseSettings, default off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793500 (https://phabricator.wikimedia.org/T306909) [15:41:16] (03PS1) 10Jforrester: Add SimilarEditors extension – III: Add to CommonSettings 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/793501 (https://phabricator.wikimedia.org/T306909) [15:41:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T303603)', diff saved to https://phabricator.wikimedia.org/P28152 and previous config saved to /var/cache/conftool/dbconfig/20220519-154124-ladsgroup.json [15:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:30] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [15:47:37] (03PS11) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [15:50:46] (03PS1) 10Ottomata: Release 2.1.4-py3.7-5 [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/793504 (https://phabricator.wikimedia.org/T307115) [15:53:14] (03PS12) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [15:54:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gerrit2002.wikimedia.org with OS bullseye [15:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:41] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye executed with errors: - gerrit2002... 
[15:55:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28153 and previous config saved to /var/cache/conftool/dbconfig/20220519-155509-ladsgroup.json [15:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:50] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:57:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS bullseye [15:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:15] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye [15:58:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [15:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:36] (03PS1) 10Jcrespo: dbbackups: Fix hiera key formatting for db password and django secret [labs/private] - 10https://gerrit.wikimedia.org/r/793505 (https://phabricator.wikimedia.org/T283017) [15:59:56] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] dbbackups: Fix hiera key formatting for db password and django secret [labs/private] - 10https://gerrit.wikimedia.org/r/793505 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [16:00:04] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:01:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [16:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:37] (03PS2) 10Giuseppe Lavagetto: deployment_server: use helm_user_group everywhere for consistency [puppet] - 10https://gerrit.wikimedia.org/r/793417 [16:04:00] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35418/console" [puppet] - 10https://gerrit.wikimedia.org/r/793417 (owner: 10Giuseppe Lavagetto) [16:04:15] (03PS13) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [16:06:27] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] deployment_server: use helm_user_group everywhere for consistency [puppet] - 10https://gerrit.wikimedia.org/r/793417 (owner: 10Giuseppe Lavagetto) [16:10:03] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10bking) [16:10:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298560)', diff saved to https://phabricator.wikimedia.org/P28154 and previous config saved to /var/cache/conftool/dbconfig/20220519-161014-ladsgroup.json [16:10:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1182.eqiad.wmnet with reason: Maintenance [16:10:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1182.eqiad.wmnet with reason: Maintenance [16:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:21] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [16:10:23] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298560)', diff saved to https://phabricator.wikimedia.org/P28155 and previous config saved to /var/cache/conftool/dbconfig/20220519-161022-ladsgroup.json [16:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:29] (03PS14) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [16:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS bullseye [16:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:12] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye completed: - gerrit2002 (**PASS**)... 
[16:18:55] (03PS7) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) [16:18:58] (03PS1) 10JMeybohm: Add crds.yaml fixtures to charts and istio schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/793509 (https://phabricator.wikimedia.org/T306165) [16:19:54] (03Abandoned) 10JMeybohm: Add a rake task to generate JSON schema for chart CRDs on the fly [deployment-charts] - 10https://gerrit.wikimedia.org/r/792577 (https://phabricator.wikimedia.org/T306165) (owner: 10JMeybohm) [16:23:16] PROBLEM - Check systemd state on an-tool1011 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:28] (03PS1) 10Volans: interface_automation: don't fail on empty cable ID [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/793512 (https://phabricator.wikimedia.org/T308768) [16:27:39] (03PS15) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [16:28:06] (03CR) 10Volans: [C: 03+2] "Self merging to unblock DCOps" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/793512 (https://phabricator.wikimedia.org/T308768) (owner: 10Volans) [16:28:15] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [16:28:55] (03Merged) 10jenkins-bot: interface_automation: don't fail on empty cable ID [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/793512 (https://phabricator.wikimedia.org/T308768) (owner: 10Volans) [16:30:52] (03PS16) 10Jcrespo: [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 
(https://phabricator.wikimedia.org/T283017) [16:31:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:31:17] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [16:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:29] (03CR) 10jerkins-bot: [V: 04-1] [WIP]django: Create custom django module and apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [16:31:45] 10SRE, 10Infrastructure-Foundations, 10netops: codfw: Provision a server script can not run without a cable ID" - https://phabricator.wikimedia.org/T308768 (10Volans) @Papaul the above patch was merged and deployed. I think it should fix the issue. Please resolve the task if that's the case or let me know wh... 
[16:34:06] (03PS5) 10JMeybohm: Add debian directory [debs/kubeconform] - 10https://gerrit.wikimedia.org/r/792999 (https://phabricator.wikimedia.org/T306165) [16:35:30] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) 05Open→03Resolved This is complete [16:35:59] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [16:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:46] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet [16:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:35] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet [16:44:42] (03CR) 10Zabe: vagrant: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:48:08] 10SRE, 10Airflow, 10Data-Engineering, 10Patch-For-Review: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (10Ottomata) [16:48:32] (03CR) 10Ryan Kemper: [C: 03+2] rdf query service: Apply WARN log level only to com.bigdata [puppet] - 10https://gerrit.wikimedia.org/r/792266 (https://phabricator.wikimedia.org/T306899) (owner: 10Ebernhardson) [16:49:04] 10SRE, 10Airflow, 10Data-Engineering, 10Patch-For-Review: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (10Ottomata) Mostly done, but to finish we are blocking on waiting for Gitlab Docker images {T304845} [16:51:53] (03CR) 10Ryan Kemper: [C: 03+2] rdf query service: Apply WARN log level only to com.bigdata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792266 (https://phabricator.wikimedia.org/T306899) (owner: 10Ebernhardson) [16:54:29] (03PS17) 10Jcrespo: [WIP]django: Create custom django module and 
apply it to backupmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/793475 (https://phabricator.wikimedia.org/T283017) [16:55:56] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@95c1f50]: (no justification provided) [16:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:08] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@95c1f50]: (no justification provided) (duration: 00m 12s) [16:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:10] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Dzahn) Thank you very much @Papaul [17:03:07] !log otto@deploy1002 Started deploy [airflow-dags/analytics@95c1f50]: (no justification provided) [17:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:29] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@95c1f50]: (no justification provided) (duration: 00m 21s) [17:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:11] (03CR) 10Cathal Mooney: [C: 03+2] Add DHCP config files for new cloud host nets and rename older files [puppet] - 10https://gerrit.wikimedia.org/r/791595 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [17:06:29] (03CR) 10Cathal Mooney: [C: 03+2] Add new subnets for cloudsw expansion Eqiad to netops infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [17:06:43] (03PS5) 10Cathal Mooney: Add new subnets for cloudsw expansion Eqiad to netops infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) [17:08:13] (03CR) 10Andrew Bogott: [C: 03+2] openstack: Make enc api enforce keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) 
(owner: 10Majavah) [17:08:19] (03CR) 10Dzahn: [C: 03+2] annualreport: update redirect to 2020-2021 report [puppet] - 10https://gerrit.wikimedia.org/r/793447 (https://phabricator.wikimedia.org/T308737) (owner: 10Ammarpad) [17:09:00] andrewbogott: yes, you can merge both if you see mine :) classic conflict [17:09:14] ah, even better. it lets you pick just yours now [17:09:23] and I am unlocked, nice [17:09:55] it says no changes to merge so I'm assuming everything is fine :) [17:13:51] andrewbogott: yea, puppet-merge became smarter over time I think [17:14:03] it lets you say "merge only my stuff" in some cases [17:14:31] so while I had to wait for you to be done to get the lock file.. then it was only my change [17:14:36] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.111`. Pre-deploy tests passing on canary `wdqs1003` [17:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:44] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@a493d7f]: 0.3.111 [17:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:31] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, and 3 others: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10thcipriani) 05Open→03Resolved a:03jbond Confirmed working: ` thcipriani@gitlab1001:... 
[17:15:37] (03PS1) 10Majavah: P:openstack::rabbitmq: manage all users [puppet] - 10https://gerrit.wikimedia.org/r/793519 [17:16:08] !log [WDQS Deploy] Tests passing following deploy of `0.3.111` on canary `wdqs1003`; proceeding to rest of fleet [17:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35424/console" [puppet] - 10https://gerrit.wikimedia.org/r/793519 (owner: 10Majavah) [17:18:29] (03PS1) 10Cathal Mooney: Modifications to install server netboot.cfg ommited in previous change [puppet] - 10https://gerrit.wikimedia.org/r/793520 (https://phabricator.wikimedia.org/T304989) [17:20:22] the change to network/data/data.yaml means a ferm reload on $everything [17:22:55] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@a493d7f]: 0.3.111 (duration: 08m 11s) [17:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:58] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [17:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:08] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [17:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:13] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [17:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:02] !log [WCQS Deploy] Gearing up for deploy of wcqs `0.3.111` [17:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:26:10] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@a493d7f] (wcqs): Deploy 0.3.111 to WCQS [17:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:14] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@a493d7f] (wcqs): Deploy 0.3.111 to WCQS (duration: 03m 03s) [17:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:43] !log [WCQS Deploy] Tests looked good following deploy of `0.3.111` to canary `wcqs1002.eqiad.wmnet`; proceeded to rest of fleet [17:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:04] !log [WCQS Deploy] Restarted `wcqs-updater` across all hosts: `sudo -E cumin 'A:wcqs-public' 'systemctl restart wcqs-updater'` [17:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:49] !log [WCQS Deploy] Successful test query placed on commons-query.wikimedia.org, there's no relevant criticals in Icinga, and Grafana looks good. 
WCQS deploy complete [17:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:44] (03PS10) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) [17:37:27] (03PS1) 10Majavah: openstack: encapi: add a custom error class [puppet] - 10https://gerrit.wikimedia.org/r/793524 (https://phabricator.wikimedia.org/T274666) [17:37:40] (03CR) 10Ebernhardson: elastic: Restart masters one at a time after all others (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [17:40:28] (03PS1) 10BBlack: Add dumps mapping to cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/793525 (https://phabricator.wikimedia.org/T306550) [17:40:34] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:40:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:43:10] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:00] (03CR) 10jerkins-bot: [V: 04-1] elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [17:50:01] (03CR) 10Andrew Bogott: [C: 03+2] openstack: encapi: add a custom error class [puppet] - 10https://gerrit.wikimedia.org/r/793524 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [17:50:53] !log [WDQS Deploy] Deploy complete. 
Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [17:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:34] !log T306899 Rolled `wdqs` and `wcqs` deploys to adjust logging settings. Hoping this gives us more visibility on the 500 errors WCQS users have been experiencing. [17:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:38] T306899: WCQS 500 errors - https://phabricator.wikimedia.org/T306899 [17:55:59] !log [WDQS Deploy] Slight amendment to the above; we're seeing status `Unknown` for `Categories endpoint` and `Categories update lag`. They've been warning for ~24h so it didn't surface following the deploy, but looking into that now [17:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:14] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:17] (03PS1) 10Ryan Kemper: query_service: noop to check pcc catalog [puppet] - 10https://gerrit.wikimedia.org/r/793529 [18:00:36] (03PS2) 10Ryan Kemper: query_service: noop to check pcc catalog [puppet] - 10https://gerrit.wikimedia.org/r/793529 [18:00:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:01:58] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35425/console" [puppet] - 10https://gerrit.wikimedia.org/r/793529 (owner: 10Ryan Kemper) [18:04:30] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2054 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:08:24] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:19] (03Abandoned) 10Ryan Kemper: query_service: noop to check pcc catalog [puppet] - 10https://gerrit.wikimedia.org/r/793529 (owner: 10Ryan Kemper) [18:18:18] !log [WDQS Deploy] Traced the failure back to https://gerrit.wikimedia.org/r/c/operations/puppet/+/792700 presumably; trying to see what we can do to fix up the patch without having to revert it since it touches stuff besides query service [18:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:35] To be explicit this is a monitoring issue and not indicative of actual service problems [18:28:38] PROBLEM - puppet last run on wdqs1009 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:29:01] (03PS1) 10Ryan Kemper: query_service: check_categories lives in /usr/local/lib now [puppet] - 10https://gerrit.wikimedia.org/r/793530 [18:29:54] !log [WDQS Deploy] Okay, so a recent refactor changed where the `check_categories.py` lives. Previously it was `/usr/lib/nagios/plugins/check_categories.py` and now it's `/usr/local/lib/nagios/plugins/check_categories.py`. 
So https://gerrit.wikimedia.org/r/793530 should fix things now [18:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:00] (03PS2) 10Ryan Kemper: query_service: check_categories lives in /usr/local/lib now [puppet] - 10https://gerrit.wikimedia.org/r/793530 [18:31:10] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/793530 (owner: 10Ryan Kemper) [18:32:06] (03PS3) 10Ryan Kemper: query_service: check_categories lives in /usr/local/lib now [puppet] - 10https://gerrit.wikimedia.org/r/793530 (https://phabricator.wikimedia.org/T308601) [18:32:35] 10SRE, 10ops-ulsfo: Update PDUs name-server config - https://phabricator.wikimedia.org/T295668 (10RobH) 05Open→03Resolved [18:34:28] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35426/console" [puppet] - 10https://gerrit.wikimedia.org/r/793530 (https://phabricator.wikimedia.org/T308601) (owner: 10Ryan Kemper) [18:35:50] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2054 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:36:46] (03CR) 10Andrew Bogott: [C: 03+1] Move rabbitmq to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/791367 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [18:37:34] (03CR) 10Ryan Kemper: [V: 03+1] "Confirmed that the following works when manually ran on the host:" [puppet] - 10https://gerrit.wikimedia.org/r/793530 (https://phabricator.wikimedia.org/T308601) (owner: 10Ryan Kemper) [18:37:36] (03CR) 10Ebernhardson: [C: 03+1] query_service: check_categories lives in /usr/local/lib now [puppet] - 10https://gerrit.wikimedia.org/r/793530 (https://phabricator.wikimedia.org/T308601) (owner: 10Ryan Kemper) [18:38:04] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] query_service: check_categories lives in /usr/local/lib now [puppet] - 
10https://gerrit.wikimedia.org/r/793530 (https://phabricator.wikimedia.org/T308601) (owner: 10Ryan Kemper) [18:41:16] RECOVERY - puppet last run on wdqs1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:41:44] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:43:09] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Dzahn) @jbond has good points here, I think. Could be clarified what membership in "ops" is for. My first guesses of the most important parts would be... [18:45:07] !log [WDQS Deploy] Deployed https://gerrit.wikimedia.org/r/793530; ran puppet agent across wdqs* and just kicked off a re-check of the NRPE alerts. We'll see if that clears the Unknown state up [18:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:24] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (10RobH) 05Open→03Resolved @MoritzMuehlenhoff this host is now ready to return to service, its memory has been replaced. [18:49:54] !log [WDQS Deploy] `Unknown` status resolved following deploy of https://gerrit.wikimedia.org/r/793530 ; wdqs categories monitoring is healthy again. 
We're done here [18:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:00] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:58:14] (03CR) 10Dzahn: [C: 03+2] dns: add PTR records for gitlab-replica-new.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/793067 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [18:58:17] (03PS2) 10Dzahn: dns: add PTR records for gitlab-replica-new.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/793067 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [19:01:52] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:05:49] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10jbond) There is already an group named sre-admins (used for SRE's without root), that gives the same SSO access to web service ops the ops group, but... 
[19:05:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:49] (03PS1) 10Andrew Bogott: Cinder-backups: prepare for upgrade to version Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/793531 [19:07:15] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (10Jclark-ctr) a:05Jclark-ctr→03Papaul [19:08:14] (03CR) 10Andrew Bogott: [C: 03+2] Cinder-backups: prepare for upgrade to version Wallaby [puppet] - 10https://gerrit.wikimedia.org/r/793531 (owner: 10Andrew Bogott) [19:11:14] (03CR) 10Dzahn: "before:" [dns] - 10https://gerrit.wikimedia.org/r/793067 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [19:16:21] (03Abandoned) 10Andrew Bogott: mariadb wmcs ferm: add ipv6 firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/591065 (owner: 10Andrew Bogott) [19:17:57] (03PS6) 10Andrew Bogott: striker: Add profile to provision docker container [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [19:22:22] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_imagecatalog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:26] (03CR) 10Andrew Bogott: striker: Add profile to provision docker container (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [19:30:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host netmon1003.mgmt.eqiad.wmnet with reboot policy FORCED [19:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host netmon1003.mgmt.eqiad.wmnet with reboot policy FORCED [19:31:32] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host netmon1003.mgmt.eqiad.wmnet with reboot policy FORCED [19:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:20] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host netmon1003.mgmt.eqiad.wmnet with reboot policy FORCED [19:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:06] (03PS1) 10Jelto: install_server: add custom partman config for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/793534 (https://phabricator.wikimedia.org/T307142) [19:46:02] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [19:48:10] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Swift [19:51:01] (03CR) 10Jelto: "I'd like to create the following layout on the new GitLab hosts (~900gb of disk space):" [puppet] - 10https://gerrit.wikimedia.org/r/793534 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [19:58:16] !log bking@relforge1004: banned relforge1003 from main and alpha clusters in preparation for reimage T308770 [19:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:22] T308770: Reimage relforge elastic hosts from Stretch to Bullseye - https://phabricator.wikimedia.org/T308770 [20:00:04] brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220519T2000). [20:00:04] koi: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:20] hi koi [20:01:24] hi! [20:03:17] (03PS6) 10Thcipriani: bnwikivoyage: Set $wgRelatedArticlesUseCirrusSearch to true on bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791734 (https://phabricator.wikimedia.org/T307904) (owner: 10Yahya) [20:10:08] (03CR) 10BryanDavis: [C: 04-1] "comment/spelling nits to fix, but mostly a reminder not to worry about merging until the week of 2022-05-23 when I will have time to follo" [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [20:10:48] Hi thcipriani, is there something blocking [20:12:24] (03CR) 10Bking: [C: 03+2] bnwikivoyage: Set $wgRelatedArticlesUseCirrusSearch to true on bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791734 (https://phabricator.wikimedia.org/T307904) (owner: 10Yahya) [20:13:15] (03Merged) 10jenkins-bot: bnwikivoyage: Set $wgRelatedArticlesUseCirrusSearch to true on bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791734 (https://phabricator.wikimedia.org/T307904) (owner: 10Yahya) [20:14:06] koi: nope, we're doing some deployment training, so we're moving a little slowly, thanks for bearing with us ;) [20:15:42] koi https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/791734 is up on mwdebug1001 , does it look OK? 
[20:15:55] looking [20:17:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:18:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:34] (03PS3) 10Bking: zhwikiversity: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792985 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:19:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:59] hmm, somehow strange that I could not see any recommendation under the article [20:21:38] koi apologies, we missed a step. Try it now? [20:22:04] aha yeah it works now [20:22:17] Cool, sorry for the delay [20:24:16] !log bking@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791734|bnwikivoyage: Set $wgRelatedArticlesUseCirrusSearch to true on bnwikivoyage (T307904)]] (duration: 00m 50s) [20:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:21] T307904: Set $wgRelatedArticlesUseCirrusSearch to true on bnwikivoyage - https://phabricator.wikimedia.org/T307904 [20:26:21] Also is it possible to sync a already merged patch? I mean https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/792748 [20:27:00] (scheduled in last window, not synced yet but it is indeed needed to be synced [20:27:01] it is possible, did that one not get synced? 
[20:27:31] someone said this is a no-op patch so do no sync of that [20:27:40] (03CR) 10Bking: [C: 03+2] zhwikiversity: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792985 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:28:01] but a bot on commonswiki relies on it (it read yaml file from noc.wikimedia) [20:28:23] (03Merged) 10jenkins-bot: zhwikiversity: Declare commons files for logo and its variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792985 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:28:54] koi: got it, it looks like it touches the same files as ^ patch, is that right? [20:29:06] yeah same file [20:29:17] cool, we'll sync it with the next one then [20:29:29] many thanks! [20:30:05] koi just pushed the next patch to mwdebug1001 , let me know if it looks OK [20:30:46] you mean "Declare commons files for ..."? No need to test them IMO [20:31:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:31:23] koi: got it, thanks [20:33:20] !log bking@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:792985|zhwikiversity: Declare commons files for logo and its variant (T308620)]] (duration: 00m 53s) [20:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:27] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [20:34:25] !log bking@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:792985|zhwikiversity: Declare commons files for logo and its variant (T308620)]] (duration: 00m 50s) [20:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [20:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:35:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:55] (03CR) 10Bking: [C: 03+2] zhwikiversity: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793128 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:37:41] (03Merged) 10jenkins-bot: zhwikiversity: Optimize logo per commons files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793128 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:40:44] !log bking@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:793128|zhwikiversity: Optimize logo per commons files (T308620)]] (duration: 00m 51s) [20:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:49] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [20:41:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:42:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:27] !log UTC late deploys done [20:49:21] !log UTC late deploys done [20:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:37] til it doesn't like a space in front [20:50:15] the HISTCONTROL=ignorespace of irc [20:50:49] ^ also TIL [20:51:08] the !log thing [20:52:13] oh yeah, I've already managed to make that mistake soooo many times ;( [21:03:02] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:14:19] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (10RobH) a:05RobH→03MoritzMuehlenhoff Moritz, ganeti5003 firmware updates completed: nic 21.85.21.92, bios 2.14.2, idrac 5.10.10.00. system booted back into OS and is online for reimage later.... 
[21:15:50] (03PS1) 10Dwisehaupt: Add in A and PTR records for civicrm-staging [dns] - 10https://gerrit.wikimedia.org/r/793540 (https://phabricator.wikimedia.org/T308672) [21:16:22] (03PS2) 10Dwisehaupt: Add in A and PTR records for civicrm-staging [dns] - 10https://gerrit.wikimedia.org/r/793540 (https://phabricator.wikimedia.org/T308672) [21:18:22] (03CR) 10Jgreen: [C: 03+2] Add in A and PTR records for civicrm-staging [dns] - 10https://gerrit.wikimedia.org/r/793540 (https://phabricator.wikimedia.org/T308672) (owner: 10Dwisehaupt) [21:19:55] (03PS2) 10Dwisehaupt: Add missing forward entries for frack nat addresses [dns] - 10https://gerrit.wikimedia.org/r/793121 (https://phabricator.wikimedia.org/T308672) [21:20:16] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:23:15] (03CR) 10Jgreen: [C: 03+2] Add missing forward entries for frack nat addresses [dns] - 10https://gerrit.wikimedia.org/r/793121 (https://phabricator.wikimedia.org/T308672) (owner: 10Dwisehaupt) [21:29:06] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 4, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:35:48] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:35:57] dwisehaupt: you might already know but keep in mind you need to run the cookbook for DNS changes nowadays [21:36:19] just because I got reminded myself the other day, heh [21:38:00] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 4, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:39:35] (03PS1) 10JHathaway: dumps: remove generic python 2.25.1 user agent block [puppet] - 10https://gerrit.wikimedia.org/r/793550 [21:42:01] sorry guys those bgp alerts 
are due to me. just new peers I've added. [21:42:58] thanks, always good to know stuff is known [21:44:11] yeah my bad, I've disabled LibreNMS alerts now so it won't give out again, will re-enable when I'm done. [21:45:38] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:55:34] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:58:26] (03CR) 10JHathaway: [C: 03+1] "Looks good overall, just one small question." [alerts] - 10https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [22:04:26] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 3, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:07:13] !;pg cp3060 idrac interface frozen, rebooted via power outlet control on T243167 [22:07:13] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 [22:07:30] !log cp3060 idrac interface frozen, rebooted via power outlet control on T243167 [22:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:44] PROBLEM - Host cloudsw1-c8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [22:08:26] now let's see if cp3060 resurrects [22:08:48] ...it's old and anytime you fully remove power and put it back, if any hw is iffy, it dies. 
[22:08:54] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Active - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:11:08] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:13:18] 10SRE, 10Traffic-Icebox: Servers freezing across the caching cluster - https://phabricator.wikimedia.org/T238305 (10RobH) [22:13:31] 10SRE, 10ops-esams, 10DC-Ops, 10Traffic-Icebox: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) 05Open→03Resolved cp3060's idrac https interface just pulls up and endlessly is 'loading' (see attached screen shot). I tried a racreset command via i... [22:15:33] 10ops-esams, 10DC-Ops: cp3060 idrac https interface failures - https://phabricator.wikimedia.org/T308797 (10RobH) p:05Triage→03Medium [22:16:55] 10ops-esams, 10DC-Ops: cp3060 idrac https interface failures - https://phabricator.wikimedia.org/T308797 (10RobH) At the time of this task filing, the nic not getting updated firmware could lead to errors as noted on T243167, even though it hasn't yet for cp3060 specifically. This isn't causing a system/host... [22:18:24] ACKNOWLEDGEMENT - Host cloudsw1-c8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% Cathal Mooney Doing cloudsw migration - just change in loopback - The acknowledgement expires at: 2022-05-20 22:17:54. 
[22:21:24] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:22:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host netmon1003.mgmt.eqiad.wmnet with reboot policy FORCED [22:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:12] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:23:13] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host netmon1003.mgmt.eqiad.wmnet with reboot policy FORCED [22:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:20] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:26:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host netmon1003.wikimedia.org with OS bullseye [22:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:13] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host netmon1003.wikimedia.org with OS bullseye [22:26:32] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 3, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:33:50] 10SRE, 10GitLab (Auth & Access), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10thcipriani) 05Open→03Resolved a:03brennen 
[22:43:16] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:43:44] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:45:32] (03PS1) 10Papaul: Add netmon1003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/793557 (https://phabricator.wikimedia.org/T299106) [22:47:00] (03CR) 10Papaul: [C: 03+2] Add netmon1003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/793557 (https://phabricator.wikimedia.org/T299106) (owner: 10Papaul) [23:00:34] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:01:18] sigh, my efforts to downtime these routers have not been successful, it seems. 
[23:01:21] let me have another look [23:01:52] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:12:09] (03PS1) 10BBlack: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) [23:13:16] (03CR) 10jerkins-bot: [V: 04-1] [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack) [23:22:44] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:28:18] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 3, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:37:57] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host netmon1003.wikimedia.org with OS bullseye [23:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:02] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install netmon1003 - https://phabricator.wikimedia.org/T299106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host netmon1003.wikimedia.org with OS bullseye executed with errors: - netmon1003 (**FAIL**)... 
[23:38:18] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:48:14] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:59:10] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook