[00:00:14] (PS1) Dzahn: kubernetes::deployment_server: add new service image-suggestion [puppet] - https://gerrit.wikimedia.org/r/784791 (https://phabricator.wikimedia.org/T251305) [00:04:51] (PS2) Dzahn: kubernetes::deployment_server: add new service image-suggestion [puppet] - https://gerrit.wikimedia.org/r/784791 (https://phabricator.wikimedia.org/T304891) [00:07:01] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:25] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/34931/deploy1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/784791 (https://phabricator.wikimedia.org/T304891) (owner: Dzahn) [00:09:01] !log alert1001 - sudo systemctl start certspotter (after an alert from Icinga itself that it failed. error was some temp error fetching data from comodo) [00:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:15] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:21] ^ fixed [00:10:55] (CR) Ssingh: P:wikidough: add a check to ensure service has been restarted (1 comment) [puppet] - https://gerrit.wikimedia.org/r/784697 (owner: Ssingh) [00:12:13] SRE, Generated Data Platform, Image-Suggestions, serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (Dzahn) >>! In T304891#7823942, @JMeybohm wrote: > We still have those in labs/private `hieradata/common/profile... [00:12:53] SRE, Generated Data Platform, Image-Suggestions, serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (Dzahn) >>! 
In T304891#7823946, @Joe wrote: > * The deployment will be called image-suggestion and use the image... [00:21:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [00:21:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [00:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25834 and previous config saved to /var/cache/conftool/dbconfig/20220421-002107-ladsgroup.json [00:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:24:28] (PS1) Dzahn: kubernetes: add dummy tokens for image-suggestion service [labs/private] - https://gerrit.wikimedia.org/r/784794 (https://phabricator.wikimedia.org/T304891) [00:24:49] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:15] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:25] (PS2) Dzahn: kubernetes: add dummy tokens for image-suggestion service [labs/private] - https://gerrit.wikimedia.org/r/784794 (https://phabricator.wikimedia.org/T304891) [00:30:33] !log alert1001 - sudo systemctl start certspotter - 
another time, not on our end but should probably fail more gracefully [00:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:04] (CR) Dzahn: [V: +2 C: +2] kubernetes: add dummy tokens for image-suggestion service [labs/private] - https://gerrit.wikimedia.org/r/784794 (https://phabricator.wikimedia.org/T304891) (owner: Dzahn) [00:37:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25835 and previous config saved to /var/cache/conftool/dbconfig/20220421-003720-ladsgroup.json [00:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:48:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25836 and previous config saved to /var/cache/conftool/dbconfig/20220421-004846-ladsgroup.json [00:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:52:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P25837 and previous config saved to /var/cache/conftool/dbconfig/20220421-005225-ladsgroup.json [00:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[01:02:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:03:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:03:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25838 and previous config saved to /var/cache/conftool/dbconfig/20220421-010351-ladsgroup.json [01:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:23] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47965 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:05:21] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:07:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P25839 and previous config saved to /var/cache/conftool/dbconfig/20220421-010730-ladsgroup.json [01:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25840 and previous config saved to /var/cache/conftool/dbconfig/20220421-011856-ladsgroup.json [01:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25841 and previous config saved to /var/cache/conftool/dbconfig/20220421-012235-ladsgroup.json [01:22:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [01:22:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [01:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25842 and previous config saved to /var/cache/conftool/dbconfig/20220421-013401-ladsgroup.json [01:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:34:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [01:34:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [01:34:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25843 and previous config saved to /var/cache/conftool/dbconfig/20220421-013456-ladsgroup.json [01:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:31] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:40:45] (JobUnavailable) firing: (3) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25844 and previous config saved to /var/cache/conftool/dbconfig/20220421-014116-ladsgroup.json [01:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:45:45] (JobUnavailable) firing: (3) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:56:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25845 and previous config saved to /var/cache/conftool/dbconfig/20220421-015621-ladsgroup.json [01:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:51] (PS5) RLazarus: varnish: Rename public_clouds.json to ipblock_cloud.json [puppet] - https://gerrit.wikimedia.org/r/784761 (https://phabricator.wikimedia.org/T305581) [02:02:53] (PS6) RLazarus: varnish: Allow using netmapper with multiple requestctl ipblock types [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) [02:02:55] (PS1) RLazarus: cache: Support multiple requestctl ipblock types in netmapper confd template [puppet] - https://gerrit.wikimedia.org/r/784798 (https://phabricator.wikimedia.org/T305581) [02:04:07] (CR) jerkins-bot: [V: -1] cache: Support multiple requestctl ipblock types in netmapper confd template [puppet] - https://gerrit.wikimedia.org/r/784798 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [02:07:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [02:07:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [02:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25846 and previous config saved to /var/cache/conftool/dbconfig/20220421-020727-ladsgroup.json [02:07:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:11:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25847 and previous config saved to /var/cache/conftool/dbconfig/20220421-021126-ladsgroup.json [02:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:35] (PS2) RLazarus: cache: Support multiple requestctl ipblock types in netmapper confd template [puppet] - https://gerrit.wikimedia.org/r/784798 (https://phabricator.wikimedia.org/T305581) [02:14:37] (PS7) RLazarus: varnish: Allow using netmapper with multiple requestctl ipblock types [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) [02:15:44] (CR) RLazarus: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34932/console" [puppet] - https://gerrit.wikimedia.org/r/784798 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [02:19:05] (CR) RLazarus: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34933/console" [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [02:26:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25848 and previous config saved to /var/cache/conftool/dbconfig/20220421-022631-ladsgroup.json [02:26:33] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [02:26:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [02:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:26:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [02:30:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [02:30:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [02:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [02:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [02:32:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [02:32:49] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:37:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [02:37:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [02:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25849 and previous config saved to /var/cache/conftool/dbconfig/20220421-023710-ladsgroup.json [02:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:39:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25850 and previous config saved to /var/cache/conftool/dbconfig/20220421-023942-ladsgroup.json [02:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:59] (PS1) Andrew Bogott: Make cloudservices2005-dev the new ns1.openstack.codfw1dev.wikimediacloud.org [dns] - 
https://gerrit.wikimedia.org/r/784800 (https://phabricator.wikimedia.org/T304881) [02:51:51] (CR) Andrew Bogott: [C: +2] Make cloudservices2005-dev the new ns1.openstack.codfw1dev.wikimediacloud.org [dns] - https://gerrit.wikimedia.org/r/784800 (https://phabricator.wikimedia.org/T304881) (owner: Andrew Bogott) [02:58:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25851 and previous config saved to /var/cache/conftool/dbconfig/20220421-025849-ladsgroup.json [02:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:58:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:59:25] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:07:38] (PS8) Andrew Bogott: Make new cloudweb2002-dev node into a cloudweb node [puppet] - https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) [03:07:40] (PS1) Andrew Bogott: acme_chief: add ldap certs for cloudservices200[4,5]-dev [puppet] - https://gerrit.wikimedia.org/r/784802 [03:09:02] (CR) Andrew Bogott: [C: +2] acme_chief: add ldap certs for cloudservices200[4,5]-dev [puppet] - https://gerrit.wikimedia.org/r/784802 (owner: Andrew Bogott) [03:11:03] (PS1) Andrew Bogott: Make cloudservices2004-dev the 
new ns0.openstack.codfw1dev.wikimediacloud.org [dns] - https://gerrit.wikimedia.org/r/784804 (https://phabricator.wikimedia.org/T304881) [03:13:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P25852 and previous config saved to /var/cache/conftool/dbconfig/20220421-031354-ladsgroup.json [03:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [03:24:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [03:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25853 and previous config saved to /var/cache/conftool/dbconfig/20220421-032503-ladsgroup.json [03:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:25:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [03:25:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [03:25:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance 
[03:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [03:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298563)', diff saved to https://phabricator.wikimedia.org/P25854 and previous config saved to /var/cache/conftool/dbconfig/20220421-032556-ladsgroup.json [03:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:04] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [03:27:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P25855 and previous config saved to /var/cache/conftool/dbconfig/20220421-032753-ladsgroup.json [03:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [03:28:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [03:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [03:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [03:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P25856 and previous config saved to /var/cache/conftool/dbconfig/20220421-032859-ladsgroup.json [03:29:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [03:29:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [03:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T306560)', diff saved to https://phabricator.wikimedia.org/P25857 and previous config saved to /var/cache/conftool/dbconfig/20220421-032906-ladsgroup.json [03:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:14] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [03:29:58] (PS3) Ladsgroup: Add fix_img_major_mime_null_T306560.py [software/schema-changes] - https://gerrit.wikimedia.org/r/784762 (https://phabricator.wikimedia.org/T306560) [03:31:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25858 and previous config saved to /var/cache/conftool/dbconfig/20220421-033154-ladsgroup.json [03:31:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:00] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:39:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:40:02] (PS9) Andrew Bogott: Make new cloudweb2002-dev node into a cloudweb node [puppet] - https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) [03:40:04] (PS1) Andrew Bogott: Add new designate hosts to profile::openstack::codfw1dev::designate_hosts [puppet] - https://gerrit.wikimedia.org/r/784806 [03:40:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T306560)', diff saved to https://phabricator.wikimedia.org/P25859 and previous config saved to /var/cache/conftool/dbconfig/20220421-034021-ladsgroup.json [03:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:40:28] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [03:40:55] (CR) Andrew Bogott: [C: +2] Add new designate hosts to profile::openstack::codfw1dev::designate_hosts [puppet] - https://gerrit.wikimedia.org/r/784806 (owner: Andrew Bogott) [03:44:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25860 and previous config saved to /var/cache/conftool/dbconfig/20220421-034404-ladsgroup.json [03:44:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [03:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[03:44:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [03:44:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:44:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db[2074,2094,2109,2127,2149].codfw.wmnet with reason: Maintenance [03:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db[2074,2094,2109,2127,2149].codfw.wmnet with reason: Maintenance [03:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25861 and previous config saved to /var/cache/conftool/dbconfig/20220421-034659-ladsgroup.json [03:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:55:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P25862 and previous config saved to /var/cache/conftool/dbconfig/20220421-035526-ladsgroup.json [03:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:08] (CR) Andrew Bogott: [C: +2] Make cloudservices2004-dev the new ns0.openstack.codfw1dev.wikimediacloud.org [dns] - https://gerrit.wikimedia.org/r/784804 
(https://phabricator.wikimedia.org/T304881) (owner: Andrew Bogott) [03:57:28] (CR) Andrew Bogott: [C: +2] Make new cloudweb2002-dev node into a cloudweb node [puppet] - https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) (owner: Andrew Bogott) [03:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:02:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25863 and previous config saved to /var/cache/conftool/dbconfig/20220421-040204-ladsgroup.json [04:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:07] (PS1) Andrew Bogott: misc hiera changes to add cloudweb2002-dev [puppet] - https://gerrit.wikimedia.org/r/784807 (https://phabricator.wikimedia.org/T304881) [04:08:32] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:09:10] PROBLEM - SSH on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:09:32] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:10:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P25864 and previous config saved to /var/cache/conftool/dbconfig/20220421-041032-ladsgroup.json [04:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:40] (CR) Andrew Bogott: [C: +2] misc hiera changes to add cloudweb2002-dev [puppet] - https://gerrit.wikimedia.org/r/784807 
(https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [04:11:04] RECOVERY - SSH on ms-fe1012 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:11:11] (03PS2) 10Andrew Bogott: misc hiera changes to add cloudweb2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/784807 (https://phabricator.wikimedia.org/T304881) [04:11:28] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Swift [04:12:09] (03CR) 10Andrew Bogott: [C: 03+2] misc hiera changes to add cloudweb2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/784807 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [04:12:34] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift [04:17:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25865 and previous config saved to /var/cache/conftool/dbconfig/20220421-041710-ladsgroup.json [04:17:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [04:17:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [04:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:15] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:21:36] PROBLEM - Check systemd state on cloudweb2002-dev is CRITICAL: CRITICAL - degraded: The following units failed: mcrouter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [04:21:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [04:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:21:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25866 and previous config saved to /var/cache/conftool/dbconfig/20220421-042142-ladsgroup.json [04:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T306560)', diff saved to https://phabricator.wikimedia.org/P25867 and previous config saved to /var/cache/conftool/dbconfig/20220421-042537-ladsgroup.json [04:25:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [04:25:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [04:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:42] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [04:25:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T306560)', diff saved to https://phabricator.wikimedia.org/P25868 and previous config saved to 
/var/cache/conftool/dbconfig/20220421-042545-ladsgroup.json [04:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:56] RECOVERY - Check systemd state on cloudweb2002-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25869 and previous config saved to /var/cache/conftool/dbconfig/20220421-043014-ladsgroup.json [04:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:37:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [04:37:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [04:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:46] PROBLEM - Memcached on cloudweb2002-dev is CRITICAL: connect to address 208.80.153.41 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [04:45:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25870 and previous config saved to 
/var/cache/conftool/dbconfig/20220421-044519-ladsgroup.json [04:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:53] (03PS1) 10Andrew Bogott: Prepare cloudservices200[2,3]-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/784810 (https://phabricator.wikimedia.org/T304881) [04:49:50] (03CR) 10Andrew Bogott: [C: 03+2] Prepare cloudservices200[2,3]-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/784810 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [05:00:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25871 and previous config saved to /var/cache/conftool/dbconfig/20220421-050024-ladsgroup.json [05:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:01:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 31 hosts with reason: Primary switchover s8 T303927 [05:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:01:18] T303927: Switchover s8 master (db1109 -> db1104) - https://phabricator.wikimedia.org/T303927 [05:01:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 31 hosts with reason: Primary switchover s8 T303927 [05:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1104 with weight 0 T303927', diff saved to https://phabricator.wikimedia.org/P25872 and previous config saved to /var/cache/conftool/dbconfig/20220421-050154-ladsgroup.json [05:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:41] (03CR) 10Marostegui: [C: 03+2] monitor_eventscheduler.pp: Monitor event_scheduler on tests hosts [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [05:02:44] 
(KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:02:50] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:07:18] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:09:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T306560)', diff saved to https://phabricator.wikimedia.org/P25873 and previous config saved to /var/cache/conftool/dbconfig/20220421-050918-ladsgroup.json [05:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:24] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [05:09:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 T301879', diff saved to https://phabricator.wikimedia.org/P25874 and previous config saved to /var/cache/conftool/dbconfig/20220421-050931-marostegui.json [05:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:36] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [05:10:18] (03PS1) 10Marostegui: db1132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/784811 [05:12:56] (03CR) 10Marostegui: [C: 03+2] db1132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/784811 (owner: 10Marostegui) [05:15:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25875 and previous config saved to /var/cache/conftool/dbconfig/20220421-051529-ladsgroup.json [05:15:32] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [05:15:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [05:15:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:15:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25876 and previous config saved to /var/cache/conftool/dbconfig/20220421-051543-ladsgroup.json [05:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:48] (03PS3) 10Ladsgroup: mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/784681 (https://phabricator.wikimedia.org/T303927) [05:17:52] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/784681 (https://phabricator.wikimedia.org/T303927) 
(owner: 10Ladsgroup) [05:19:18] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/784681 (https://phabricator.wikimedia.org/T303927) (owner: 10Ladsgroup) [05:21:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [05:21:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [05:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P25877 and previous config saved to /var/cache/conftool/dbconfig/20220421-052146-ladsgroup.json [05:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:22:05] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:23:43] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.371 gt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:24:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P25878 and previous config saved to /var/cache/conftool/dbconfig/20220421-052423-ladsgroup.json [05:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:45] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:25:17] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:26:31] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:28:59] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [05:31:49] PROBLEM - Check systemd state on ms-be2046 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:32:53] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2046 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [05:33:47] (03PS1) 10Andrew Bogott: Update dns IP for new cloudservices hosts in codfw1dev [puppet] - 
10https://gerrit.wikimedia.org/r/785067 [05:34:35] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.98 ms [05:37:45] (03CR) 10Andrew Bogott: [C: 03+2] Update dns IP for new cloudservices hosts in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/785067 (owner: 10Andrew Bogott) [05:39:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P25879 and previous config saved to /var/cache/conftool/dbconfig/20220421-053928-ladsgroup.json [05:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:33] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:46:00] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:54:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T306560)', diff saved to https://phabricator.wikimedia.org/P25880 and previous config saved to /var/cache/conftool/dbconfig/20220421-055433-ladsgroup.json [05:54:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [05:54:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: 
Maintenance [05:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:39] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [05:54:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T306560)', diff saved to https://phabricator.wikimedia.org/P25881 and previous config saved to /var/cache/conftool/dbconfig/20220421-055441-ladsgroup.json [05:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T306560)', diff saved to https://phabricator.wikimedia.org/P25882 and previous config saved to /var/cache/conftool/dbconfig/20220421-055655-ladsgroup.json [05:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:05] Amir1: Which part do you want me to do? [05:57:15] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1006.eqiad.wmnet with OS bullseye [05:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:46] marostegui: maybe the after bits would be okay [05:58:52] ok [05:59:20] you do the query killers thing [05:59:26] as you'll have the output on the screen :p [05:59:31] anyways, let's go for it [06:00:05] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T0600). 
[06:00:08] o/ [06:00:10] awesome [06:00:12] let's go [06:00:14] !log Starting s8 eqiad failover from db1109 to db1104 - T303927 [06:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:20] T303927: Switchover s8 master (db1109 -> db1104) - https://phabricator.wikimedia.org/T303927 [06:00:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T303927', diff saved to https://phabricator.wikimedia.org/P25883 and previous config saved to /var/cache/conftool/dbconfig/20220421-060023-ladsgroup.json [06:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:39] RO confirmed [06:01:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1104 to s8 primary and set section read-write T303927', diff saved to https://phabricator.wikimedia.org/P25884 and previous config saved to /var/cache/conftool/dbconfig/20220421-060106-ladsgroup.json [06:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:17] heartbeat cleaned [06:01:32] I can write [06:01:37] done [06:02:47] (03PS2) 10Marostegui: wmnet: Update s8-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/784678 (https://phabricator.wikimedia.org/T303927) (owner: 10Ladsgroup) [06:03:11] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2046 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:03:14] marostegui: how do you measure the RO time? 
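(Editor's note: a minimal sketch of one way to estimate the read-only window from the two dbctl commits logged above. It assumes the timestamp embedded in each saved-config filename marks when that commit was made; the helper names `commit_time` and `read_only_seconds` are hypothetical and not part of dbctl.)

```python
import re
from datetime import datetime

# Timestamp embedded in the saved-config path that dbctl reports, e.g.
# /var/cache/conftool/dbconfig/20220421-060023-ladsgroup.json
STAMP = re.compile(r"/dbconfig/(\d{8}-\d{6})-")


def commit_time(log_line: str) -> datetime:
    """Extract the commit timestamp from a dbctl !log line (hypothetical helper)."""
    match = STAMP.search(log_line)
    if match is None:
        raise ValueError("no dbctl config timestamp found in line")
    return datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")


def read_only_seconds(ro_line: str, rw_line: str) -> float:
    """Seconds between the read-only commit and the read-write commit."""
    return (commit_time(rw_line) - commit_time(ro_line)).total_seconds()


# The two commits from this switchover, shortened to the relevant parts.
ro_line = (
    "'Set s8 eqiad as read-only for maintenance - T303927', previous config saved to "
    "/var/cache/conftool/dbconfig/20220421-060023-ladsgroup.json"
)
rw_line = (
    "'Promote db1104 to s8 primary and set section read-write T303927', previous config saved to "
    "/var/cache/conftool/dbconfig/20220421-060106-ladsgroup.json"
)

print(read_only_seconds(ro_line, rw_line))  # 43.0
```

(The IRC timestamps on the same two lines, 06:00:23 and 06:01:07, give a similar figure; the filename stamps are used here only because they come from dbctl itself.)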
[06:03:25] Amir1: dbctl times [06:03:32] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s8-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/784678 (https://phabricator.wikimedia.org/T303927) (owner: 10Ladsgroup) [06:03:39] Amir1: let's finish the pending things first [06:04:33] Amir1: before updating the task, refresh it, as you are reverting my changes :p [06:04:45] you're too fast 🤬 [06:05:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1109 T303927', diff saved to https://phabricator.wikimedia.org/P25885 and previous config saved to /var/cache/conftool/dbconfig/20220421-060512-root.json [06:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:39] so only checking zarcillo left? [06:05:42] I did that [06:05:53] awesome [06:06:02] I let you have fun with db1109 [06:06:04] Amir1: I have depooled the old master for the schema changes [06:06:05] (03PS1) 10Marostegui: db1109: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/785071 (https://phabricator.wikimedia.org/T303927) [06:06:10] But we need to repool it before the weekend [06:06:24] Amir1: I don't have any schema changes pending for db1109, you do :p [06:06:29] I can repool it once I wake up, does that sound good to you? [06:06:36] sounds good [06:06:40] remember to revert: https://gerrit.wikimedia.org/r/c/operations/puppet/+/785071/ [06:06:48] (03CR) 10Marostegui: [C: 03+2] db1109: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/785071 (https://phabricator.wikimedia.org/T303927) (owner: 10Marostegui) [06:07:02] I honestly can't keep track of schema changes I'm doing :P [06:07:05] let me see [06:07:07] you have the RO times? [06:07:15] You can just add them to the task and close it [06:08:19] hmm, only T300381 is pending? 
[06:08:20] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [06:08:42] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1006.eqiad.wmnet with reason: host reimage [06:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:52] Amir1: Ah, I thought you had something else [06:08:56] I will get that started now [06:09:15] I don't remember having any [06:09:49] I started this new thing last night but I ran it on master [06:09:51] Just started it [06:09:59] Anyways, I am going to get breakfast [06:10:03] T306560 [06:10:04] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [06:10:16] Close the switchover task whenever you like (add the RO times) [06:10:25] sure [06:11:37] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1006.eqiad.wmnet with reason: host reimage [06:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P25886 and previous config saved to /var/cache/conftool/dbconfig/20220421-061200-ladsgroup.json [06:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:31] done [06:13:39] I go rest [06:15:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25887 and previous config saved to /var/cache/conftool/dbconfig/20220421-061558-ladsgroup.json [06:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, 
user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:20:32] RECOVERY - Check systemd state on ms-be2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P25888 and previous config saved to /var/cache/conftool/dbconfig/20220421-062201-ladsgroup.json [06:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:25:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:27:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P25889 and previous config saved to /var/cache/conftool/dbconfig/20220421-062705-ladsgroup.json [06:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:31] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1006.eqiad.wmnet with OS bullseye [06:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25890 and previous config saved to /var/cache/conftool/dbconfig/20220421-063103-ladsgroup.json [06:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:02] (03PS1) 10Majavah: 
P:openstack::rabbitmq: fix file permissions [puppet] - 10https://gerrit.wikimedia.org/r/785105 (https://phabricator.wikimedia.org/T297268) [06:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:33:24] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34934/console" [puppet] - 10https://gerrit.wikimedia.org/r/785105 (https://phabricator.wikimedia.org/T297268) (owner: 10Majavah) [06:34:00] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2006.codfw.wmnet with OS bullseye [06:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P25891 and previous config saved to /var/cache/conftool/dbconfig/20220421-063706-ladsgroup.json [06:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T306560)', diff saved to https://phabricator.wikimedia.org/P25892 and previous config saved to /var/cache/conftool/dbconfig/20220421-064210-ladsgroup.json [06:42:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:42:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:16] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [06:42:19] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [06:42:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [06:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P25893 and previous config saved to /var/cache/conftool/dbconfig/20220421-064245-ladsgroup.json [06:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P25894 and previous config saved to /var/cache/conftool/dbconfig/20220421-064514-ladsgroup.json [06:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25895 and previous config saved to /var/cache/conftool/dbconfig/20220421-064608-ladsgroup.json [06:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:21] (03PS3) 10Elukey: Add four new k8s worker nodes to ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/784701 (https://phabricator.wikimedia.org/T306545) [06:52:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P25896 and previous config saved to 
/var/cache/conftool/dbconfig/20220421-065211-ladsgroup.json [06:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Amir1, apergos, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T0700). [07:00:14] hello. [07:00:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P25897 and previous config saved to /var/cache/conftool/dbconfig/20220421-070019-ladsgroup.json [07:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:29] no patches in the window. no trainees signed up. [07:00:53] if anyone wants to self deploy something, add yourself to https://wikitech.wikimedia.org/wiki/Deployments and get it done, now's the time. [07:01:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25898 and previous config saved to /var/cache/conftool/dbconfig/20220421-070113-ladsgroup.json [07:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with
reason: Maintenance [07:02:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25899 and previous config saved to /var/cache/conftool/dbconfig/20220421-070208-ladsgroup.json [07:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:40] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2006.codfw.wmnet with reason: host reimage [07:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:18] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (10Volans) Sorry, I overlooked the request. As your account uses an `@wikimedia.org` email address, I've granted you the `wmf` group in LDAP and revoked the `nda` one, as they can't coexist. But don't... 
[07:06:13] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2006.codfw.wmnet with reason: host reimage [07:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P25900 and previous config saved to /var/cache/conftool/dbconfig/20220421-070716-ladsgroup.json [07:07:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [07:07:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [07:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:07:21] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P25901 and previous config saved to /var/cache/conftool/dbconfig/20220421-070729-ladsgroup.json [07:07:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25902 and previous config saved to /var/cache/conftool/dbconfig/20220421-070744-ladsgroup.json [07:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P25903 and previous config saved to /var/cache/conftool/dbconfig/20220421-071524-ladsgroup.json [07:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:52] (03CR) 10Muehlenhoff: admin: update samtar account (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784704 (https://phabricator.wikimedia.org/T306518) (owner: 10Volans) [07:22:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25904 and previous config saved to /var/cache/conftool/dbconfig/20220421-072249-ladsgroup.json [07:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P25905 and previous config saved to /var/cache/conftool/dbconfig/20220421-073029-ladsgroup.json [07:30:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [07:30:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [07:30:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:35] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [07:30:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P25906 and previous config saved to /var/cache/conftool/dbconfig/20220421-073037-ladsgroup.json [07:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P25907 and previous config saved to /var/cache/conftool/dbconfig/20220421-073306-ladsgroup.json [07:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25908 and previous config saved to /var/cache/conftool/dbconfig/20220421-073755-ladsgroup.json [07:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:57] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] [beta] Enable Kartographer nearby feature on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784702 (https://phabricator.wikimedia.org/T304076) (owner: 10WMDE-Fisch) [07:44:57] (03PS1) 10Majavah: P:openstack::encapi: add tls for write endpoint [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) [07:46:45] (03PS1) 10Elukey: profile::ores::web: allow undef passwords for redis [puppet] - 10https://gerrit.wikimedia.org/r/785111 [07:48:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to 
https://phabricator.wikimedia.org/P25909 and previous config saved to /var/cache/conftool/dbconfig/20220421-074811-ladsgroup.json [07:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:43] (03CR) 10Elukey: [C: 03+2] profile::ores::web: allow undef passwords for redis [puppet] - 10https://gerrit.wikimedia.org/r/785111 (owner: 10Elukey) [07:53:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25910 and previous config saved to /var/cache/conftool/dbconfig/20220421-075300-ladsgroup.json [07:53:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:53:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:53:09] (03PS2) 10Majavah: P:openstack::encapi: add tls for write endpoint [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) [07:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:24] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:53:28] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2006.codfw.wmnet 
with OS bullseye [07:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:49] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34937/console" [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [07:53:58] (03PS2) 10Volans: admin: update samtar account [puppet] - 10https://gerrit.wikimedia.org/r/784704 (https://phabricator.wikimedia.org/T306518) [07:54:06] (03CR) 10Volans: "addressed comment" [puppet] - 10https://gerrit.wikimedia.org/r/784704 (https://phabricator.wikimedia.org/T306518) (owner: 10Volans) [07:55:10] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/784704 (https://phabricator.wikimedia.org/T306518) (owner: 10Volans) [07:55:16] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:55:38] (03CR) 10Volans: [C: 03+2] admin: update samtar account [puppet] - 10https://gerrit.wikimedia.org/r/784704 (https://phabricator.wikimedia.org/T306518) (owner: 10Volans) [07:55:47] (03PS3) 10Volans: admin: update samtar account [puppet] - 10https://gerrit.wikimedia.org/r/784704 (https://phabricator.wikimedia.org/T306518) [07:57:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [07:57:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [07:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25911 and previous config saved to 
/var/cache/conftool/dbconfig/20220421-075734-ladsgroup.json [07:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:03:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P25912 and previous config saved to /var/cache/conftool/dbconfig/20220421-080316-ladsgroup.json [08:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:38] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:04:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25913 and previous config saved to /var/cache/conftool/dbconfig/20220421-080420-ladsgroup.json [08:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:07:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P25914 and previous config saved to /var/cache/conftool/dbconfig/20220421-080744-ladsgroup.json [08:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:53] (03CR) 
10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/784742 (owner: 10Vivian Rook) [08:11:14] (03CR) 10Svantje Lilienthal: [C: 03+1] [beta] Enable Kartographer nearby feature on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784702 (https://phabricator.wikimedia.org/T304076) (owner: 10WMDE-Fisch) [08:11:44] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1005.eqiad.wmnet with OS bullseye [08:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:44] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: populate target index format and add pipeline diagnostics [puppet] - 10https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [08:18:02] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: rewrite ecs settings [puppet] - 10https://gerrit.wikimedia.org/r/777887 (https://phabricator.wikimedia.org/T305013) (owner: 10Cwhite) [08:18:14] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: transform rotation frequency values to datestamp format [puppet] - 10https://gerrit.wikimedia.org/r/777882 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [08:18:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P25915 and previous config saved to /var/cache/conftool/dbconfig/20220421-081821-ladsgroup.json [08:18:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:18:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:28] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [08:18:29] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: 
set partition on legacy indexes [puppet] - 10https://gerrit.wikimedia.org/r/777880 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [08:18:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P25916 and previous config saved to /var/cache/conftool/dbconfig/20220421-081829-ladsgroup.json [08:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:24] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (10KSiebert) @Volans Hey there, I am Sammy's manager and am approving! [08:19:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25917 and previous config saved to /var/cache/conftool/dbconfig/20220421-081925-ladsgroup.json [08:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P25918 and previous config saved to /var/cache/conftool/dbconfig/20220421-082249-ladsgroup.json [08:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:32] !nowandnext [08:23:37] (03PS1) 10Muehlenhoff: Apply role::webperf::processors_and_site to webperf1003/2003 [puppet] - 10https://gerrit.wikimedia.org/r/785115 (https://phabricator.wikimedia.org/T305460) [08:23:39] (03PS1) 10Muehlenhoff: Switch webperf1001/1003 for eventual removal [puppet] - 10https://gerrit.wikimedia.org/r/785116 (https://phabricator.wikimedia.org/T205460) [08:23:49] now [08:24:05] jouncebot: now [08:24:05] No deployments scheduled for the 
next 1 hour(s) and 35 minute(s) [08:24:23] * WMDE-Fisch merging a beta cluster config change [08:24:56] (03CR) 10WMDE-Fisch: [C: 03+2] [beta] Enable Kartographer nearby feature on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784702 (https://phabricator.wikimedia.org/T304076) (owner: 10WMDE-Fisch) [08:25:37] (03Merged) 10jenkins-bot: [beta] Enable Kartographer nearby feature on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784702 (https://phabricator.wikimedia.org/T304076) (owner: 10WMDE-Fisch) [08:25:48] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1005.eqiad.wmnet with reason: host reimage [08:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:31] (03PS1) 10Muehlenhoff: Extend Ferm rules for new webperf hosts [puppet] - 10https://gerrit.wikimedia.org/r/785117 (https://phabricator.wikimedia.org/T305460) [08:27:34] (03PS1) 10Muehlenhoff: Remove obsolete webperf hosts [puppet] - 10https://gerrit.wikimedia.org/r/785118 (https://phabricator.wikimedia.org/T305460) [08:28:07] * WMDE-Fisch done [08:29:13] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1005.eqiad.wmnet with reason: host reimage [08:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:30:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:30:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:22] (03PS1) 10KartikMistry: Update cxserver to 2022-04-21-081331-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/785120 (https://phabricator.wikimedia.org/T305115) [08:34:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25919 and previous config saved to /var/cache/conftool/dbconfig/20220421-083430-ladsgroup.json [08:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:28] (03PS5) 10Jcrespo: admin: Add placeholder to reserve uid and gid 914 for minio-user [puppet] - 10https://gerrit.wikimedia.org/r/784633 (https://phabricator.wikimedia.org/T305446) [08:37:12] (03CR) 10Jcrespo: [C: 03+2] admin: Add placeholder to reserve uid and gid 914 for minio-user [puppet] - 10https://gerrit.wikimedia.org/r/784633 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [08:37:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P25920 and previous config saved to /var/cache/conftool/dbconfig/20220421-083754-ladsgroup.json [08:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:09] (03CR) 10Volans: "I've left some comment/question inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/784697 (owner: 10Ssingh) [08:48:49] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1005.eqiad.wmnet with OS bullseye [08:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25921 and previous config saved to /var/cache/conftool/dbconfig/20220421-084935-ladsgroup.json [08:49:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [08:49:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [08:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:40] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25922 and previous config saved to /var/cache/conftool/dbconfig/20220421-084943-ladsgroup.json [08:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:59] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (10Volans) 05Open→03Resolved a:03Volans @KSiebert thanks, it's all done. 
There was some confusion about which email the account should be associated with. @TheresNoTime I'm res... [08:52:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25923 and previous config saved to /var/cache/conftool/dbconfig/20220421-085214-ladsgroup.json [08:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P25924 and previous config saved to /var/cache/conftool/dbconfig/20220421-085259-ladsgroup.json [08:53:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:53:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25925 and previous config saved to /var/cache/conftool/dbconfig/20220421-085307-ladsgroup.json [08:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:12] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2005.codfw.wmnet with OS bullseye [08:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:43] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1004.eqiad.wmnet with OS bullseye [08:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:44] (KubernetesRsyslogDown) 
firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:04:00] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:06:51] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2005.codfw.wmnet with reason: host reimage [09:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:11] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1004.eqiad.wmnet with reason: host reimage [09:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:10] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2005.codfw.wmnet with reason: host reimage [09:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:46] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1004.eqiad.wmnet with reason: host reimage [09:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:13] (03PS1) 10Jcrespo: Revert "dumps: Block python requests UA" [puppet] - 10https://gerrit.wikimedia.org/r/784715 [09:14:35] (03CR) 10jerkins-bot: [V: 04-1] Revert "dumps: Block python requests UA" [puppet] - 10https://gerrit.wikimedia.org/r/784715 (owner: 10Jcrespo) [09:15:57] (03PS2) 10Jcrespo: Revert "dumps: Block python requests UA" [puppet] - 10https://gerrit.wikimedia.org/r/784715 [09:18:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P25926 and previous config saved to /var/cache/conftool/dbconfig/20220421-091843-ladsgroup.json [09:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:49] T306560: Fix nullability of 
img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [09:25:05] 10SRE, 10Traffic: haproxy tls terminator autobanning - https://phabricator.wikimedia.org/T306580 (10Volans) p:05Triage→03Medium [09:26:34] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Volans) p:05Triage→03Medium [09:28:39] 10Puppet, 10SRE, 10Infrastructure-Foundations: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676 (10Volans) p:05Triage→03Medium [09:32:54] Krinkle: when you have a moment could you set the priority of T305794, thanks [09:32:54] T305794: Let X-Analytics response header pass through with WikimediaDebug - https://phabricator.wikimedia.org/T305794 [09:33:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P25927 and previous config saved to /var/cache/conftool/dbconfig/20220421-093348-ladsgroup.json [09:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:30] 10SRE, 10conftool: Provide a meaningful Retry-After value - https://phabricator.wikimedia.org/T305824 (10Vgutierrez) p:05Triage→03Medium [09:35:36] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1004.eqiad.wmnet with OS bullseye [09:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:17] (03PS1) 10Elukey: ores: support Celery 4 and 5 configurations [puppet] - 10https://gerrit.wikimedia.org/r/785124 (https://phabricator.wikimedia.org/T303801) [09:37:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [09:37:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [09:37:34] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:28] !log upgrading the Ganeti test cluster to 3.0 T306499 [09:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:33] T306499: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 [09:41:47] (03PS2) 10Elukey: ores: support Celery 4 and 5 configurations [puppet] - 10https://gerrit.wikimedia.org/r/785124 (https://phabricator.wikimedia.org/T303801) [09:41:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [09:41:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [09:41:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [09:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:59] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2005.codfw.wmnet with OS bullseye [09:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [09:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:43:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:43:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:01] (03PS3) 10Elukey: ores: support Celery 4 and 5 configurations [puppet] - 10https://gerrit.wikimedia.org/r/785124 (https://phabricator.wikimedia.org/T303801) [09:46:00] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:46:52] (03PS4) 10Elukey: ores: support Celery 4 and 5 configurations [puppet] - 10https://gerrit.wikimedia.org/r/785124 (https://phabricator.wikimedia.org/T303801) [09:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:48:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [09:48:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [09:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25928 and previous config saved to /var/cache/conftool/dbconfig/20220421-094807-ladsgroup.json [09:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, 
user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:48:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P25929 and previous config saved to /var/cache/conftool/dbconfig/20220421-094853-ladsgroup.json [09:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:49] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34941/console" [puppet] - 10https://gerrit.wikimedia.org/r/785124 (https://phabricator.wikimedia.org/T303801) (owner: 10Elukey) [09:51:57] (03PS2) 10Vgutierrez: cache::haproxy: Log emergency messages to disk [puppet] - 10https://gerrit.wikimedia.org/r/784256 (https://phabricator.wikimedia.org/T306236) [09:52:17] (03CR) 10Elukey: [V: 03+1 C: 03+2] ores: support Celery 4 and 5 configurations [puppet] - 10https://gerrit.wikimedia.org/r/785124 (https://phabricator.wikimedia.org/T303801) (owner: 10Elukey) [09:52:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1002.eqiad.wmnet [09:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25930 and previous config saved to /var/cache/conftool/dbconfig/20220421-095322-ladsgroup.json [09:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:54:29] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1002.eqiad.wmnet [09:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:03] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.07 ms [09:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25931 and previous config saved to /var/cache/conftool/dbconfig/20220421-095529-ladsgroup.json [09:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:20] 10SRE, 10Analytics, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) Thanks @Vgutierrez for bringing this to our attention. I agree that we should try to find the cause of these errors and eradicate it if at all... [09:57:59] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) a:03BTullis [10:00:05] mvolz: Dear deployers, time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T1000). 
[10:03:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P25932 and previous config saved to /var/cache/conftool/dbconfig/20220421-100359-ladsgroup.json [10:04:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:04:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:05] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [10:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:15] (03PS1) 10Muehlenhoff: Fix up host globbing for ping servers [puppet] - 10https://gerrit.wikimedia.org/r/785125 [10:04:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [10:04:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [10:04:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [10:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [10:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25933 and previous config saved to /var/cache/conftool/dbconfig/20220421-100827-ladsgroup.json [10:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:28] (03PS1) 10Marostegui: Revert "db1109: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/784716 [10:10:31] (03CR) 10Marostegui: [C: 03+2] Revert "db1109: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/784716 (owner: 10Marostegui) [10:10:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25934 and previous config saved to /var/cache/conftool/dbconfig/20220421-101034-ladsgroup.json [10:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 1%: After schema change', diff saved to https://phabricator.wikimedia.org/P25935 and previous config saved to /var/cache/conftool/dbconfig/20220421-101127-root.json [10:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2002.codfw.wmnet [10:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34942/console" [puppet] - 10https://gerrit.wikimedia.org/r/784701 (https://phabricator.wikimedia.org/T306545) (owner: 10Elukey) [10:17:44] 10SRE, 10Search-Console-access-request: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10jcrespo) @SCherukuwada I've asked some of the people in charge of operational security at SRE and they advised that the easiest way to handle expiration is to mo... 
[10:18:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2002.codfw.wmnet [10:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25936 and previous config saved to /var/cache/conftool/dbconfig/20220421-102332-ladsgroup.json [10:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25937 and previous config saved to /var/cache/conftool/dbconfig/20220421-102539-ladsgroup.json [10:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:27] 10SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (10MoritzMuehlenhoff) [10:26:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P25938 and previous config saved to /var/cache/conftool/dbconfig/20220421-102631-root.json [10:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:35] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:01] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS bullseye [10:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:07] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2004.codfw.wmnet with OS bullseye [10:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:38:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25939 and previous config saved to /var/cache/conftool/dbconfig/20220421-103837-ladsgroup.json [10:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:43] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:40:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25940 and previous config saved to /var/cache/conftool/dbconfig/20220421-104044-ladsgroup.json [10:40:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:40:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [10:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:40:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25941 and previous config saved to /var/cache/conftool/dbconfig/20220421-104057-ladsgroup.json [10:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P25942 and previous config saved to /var/cache/conftool/dbconfig/20220421-104135-root.json [10:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:00] 10SRE, 10Search-Console-access-request: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10jcrespo) > I will double check in case there is some pending expiration in the current calendar, These are the ones that should have been acted on (probably you... 
[10:46:08] (03PS1) 10Roman Stolar: Remove Thumbor Community Core as Wikimedia Thumbor dependency [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/785127 (https://phabricator.wikimedia.org/T305053) [10:46:59] (03CR) 10jerkins-bot: [V: 04-1] Remove Thumbor Community Core as Wikimedia Thumbor dependency [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/785127 (https://phabricator.wikimedia.org/T305053) (owner: 10Roman Stolar) [10:48:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping3002.esams.wmnet [10:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:38] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2004.codfw.wmnet with reason: host reimage [10:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping3002.esams.wmnet [10:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:47] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2004.codfw.wmnet with reason: host reimage [10:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:55:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P25944 and previous config saved to 
/var/cache/conftool/dbconfig/20220421-105638-root.json [10:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:39] (03PS2) 10Roman Stolar: Remove Thumbor Community Core as Wikimedia Thumbor dependency [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/785127 (https://phabricator.wikimedia.org/T305053) [10:58:00] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P25945 and previous config saved to /var/cache/conftool/dbconfig/20220421-105835-root.json [10:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:29] (03CR) 10Vgutierrez: "@fgiunchedi this is currently being used on traffic-cache-atstext-buster.traffic.eqiad1.wikimedia.cloud and it seems to be working as expe" [puppet] - 10https://gerrit.wikimedia.org/r/784256 (https://phabricator.wikimedia.org/T306236) (owner: 10Vgutierrez) [11:01:16] PROBLEM - traffic_server backend process restarted on cp2036 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2036&var-layer=backend [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:04:44] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:05:43] (03PS1) 10Majavah: add dummy password for cloudinfra token validator [labs/private] - 
10https://gerrit.wikimedia.org/r/785128 (https://phabricator.wikimedia.org/T274666) [11:05:45] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1002.eqiad.wmnet with OS bullseye [11:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:44] (03PS1) 10Marostegui: db2088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/785133 (https://phabricator.wikimedia.org/T306604) [11:11:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P25946 and previous config saved to /var/cache/conftool/dbconfig/20220421-111144-root.json [11:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:05] (03PS1) 10Majavah: P:openstack::encapi: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) [11:13:35] !log dbmaint s2@codfw T306604 [11:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P25947 and previous config saved to /var/cache/conftool/dbconfig/20220421-111340-root.json [11:13:41] T306604: db2088 filling up - https://phabricator.wikimedia.org/T306604 [11:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:12] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2004.codfw.wmnet with OS bullseye [11:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:44] * kart_ updating cxserver.. 
[11:16:01] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-04-21-081331-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/785120 (https://phabricator.wikimedia.org/T305115) (owner: 10KartikMistry) [11:19:24] (03CR) 10Marostegui: [C: 03+2] db2088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/785133 (https://phabricator.wikimedia.org/T306604) (owner: 10Marostegui) [11:19:38] PROBLEM - Check systemd state on cp2036 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_nagios-nrpe-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:14] (03Merged) 10jenkins-bot: Update cxserver to 2022-04-21-081331-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/785120 (https://phabricator.wikimedia.org/T305115) (owner: 10KartikMistry) [11:22:44] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:17] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P25948 and previous config saved to /var/cache/conftool/dbconfig/20220421-112648-root.json [11:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:02] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:02] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: 
After maintenance', diff saved to https://phabricator.wikimedia.org/P25949 and previous config saved to /var/cache/conftool/dbconfig/20220421-112843-root.json [11:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:54] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:54] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:36] !log Updated cxserver to 2022-04-21-081331-production (T287655, T304855, T304862, T304866, T305115) [11:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:47] T287655: Generate template parameter alignments for en > de wikis - https://phabricator.wikimedia.org/T287655 [11:34:47] T304862: Enable Content and Section Translation for Basque Wikipedia - https://phabricator.wikimedia.org/T304862 [11:34:47] T304855: Enable Content and Section Translation for Czech Wikipedia - https://phabricator.wikimedia.org/T304855 [11:34:47] T305115: Generate template parameter alignments for wikis (April-June) - https://phabricator.wikimedia.org/T305115 [11:34:48] T304866: Enable Content and Section Translation for Central Kurdish Wikipedia - https://phabricator.wikimedia.org/T304866 [11:35:23] !log installing zlib security updates on stretch (buster/bullseye already fixed) [11:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:15] PROBLEM - Varnish frontend child restarted on cp2036 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp2036&var-datasource=codfw+prometheus/ops [11:39:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: 
Maintenance [11:39:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25950 and previous config saved to /var/cache/conftool/dbconfig/20220421-114112-ladsgroup.json [11:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:43:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P25951 and previous config saved to /var/cache/conftool/dbconfig/20220421-114347-root.json [11:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:43] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:56:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25952 and previous config saved to /var/cache/conftool/dbconfig/20220421-115617-ladsgroup.json [11:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P25953 and 
previous config saved to /var/cache/conftool/dbconfig/20220421-115851-root.json [11:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:59:15] RECOVERY - Check systemd state on cp2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:53] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) My suspicion is that these workers need more CPU and/or memory. We recently doubled the number of replica... [12:01:33] (03PS1) 10Jelto: site: use appserver in codfw C3, cleanup duplicate insetup definition [puppet] - 10https://gerrit.wikimedia.org/r/785147 (https://phabricator.wikimedia.org/T290192) [12:06:02] 10SRE, 10Traffic, 10Patch-For-Review: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 (10ssingh) p:05Triage→03Medium [12:08:44] (03CR) 10Jelto: "mw2412 to mw2419 have role insetup and are not pooled. 
They should have a proper role assigned because racking and os installation happene" [puppet] - 10https://gerrit.wikimedia.org/r/785147 (https://phabricator.wikimedia.org/T290192) (owner: 10Jelto) [12:10:54] !log installing subversion security updates [12:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25954 and previous config saved to /var/cache/conftool/dbconfig/20220421-121122-ladsgroup.json [12:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:21] (03CR) 10Vivian Rook: [C: 03+2] Add Vivian Rook to icinga [puppet] - 10https://gerrit.wikimedia.org/r/784742 (owner: 10Vivian Rook) [12:13:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P25955 and previous config saved to /var/cache/conftool/dbconfig/20220421-121355-root.json [12:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:06] (03PS2) 10Jelto: site: use appserver in codfw C3, cleanup duplicate insetup definition [puppet] - 10https://gerrit.wikimedia.org/r/785147 (https://phabricator.wikimedia.org/T290192) [12:16:35] (03PS1) 10Muehlenhoff: No longer install subversion on Phabricator hosts [puppet] - 10https://gerrit.wikimedia.org/r/785149 [12:19:53] (03PS1) 10Btullis: Increase the RAM request and limit for eventgate pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/785151 (https://phabricator.wikimedia.org/T306181) [12:20:30] !log installing openjpeg2 security updates [12:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:07] Moritzm: svn is used [12:23:09] https://phabricator.wikimedia.org/diffusion/query/NCbVBYAxI8aR/#R [12:23:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db2105.codfw.wmnet with reason: Maintenance [12:24:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:24:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [12:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [12:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:26] (03CR) 10RhinosF1: "uses https://phabricator.wikimedia.org/diffusion/query/NCbVBYAxI8aR/#R" [puppet] - 10https://gerrit.wikimedia.org/r/785149 (owner: 10Muehlenhoff) [12:24:28] (03CR) 10Btullis: "I'm hoping that this RAM increase will reduce the error rate affecting intake-analytics.wikimedia.org as well as reduce the apparent laten" [deployment-charts] - 10https://gerrit.wikimedia.org/r/785151 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [12:24:39] RhinosF1: oh, I missed that! I'll abandon, then. 
thanks [12:25:01] !log installing flac security updates [12:25:04] moritzm: np [12:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25956 and previous config saved to /var/cache/conftool/dbconfig/20220421-122627-ladsgroup.json [12:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:27:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [12:27:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [12:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25957 and previous config saved to /var/cache/conftool/dbconfig/20220421-122722-ladsgroup.json [12:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P25958 and previous config saved to /var/cache/conftool/dbconfig/20220421-122859-root.json [12:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:21] !log installing fribidi security updates [12:30:24] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25959 and previous config saved to /var/cache/conftool/dbconfig/20220421-123347-ladsgroup.json [12:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:34:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thumbor2003.codfw.wmnet [12:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2003.codfw.wmnet [12:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thumbor2004.codfw.wmnet [12:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25960 and previous config saved to /var/cache/conftool/dbconfig/20220421-124852-ladsgroup.json [12:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:38] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:50:55] (03CR) 10Ayounsi: Fix up host globbing for ping servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/785125 (owner: 10Muehlenhoff) [12:55:07] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host thumbor2004.codfw.wmnet [12:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:30] (03CR) 10Muehlenhoff: Fix up host globbing for ping servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/785125 (owner: 10Muehlenhoff) [12:55:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thumbor2005.codfw.wmnet [12:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:42] (03Abandoned) 10Muehlenhoff: No longer install subversion on Phabricator hosts [puppet] - 10https://gerrit.wikimedia.org/r/785149 (owner: 10Muehlenhoff) [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:20] let me steal the window [13:00:33] (03PS2) 10Urbanecm: plwiki: Fix cascading protection configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784619 (https://phabricator.wikimedia.org/T306300) [13:00:38] (03CR) 10Urbanecm: [C: 03+2] plwiki: Fix cascading protection configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784619 (https://phabricator.wikimedia.org/T306300) (owner: 10Urbanecm) [13:01:21] (03Merged) 10jenkins-bot: plwiki: Fix cascading protection configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784619 (https://phabricator.wikimedia.org/T306300) (owner: 10Urbanecm) [13:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:02:47] !log restart ats-be and varnish-fe on cp2036 to clear restarted service alerts [13:02:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7d5114e80567663cad7415e985fdb8191ef9d4b6: plwiki: Fix cascading protection configuration (T306300) (duration: 00m 55s) [13:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:36] * urbanecm done [13:03:36] T306300: Fix $wgCascadingRestrictionLevels for plwiki - https://phabricator.wikimedia.org/T306300 [13:03:50] RECOVERY - Varnish frontend child restarted on cp2036 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp2036&var-datasource=codfw+prometheus/ops [13:03:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25961 and previous config saved to /var/cache/conftool/dbconfig/20220421-130357-ladsgroup.json [13:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:32] RECOVERY - traffic_server backend process restarted on cp2036 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2036&var-layer=backend [13:04:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2005.codfw.wmnet [13:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:45] (03Restored) 10Dzahn: No longer install subversion on Phabricator hosts [puppet] - 10https://gerrit.wikimedia.org/r/785149 (owner: 10Muehlenhoff) [13:08:47] (03CR) 10Dzahn: "Let me use this as a reminder to find out if we can remove both. I was currently trying to shutdown git repos on Phabricator anyways. 
So w" [puppet] - 10https://gerrit.wikimedia.org/r/785149 (owner: 10Muehlenhoff) [13:08:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:08:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:08:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thumbor2006.codfw.wmnet [13:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10cmooney) Thanks @Cmjohnson Note the IP addresses assigned to the servers need to be updated to match those vlans. 
[13:14:18] (03PS1) 10Jcrespo: mediabackup: Preconfigure mc client config on worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/785156 (https://phabricator.wikimedia.org/T305446) [13:14:35] (03PS2) 10Jcrespo: mediabackup: Preconfigure mc client config on worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/785156 (https://phabricator.wikimedia.org/T305446) [13:15:37] (03PS3) 10Jcrespo: mediabackup: Preconfigure mc client config on worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/785156 (https://phabricator.wikimedia.org/T305446) [13:17:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [13:17:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [13:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25962 and previous config saved to /var/cache/conftool/dbconfig/20220421-131713-ladsgroup.json [13:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:19:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25963 and previous config saved to /var/cache/conftool/dbconfig/20220421-131902-ladsgroup.json [13:19:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with 
reason: Maintenance [13:19:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2006.codfw.wmnet [13:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:23:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:23:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [13:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [13:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:25:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:22] 
10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10netops: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (10Volans) Thanks for opening the task to discuss details. As the first feedback I've a primary question that is how you envision this new third way... [13:26:32] jouncebot: nowandnext [13:26:32] For the next 0 hour(s) and 33 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T1300) [13:26:32] In 2 hour(s) and 33 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T1600) [13:27:06] ah, perfect - /me deploying an interwiki cache update [13:27:16] (03PS4) 10Jcrespo: mediabackup: Preconfigure mc client config on worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/785156 (https://phabricator.wikimedia.org/T305446) [13:27:24] 10SRE, 10conftool: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 (10CDanis) p:05Triage→03High [13:27:25] (03PS5) 10Jcrespo: mediabackup: Preconfigure mc client config on worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/785156 (https://phabricator.wikimedia.org/T305446) [13:27:30] 10SRE, 10SRE-OnFire, 10observability, 10I18n: Internationalization (i18n) & localization (l10n) of www.wikimediastatus.net - https://phabricator.wikimedia.org/T305896 (10CDanis) p:05Triage→03Medium [13:28:33] (03PS1) 10Majavah: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785157 [13:28:46] (03CR) 10Majavah: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785157 (owner: 10Majavah) [13:29:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [13:29:29] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785157 (owner: 10Majavah) [13:29:30] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [13:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25964 and previous config saved to /var/cache/conftool/dbconfig/20220421-132935-ladsgroup.json [13:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:30:54] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Preconfigure mc client config on worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/785156 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [13:31:39] !log taavi@deploy1002 Synchronized wmf-config/interwiki.php: Config: [[gerrit:785157|Update interwiki cache]] (duration: 00m 51s) [13:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25965 and previous config saved to /var/cache/conftool/dbconfig/20220421-133204-ladsgroup.json [13:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:38] * taavi done [13:34:14] (03PS4) 10Elukey: Add four new k8s worker nodes to ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/784701 (https://phabricator.wikimedia.org/T306545) [13:34:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: 
apply [13:34:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:31] (03CR) 10Ayounsi: [C: 03+1] Fix up host globbing for ping servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/785125 (owner: 10Muehlenhoff) [13:36:27] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34945/console" [puppet] - 10https://gerrit.wikimedia.org/r/784701 (https://phabricator.wikimedia.org/T306545) (owner: 10Elukey) [13:38:04] (03CR) 10Elukey: [V: 03+1 C: 03+2] Add four new k8s worker nodes to ml-serve-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/784701 (https://phabricator.wikimedia.org/T306545) (owner: 10Elukey) [13:40:08] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10netops: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (10ayounsi) Yeah, I'm expecting Netbox to always be the source of truth so a homer run after a spicerack run would be a NOOP. `junos-eznc` is what I... 
[13:45:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thumbor1001.eqiad.wmnet [13:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:52] (03PS2) 10Muehlenhoff: Fix up host globbing for ping servers [puppet] - 10https://gerrit.wikimedia.org/r/785125 [13:46:00] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:49:22] (03PS1) 10Jcrespo: mediabackups: Fix formatting and syntax error on mc config template [puppet] - 10https://gerrit.wikimedia.org/r/785161 (https://phabricator.wikimedia.org/T305446) [13:50:05] (03PS2) 10Jcrespo: mediabackups: Fix formatting and syntax error on mc config template [puppet] - 10https://gerrit.wikimedia.org/r/785161 (https://phabricator.wikimedia.org/T305446) [13:50:40] (03CR) 10jerkins-bot: [V: 04-1] mediabackups: Fix formatting and syntax error on mc config template [puppet] - 10https://gerrit.wikimedia.org/r/785161 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [13:54:40] !log powercycling thumbor1001, stuck on reboot [13:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:43] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1019.mgmt.eqiad.wmnet with reboot policy FORCED [13:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:46] (03PS3) 10Jcrespo: mediabackups: Fix formatting and syntax error on mc config template [puppet] - 
10https://gerrit.wikimedia.org/r/785161 (https://phabricator.wikimedia.org/T305446) [13:55:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1020.mgmt.eqiad.wmnet with reboot policy FORCED [13:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25966 and previous config saved to /var/cache/conftool/dbconfig/20220421-135621-ladsgroup.json [13:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:58:13] (03PS4) 10Jcrespo: mediabackups: Fix formatting and syntax error on mc config template [puppet] - 10https://gerrit.wikimedia.org/r/785161 (https://phabricator.wikimedia.org/T305446) [13:58:24] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1120.eqiad.wmnet with reason: Rebooting for T303174 [13:58:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1120.eqiad.wmnet with reason: Rebooting for T303174 [13:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:31] !log kormat@cumin1001 dbctl commit (dc=all): 'db1120 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25967 and previous config saved to /var/cache/conftool/dbconfig/20220421-135831-kormat.json [13:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:53] (03PS1) 10Gergő Tisza: [beta] 
WelcomeSurveyExperimentalGroups: Use enwiki instead of eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785165 (https://phabricator.wikimedia.org/T303240) [13:59:15] (03PS2) 10Gergő Tisza: [beta] WelcomeSurveyExperimentalGroups: Use enwiki instead of eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785165 (https://phabricator.wikimedia.org/T303240) [13:59:39] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Fix formatting and syntax error on mc config template [puppet] - 10https://gerrit.wikimedia.org/r/785161 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [13:59:49] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] add dummy password for cloudinfra token validator [labs/private] - 10https://gerrit.wikimedia.org/r/785128 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [14:02:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1001.eqiad.wmnet [14:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:54] (03CR) 10Gergő Tisza: [C: 03+2] [beta] WelcomeSurveyExperimentalGroups: Use enwiki instead of eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785165 (https://phabricator.wikimedia.org/T303240) (owner: 10Gergő Tisza) [14:03:02] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1152.eqiad.wmnet with reason: Rebooting for T303174 [14:03:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1152.eqiad.wmnet with reason: Rebooting for T303174 [14:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:09] !log kormat@cumin1001 dbctl commit (dc=all): 'db1152 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25968 and previous config saved to /var/cache/conftool/dbconfig/20220421-140309-kormat.json [14:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:13] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:33] (03Merged) 10jenkins-bot: [beta] WelcomeSurveyExperimentalGroups: Use enwiki instead of eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785165 (https://phabricator.wikimedia.org/T303240) (owner: 10Gergő Tisza) [14:05:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thumbor1002.eqiad.wmnet [14:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:20] !log kormat@cumin1001 dbctl commit (dc=all): 'db1152 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25969 and previous config saved to /var/cache/conftool/dbconfig/20220421-140719-kormat.json [14:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:09:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1020.mgmt.eqiad.wmnet with reboot policy FORCED [14:10:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1019.mgmt.eqiad.wmnet with reboot policy FORCED [14:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P25971 and previous config saved to /var/cache/conftool/dbconfig/20220421-141126-ladsgroup.json [14:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:14] PROBLEM - Host ml-serve1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:18] PROBLEM - Host ml-serve1007 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:27] this is me, downtime expired, new nodes --^ [14:13:42] RECOVERY - Host ml-serve1006 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [14:13:58] PROBLEM - Host ml-serve1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:16] RECOVERY - Host ml-serve1007 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:15:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1002.eqiad.wmnet [14:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:38] RECOVERY - Host ml-serve1008 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [14:16:21] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10jhathaway) p:05Triage→03Medium [14:16:26] (03PS3) 10Jcrespo: Revert "dumps: Block python requests UA" [puppet] - 10https://gerrit.wikimedia.org/r/784715 [14:16:28] (03PS1) 10Jcrespo: mediabackup: Hide diffs from mc config file [puppet] - 10https://gerrit.wikimedia.org/r/785166 (https://phabricator.wikimedia.org/T305446) [14:16:37] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10jhathaway) a:03jhathaway [14:16:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thumbor1005.eqiad.wmnet [14:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:21] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:17:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25972 and previous config saved to /var/cache/conftool/dbconfig/20220421-141727-ladsgroup.json [14:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:18:10] (03CR) 10JHathaway: [C: 03+1] Revert "dumps: Block python requests UA" [puppet] - 10https://gerrit.wikimedia.org/r/784715 (owner: 10Jcrespo) [14:18:48] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:19:26] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Hide diffs from mc config file [puppet] - 10https://gerrit.wikimedia.org/r/785166 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [14:19:32] (03PS2) 10Jcrespo: mediabackup: Hide diffs from mc config file [puppet] - 10https://gerrit.wikimedia.org/r/785166 (https://phabricator.wikimedia.org/T305446) [14:20:44] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:22:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db1152 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25973 and previous config saved to /var/cache/conftool/dbconfig/20220421-142223-kormat.json [14:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:41] (03PS1) 10Huji: Re-enable article editing by anonymous users on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784718 [14:22:52] (03CR) 10jerkins-bot: [V: 04-1] Re-enable article editing by anonymous users on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784718 (owner: 10Huji) [14:24:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25974 and previous config saved to /var/cache/conftool/dbconfig/20220421-142413-ladsgroup.json [14:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:25:46] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1117.eqiad.wmnet with reason: Rebooting for T303174 [14:25:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1117.eqiad.wmnet with reason: Rebooting for T303174 [14:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:16] (03PS2) 10Huji: Re-enable article editing by anonymous users on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784718 
(https://phabricator.wikimedia.org/T292781) [14:26:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1005.eqiad.wmnet [14:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P25975 and previous config saved to /var/cache/conftool/dbconfig/20220421-142631-ladsgroup.json [14:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thumbor1006.eqiad.wmnet [14:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:55] ACKNOWLEDGEMENT - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Kormat Rebooting db1117 https://wikitech.wikimedia.org/wiki/HAProxy [14:27:55] ACKNOWLEDGEMENT - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Kormat Rebooting db1117 https://wikitech.wikimedia.org/wiki/HAProxy [14:27:55] ACKNOWLEDGEMENT - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Kormat Rebooting db1117 https://wikitech.wikimedia.org/wiki/HAProxy [14:28:19] (03CR) 10Ottomata: "Hmm, in this case, would it not be better to increase these values in the helmfile service values files, rather than the chart defaults?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/785151 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [14:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:33:55] jouncebot: nowandnext [14:33:55] No deployments scheduled for the next 1 hour(s) and 26 minute(s) [14:33:56] In 1 hour(s) and 26 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T1600) [14:34:10] urbanecm: gonna deploy it? [14:34:13] I can deploy it [14:34:21] Amir1: yeah, the fawiki revert [14:34:30] (03CR) 10Ladsgroup: [C: 03+2] Re-enable article editing by anonymous users on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784718 (https://phabricator.wikimedia.org/T292781) (owner: 10Huji) [14:34:36] I do it, don't worry [14:34:39] thanks [14:35:13] (03Merged) 10jenkins-bot: Re-enable article editing by anonymous users on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784718 (https://phabricator.wikimedia.org/T292781) (owner: 10Huji) [14:35:24] (03CR) 10Btullis: Increase the RAM request and limit for eventgate pods (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/785151 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [14:36:58] (03CR) 10Dave Pifke: [C: 03+1] Extend Ferm rules for new webperf hosts [puppet] - 10https://gerrit.wikimedia.org/r/785117 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [14:36:59] !log ladsgroup@deploy1002 Synchronized wmf-config: Config: [[gerrit:784718|Re-enable article editing by anonymous users on fawiki (T292781)]] (duration: 00m 51s) [14:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:06] T292781: Measure impact of requiring login to
edit articles on Persian Wikipedia - https://phabricator.wikimedia.org/T292781 [14:37:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1006.eqiad.wmnet [14:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:27] !log kormat@cumin1001 dbctl commit (dc=all): 'db1152 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25976 and previous config saved to /var/cache/conftool/dbconfig/20220421-143727-kormat.json [14:37:30] (03PS2) 10Btullis: Increase the RAM request and limit for eventgate pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/785151 (https://phabricator.wikimedia.org/T306181) [14:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] (03CR) 10Dave Pifke: [C: 03+1] Apply role::webperf::processors_and_site to webperf1003/2003 [puppet] - 10https://gerrit.wikimedia.org/r/785115 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [14:37:40] * Amir1 afks [14:38:12] (03CR) 10Dave Pifke: [C: 03+1] Switch webperf1001/1003 for eventual removal [puppet] - 10https://gerrit.wikimedia.org/r/785116 (https://phabricator.wikimedia.org/T205460) (owner: 10Muehlenhoff) [14:39:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25977 and previous config saved to /var/cache/conftool/dbconfig/20220421-143918-ladsgroup.json [14:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:40:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:40:13] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25978 and previous config saved to /var/cache/conftool/dbconfig/20220421-144137-ladsgroup.json [14:41:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:41:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:41:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25979 and previous config saved to /var/cache/conftool/dbconfig/20220421-144145-ladsgroup.json [14:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:03] (03CR) 10Ottomata: [C: 03+1] "You could do eventgate-logging-external as well if you like. Either way!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/785151 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [14:48:30] (03CR) 10Muehlenhoff: [C: 03+2] Fix up host globbing for ping servers [puppet] - 10https://gerrit.wikimedia.org/r/785125 (owner: 10Muehlenhoff) [14:52:12] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) At the moment we are getting between ~30 and ~60 requests receiving 503 responses pe... [14:52:31] !log kormat@cumin1001 dbctl commit (dc=all): 'db1152 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25980 and previous config saved to /var/cache/conftool/dbconfig/20220421-145231-kormat.json [14:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1153.eqiad.wmnet with reason: Rebooting for T303174 [14:52:58] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1153.eqiad.wmnet with reason: Rebooting for T303174 [14:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:04] !log kormat@cumin1001 dbctl commit (dc=all): 'db1153 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25981 and previous config saved to /var/cache/conftool/dbconfig/20220421-145303-kormat.json [14:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25982 and previous config saved to /var/cache/conftool/dbconfig/20220421-145424-ladsgroup.json [14:54:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:11] (03CR) 10Ssingh: P:wikidough: add a check to ensure service has been restarted (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/784697 (owner: 10Ssingh) [14:56:41] (03PS4) 10Ssingh: P:wikidough: add a check to ensure service has been restarted [puppet] - 10https://gerrit.wikimedia.org/r/784697 [14:57:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25983 and previous config saved to /var/cache/conftool/dbconfig/20220421-145758-ladsgroup.json [14:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:59:15] !log kormat@cumin1001 dbctl commit (dc=all): 'db1153 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25984 and previous config saved to /var/cache/conftool/dbconfig/20220421-145914-kormat.json [14:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:08:13] (03CR) 10Muehlenhoff: [C: 03+2] Extend Ferm rules for new webperf hosts [puppet] - 10https://gerrit.wikimedia.org/r/785117 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [15:09:18] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25985 and previous config saved to /var/cache/conftool/dbconfig/20220421-150929-ladsgroup.json [15:09:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:09:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:09:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25986 and previous config saved to /var/cache/conftool/dbconfig/20220421-150937-ladsgroup.json [15:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [15:09:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [15:09:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [15:09:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [15:09:54] !log 
cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [15:09:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [15:09:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [15:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [15:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:04] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1143.eqiad.wmnet with OS buster [15:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [15:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [15:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:17] 10SRE, 10ops-eqiad, 10DC-Ops, 
10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [15:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [15:10:25] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1143.eqiad.wmnet with OS buster [15:10:26] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1143.eqiad.wmnet with OS buster [15:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:29] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1143.eqiad.wmnet with OS buster [15:10:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [15:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:34] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1143.eqiad.wmnet with OS buster [15:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) 
rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [15:10:38] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1143.eqiad.wmnet with OS buster [15:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:41] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1143.eqiad.wmnet with OS buster [15:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster exec... [15:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster exec... [15:11:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster exec... 
[15:11:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster exec... [15:11:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster exec... [15:11:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster exec... [15:11:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster exec... 
[15:11:30] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS buster [15:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster [15:11:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [15:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [15:11:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1144.eqiad.wmnet with OS buster [15:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster [15:12:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster [15:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster [15:12:27] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1146.eqiad.wmnet with OS buster [15:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster exec... [15:12:37] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1145.eqiad.wmnet with OS buster [15:12:40] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1144.eqiad.wmnet with OS buster [15:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster exec... [15:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster exec... 
[15:13:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P25987 and previous config saved to /var/cache/conftool/dbconfig/20220421-151303-ladsgroup.json [15:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:27] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1144.eqiad.wmnet with OS buster [15:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster [15:13:46] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS buster [15:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster [15:14:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster [15:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:19] !log kormat@cumin1001 dbctl commit (dc=all): 'db1153 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25988 and previous config saved to /var/cache/conftool/dbconfig/20220421-151418-kormat.json [15:14:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster [15:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25989 and previous config saved to /var/cache/conftool/dbconfig/20220421-151610-ladsgroup.json [15:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:20:54] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:28:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P25990 and previous config saved to /var/cache/conftool/dbconfig/20220421-152809-ladsgroup.json [15:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:22] !log kormat@cumin1001 dbctl commit (dc=all): 'db1153 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25991 and previous config saved to /var/cache/conftool/dbconfig/20220421-152922-kormat.json [15:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:32] (03CR) 10Btullis: [C: 03+2] Increase the RAM request and limit for eventgate pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/785151 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [15:31:15] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P25992 and previous config saved to /var/cache/conftool/dbconfig/20220421-153115-ladsgroup.json [15:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:51] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [15:33:53] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [15:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:01] (03Merged) 10jenkins-bot: Increase the RAM request and limit for eventgate pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/785151 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [15:36:03] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [15:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:37] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [15:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:36] 10SRE, 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [15:37:42] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [15:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:40] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [15:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:49] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [15:39:50] !log 
cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1143.eqiad.wmnet with OS buster [15:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster exec... [15:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:35] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [15:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:38] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1144.eqiad.wmnet with OS buster [15:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster exec... [15:41:57] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1145.eqiad.wmnet with OS buster [15:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster exec... 
[15:42:24] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1146.eqiad.wmnet with OS buster [15:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster exec... [15:42:37] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I have increased the amount of RAM available to the eventgate-analytics-external dep... [15:43:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25993 and previous config saved to /var/cache/conftool/dbconfig/20220421-154314-ladsgroup.json [15:43:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:43:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:26] !log kormat@cumin1001 
dbctl commit (dc=all): 'db1153 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25994 and previous config saved to /var/cache/conftool/dbconfig/20220421-154426-kormat.json [15:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P25995 and previous config saved to /var/cache/conftool/dbconfig/20220421-154620-ladsgroup.json [15:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:05] 10SRE, 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [15:49:34] 10SRE, 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [15:52:53] 10SRE, 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [15:53:28] 10SRE, 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) >>! In T304712#7857483, @ayounsi wrote: > Thanks! > > I like your idea of putting the capacity in the table, I added dedicated columns for it. > > Note that I don't know if there is enough total... [15:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:00:04] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:01:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25996 and previous config saved to /var/cache/conftool/dbconfig/20220421-160125-ladsgroup.json [16:01:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:01:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:32] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:01:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25997 and previous config saved to /var/cache/conftool/dbconfig/20220421-160133-ladsgroup.json [16:01:34] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [16:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:56] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Cmjohnson) @arzhel fixed the reboot issue, the external disk attached to the router was causing the reboots. I updated JUNOS to junos-srxsme-20.2R3-S2.... 
[16:06:21] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [16:08:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25998 and previous config saved to /var/cache/conftool/dbconfig/20220421-160804-ladsgroup.json [16:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:17:05] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1120.eqiad.wmnet with reason: Rebooting for T303174 [16:17:07] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1120.eqiad.wmnet with reason: Rebooting for T303174 [16:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:39] !log kormat@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25999 and previous config saved to /var/cache/conftool/dbconfig/20220421-162039-kormat.json [16:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26000 and previous config saved to /var/cache/conftool/dbconfig/20220421-162309-ladsgroup.json [16:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:51] SRE, SRE-OnFire (FY2021/2022-Q3), Data-Engineering, Event-Platform, and 2 
others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (Krinkle) [16:30:06] SRE, SRE-OnFire (FY2021/2022-Q3), Data-Engineering, Event-Platform, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (Krinkle) [16:30:17] (PS1) Andrew Bogott: Renumber ns-recursor[0,1].openstack.codfw1dev.wikimediacloud.org [dns] - https://gerrit.wikimedia.org/r/785182 [16:30:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [16:30:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [16:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P26001 and previous config saved to /var/cache/conftool/dbconfig/20220421-163031-ladsgroup.json [16:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:39] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:31:23] (CR) Andrew Bogott: "Now that a lot of this is delegated to netbox I'm not sure how to avoid reusing an IP that's already being used by netbox. Please advise!" 
[dns] - https://gerrit.wikimedia.org/r/785182 (owner: Andrew Bogott) [16:34:54] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): hw troubleshooting: memory error for elastic1097 - https://phabricator.wikimedia.org/T306449 (Cmjohnson) DIMM has been shipped [16:35:43] !log kormat@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26002 and previous config saved to /var/cache/conftool/dbconfig/20220421-163543-kormat.json [16:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:07] (CR) Volans: "replies inline, no blockers for me" [puppet] - https://gerrit.wikimedia.org/r/784697 (owner: Ssingh) [16:38:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26003 and previous config saved to /var/cache/conftool/dbconfig/20220421-163814-ladsgroup.json [16:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:24] !log replace mr1-eqiad - T294474 [16:43:24] Sorry, you are not authorized to perform this [16:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:29] T294474: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 [16:43:54] wm-bot: wut? 
[16:45:04] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1015.mgmt.eqiad.wmnet with reboot policy FORCED [16:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:13] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Cmjohnson) a:Jclark-ctr→Cmjohnson [16:50:47] !log kormat@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26004 and previous config saved to /var/cache/conftool/dbconfig/20220421-165047-kormat.json [16:50:49] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Cmjohnson) [16:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P26005 and previous config saved to /var/cache/conftool/dbconfig/20220421-165319-ladsgroup.json [16:53:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:53:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:53:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis 
- https://phabricator.wikimedia.org/T298565 [16:53:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P26006 and previous config saved to /var/cache/conftool/dbconfig/20220421-165333-ladsgroup.json [16:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:48] PROBLEM - Host ms-be1053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:53:54] PROBLEM - Host db1140.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:53:54] PROBLEM - Host db1139.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:54:26] PROBLEM - SSH on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:54:40] PROBLEM - Host relforge1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:54:42] PROBLEM - Host relforge1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:55:14] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:55:18] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [16:56:08] PROBLEM - Host ms-be1056.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:56:12] PROBLEM - Host ms-be1052.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:56:12] PROBLEM - Host ms-be1055.mgmt is DOWN: PING 
CRITICAL - Packet loss = 100% [16:56:12] PROBLEM - Host ms-be1058.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:56:12] PROBLEM - Host ms-be1051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:56:12] PROBLEM - Host ms-be1054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:56:13] PROBLEM - Host ms-be1057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:56:24] PROBLEM - Host ms-be1059.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:56:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [16:56:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [16:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T306560)', diff saved to https://phabricator.wikimedia.org/P26007 and previous config saved to /var/cache/conftool/dbconfig/20220421-165635-ladsgroup.json [16:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:42] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [16:57:23] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (ayounsi) Swap has been done successfully! Left to do: wipe the old one, rename the console server port of the new one. [16:58:16] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Cmjohnson) loaded, configuration file verified working moved cables to new mr1-eqiad left scs connection to old mr1 to wipe, still requires scs connecti... 
[16:59:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T306560)', diff saved to https://phabricator.wikimedia.org/P26008 and previous config saved to /var/cache/conftool/dbconfig/20220421-165946-ladsgroup.json [16:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:05:51] !log kormat@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26009 and previous config saved to /var/cache/conftool/dbconfig/20220421-170551-kormat.json [17:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P26010 and previous config saved to /var/cache/conftool/dbconfig/20220421-170959-ladsgroup.json [17:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:14:28] RECOVERY - SSH on ms-fe1012 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:14:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26011 and previous config saved to /var/cache/conftool/dbconfig/20220421-171451-ladsgroup.json [17:14:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:16] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Swift [17:15:20] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [17:21:03] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) Summary update from multiple out of band support emails and conversations with our Dell account team: * confirmed that the missing/spin down doesn't work for them either, and chipset manufacturer... [17:25:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26012 and previous config saved to /var/cache/conftool/dbconfig/20220421-172504-ladsgroup.json [17:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:18] SRE-Access-Requests: WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (lmata) [17:29:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26013 and previous config saved to /var/cache/conftool/dbconfig/20220421-172956-ladsgroup.json [17:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P26015 and previous config saved to /var/cache/conftool/dbconfig/20220421-173046-ladsgroup.json [17:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, 
user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:40:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26016 and previous config saved to /var/cache/conftool/dbconfig/20220421-174009-ladsgroup.json [17:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T306560)', diff saved to https://phabricator.wikimedia.org/P26017 and previous config saved to /var/cache/conftool/dbconfig/20220421-174501-ladsgroup.json [17:45:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [17:45:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [17:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:07] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:45:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P26018 and previous config saved to /var/cache/conftool/dbconfig/20220421-174509-ladsgroup.json [17:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P26019 and previous config saved to /var/cache/conftool/dbconfig/20220421-174551-ladsgroup.json [17:45:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:00] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:51:01] (CR) Ssingh: P:wikidough: add a check to ensure service has been restarted (3 comments) [puppet] - https://gerrit.wikimedia.org/r/784697 (owner: Ssingh) [17:52:37] (PS5) Ssingh: P:wikidough: add a check to ensure service has been restarted [puppet] - https://gerrit.wikimedia.org/r/784697 [17:53:31] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34946/console" [puppet] - https://gerrit.wikimedia.org/r/784697 (owner: Ssingh) [17:55:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P26020 and previous config saved to /var/cache/conftool/dbconfig/20220421-175514-ladsgroup.json [17:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:21] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:00:04] jeena and brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T1800). [18:00:13] o/ [18:00:33] time to deploy [18:00:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P26021 and previous config saved to /var/cache/conftool/dbconfig/20220421-180056-ladsgroup.json [18:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:41] (PS1) Jeena Huneidi: all wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - https://gerrit.wikimedia.org/r/785187 [18:01:43] (CR) Jeena Huneidi: [C: +2] all wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - https://gerrit.wikimedia.org/r/785187 (owner: Jeena Huneidi) [18:02:26] (Merged) jenkins-bot: all wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - https://gerrit.wikimedia.org/r/785187 (owner: Jeena Huneidi) [18:03:43] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.8 refs T305214 [18:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:49] T305214: 1.39.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T305214 [18:04:53] SRE, SRE-Access-Requests: WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (lmata) [18:07:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:07:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:07:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:19] SRE, SRE-Access-Requests: WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (lmata) @Volans, please correct or amend this task if I missed anything. @wiki_willy we will need your approval as @Jclark-ctr 's manager @MoritzMuehlenhoff: adding you for awareness and feedba... [18:08:38] (PS1) Ssingh: monitoring_service: specify units for configuration attributes [puppet] - https://gerrit.wikimedia.org/r/785188 [18:08:57] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (lmata) [18:09:47] (CR) Ssingh: [V: +1] "[Do not merge till Monday]. Thanks to BBlack, Daniel Zahn, and Volans for the review!" [puppet] - https://gerrit.wikimedia.org/r/784697 (owner: Ssingh) [18:11:12] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (MoritzMuehlenhoff) >>! In T306654#7872413, @lmata wrote: > @MoritzMuehlenhoff: adding you for awareness and feedback. Yes, it sounds good to me... [18:15:35] SRE, Data-Engineering, Data-Engineering-Kanban, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) The RAM upgrade has not resulted in any improvement. 
{F35061891,width=60%} [18:15:59] (PS1) Dzahn: gitlab: ensure home dir for runner_user exists when running as non-root [puppet] - https://gerrit.wikimedia.org/r/785189 (https://phabricator.wikimedia.org/T297659) [18:16:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P26022 and previous config saved to /var/cache/conftool/dbconfig/20220421-181601-ladsgroup.json [18:16:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [18:16:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [18:16:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:16:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P26023 and previous config saved to /var/cache/conftool/dbconfig/20220421-181614-ladsgroup.json [18:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:22] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:49] (CR) Andrew Bogott: [C: -1] "Need to reserve IPs ahead of time according to https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_I" [dns] - https://gerrit.wikimedia.org/r/785182 (owner: Andrew Bogott) [18:17:55] (PS2) Dzahn: gitlab: ensure home dir for runner_user exists when running as non-root [puppet] - https://gerrit.wikimedia.org/r/785189 (https://phabricator.wikimedia.org/T297659) [18:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:35:42] SRE, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Add linecard diversity to the router-to-router interconnect in codfw - https://phabricator.wikimedia.org/T248506 (Krinkle) [18:35:51] (CR) Dzahn: [C: +1] "checked this. the intervals can be in the form "1s" to mean actually 1 second or just "1" when it means 1 minute. yep!" 
[puppet] - https://gerrit.wikimedia.org/r/785188 (owner: Ssingh) [18:37:36] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/34947/gitlab-runner2001.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/785189 (https://phabricator.wikimedia.org/T297659) (owner: Dzahn) [18:38:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P26024 and previous config saved to /var/cache/conftool/dbconfig/20220421-183807-ladsgroup.json [18:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:43:48] (PS1) Dzahn: Revert "Revert "Revert "gitlab: temp set gitlab-runner user to root for bootstrapping gitlab-runner2001""" [puppet] - https://gerrit.wikimedia.org/r/784723 [18:45:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P26025 and previous config saved to /var/cache/conftool/dbconfig/20220421-184523-ladsgroup.json [18:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:29] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:46:28] SRE, observability, Patch-For-Review, SRE Observability (FY2021/2022-Q3), and 2 others: Fix unquoted URL parameters in Icgina health checks - https://phabricator.wikimedia.org/T304323 (Krinkle) [18:47:59] (CR) Dzahn: [C: +2] Revert "Revert "Revert "gitlab: temp set gitlab-runner user to root for bootstrapping gitlab-runner2001""" [puppet] - 
https://gerrit.wikimedia.org/r/784723 (owner: Dzahn) [18:49:24] (CR) Ssingh: monitoring_service: specify units for configuration attributes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/785188 (owner: Ssingh) [18:49:33] (CR) Ssingh: [C: +2] monitoring_service: specify units for configuration attributes [puppet] - https://gerrit.wikimedia.org/r/785188 (owner: Ssingh) [18:53:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P26026 and previous config saved to /var/cache/conftool/dbconfig/20220421-185312-ladsgroup.json [18:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:23] SRE-tools, DBA, Infrastructure-Foundations, Sustainability (Incident Followup), User-Ladsgroup: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Krinkle) [19:00:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P26027 and previous config saved to /var/cache/conftool/dbconfig/20220421-190029-ladsgroup.json [19:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:07:36] (PS1) Dzahn: gitlab_runner: use config_path variable when creating config file [puppet] - https://gerrit.wikimedia.org/r/785191 (https://phabricator.wikimedia.org/T297659) [19:07:49] (PS1) Bking: Revert "elastic: increase recovery time" [cookbooks] - https://gerrit.wikimedia.org/r/784724 [19:08:07] (PS2) Dzahn: gitlab_runner: use config_path 
variable when creating config file [puppet] - https://gerrit.wikimedia.org/r/785191 (https://phabricator.wikimedia.org/T297659) [19:08:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P26028 and previous config saved to /var/cache/conftool/dbconfig/20220421-190817-ladsgroup.json [19:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:21] !log set index.unassigned.node_left.delayed_timeout to null for all indices on elasticsearch-eqiad-psi (:9200), reverting previous test of 10m back to defaults [19:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:35] (PS2) Ryan Kemper: Revert "elastic: increase recovery time" [cookbooks] - https://gerrit.wikimedia.org/r/784724 (https://phabricator.wikimedia.org/T305994) (owner: Bking) [19:09:19] (CR) Dzahn: [C: -2] "ah, no, this is the template that is used to create the actual config from.. 
hrmm" [puppet] - 10https://gerrit.wikimedia.org/r/785191 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [19:15:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P26029 and previous config saved to /var/cache/conftool/dbconfig/20220421-191534-ladsgroup.json [19:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:16:46] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:16:54] 10SRE-OnFire, 10Data-Persistence (Consultation), 10Platform Engineering, 10Performance-Team (Radar), 10Sustainability (Incident Followup): 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the dat... - https://phabricator.wikimedia.org/T303499 [19:18:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [19:18:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [19:18:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 
(T298565)', diff saved to https://phabricator.wikimedia.org/P26030 and previous config saved to /var/cache/conftool/dbconfig/20220421-191847-ladsgroup.json [19:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:57] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:19:31] 10SRE, 10Traffic-Icebox, 10Wikimedia-Incident: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (10Krinkle) a:03Vgutierrez I believe this was resolved since and/or obsoleted by HAProxy, is that right? [19:23:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P26031 and previous config saved to /var/cache/conftool/dbconfig/20220421-192322-ladsgroup.json [19:23:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [19:23:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [19:23:26] papaul: re: T305568, does that notation you added to the description for aqs200[1-4] (e.g. B6U35 ge-6/0/34) indicate the row? Does this indicate these are in row 'B'? 
[19:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P26032 and previous config saved to /var/cache/conftool/dbconfig/20220421-192330-ladsgroup.json [19:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:35] T305568: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 [19:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:56] !log depooling & disabling puppet on cp2029 for some manual testing T303534 [19:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P26033 and previous config saved to /var/cache/conftool/dbconfig/20220421-193039-ladsgroup.json [19:30:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:30:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:30:45] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:30:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:30:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T306560)', diff saved to https://phabricator.wikimedia.org/P26034 and previous config saved to /var/cache/conftool/dbconfig/20220421-193052-ladsgroup.json [19:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T306560)', diff saved to https://phabricator.wikimedia.org/P26035 and previous config saved to /var/cache/conftool/dbconfig/20220421-193302-ladsgroup.json [19:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:09] (03PS2) 10Andrew Bogott: Renumber ns-recursor[0,1].openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/785182 [19:34:25] 10SRE, 10observability, 10Sustainability (Incident Followup), 10User-MoritzMuehlenhoff: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10Krinkle) [19:35:02] (03CR) 10Andrew Bogott: [C: 03+2] Renumber ns-recursor[0,1].openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/785182 (owner: 10Andrew Bogott) [19:38:59] 10SRE, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3), and 2 others: Fix unquoted URL parameters in Icinga health checks - https://phabricator.wikimedia.org/T304323 (10colewhite) [19:39:44] (03PS3) 10Dzahn: gitlab_runner: ensure the full path to the config location exists [puppet] - 10https://gerrit.wikimedia.org/r/785191 (https://phabricator.wikimedia.org/T297659) [19:40:03] (03PS4) 10Dzahn: gitlab_runner: ensure the full path to the config location exists [puppet] - 
10https://gerrit.wikimedia.org/r/785191 (https://phabricator.wikimedia.org/T297659) [19:41:26] (03PS5) 10Dzahn: gitlab_runner: ensure the full path to the config location exists [puppet] - 10https://gerrit.wikimedia.org/r/785191 (https://phabricator.wikimedia.org/T297659) [19:41:57] (03CR) 10Dzahn: [C: 03+2] gitlab_runner: ensure the full path to the config location exists [puppet] - 10https://gerrit.wikimedia.org/r/785191 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [19:43:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P26036 and previous config saved to /var/cache/conftool/dbconfig/20220421-194303-ladsgroup.json [19:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:44:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [19:44:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [19:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26037 and previous config saved to /var/cache/conftool/dbconfig/20220421-194419-ladsgroup.json [19:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1165', diff saved to https://phabricator.wikimedia.org/P26038 and previous config saved to /var/cache/conftool/dbconfig/20220421-194807-ladsgroup.json [19:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:56] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms [19:58:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P26039 and previous config saved to /var/cache/conftool/dbconfig/20220421-195808-ladsgroup.json [19:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:59:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26040 and previous config saved to /var/cache/conftool/dbconfig/20220421-195950-ladsgroup.json [19:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:56] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:00:05] brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T2000). 
[20:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P26041 and previous config saved to /var/cache/conftool/dbconfig/20220421-200154-ladsgroup.json [20:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P26042 and previous config saved to /var/cache/conftool/dbconfig/20220421-200312-ladsgroup.json [20:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:44] !log [puppetmaster1001:~] $ sudo puppet cert clean gitlab-runner2001.codfw.wmnet [20:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:02] !log [ganeti2021:~] $ sudo gnt-instance shutdown gitlab-runner2001.codfw.wmnet [20:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:04] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) a:05RobH→03Cmjohnson Ok, next steps: I've set the disk to identify flash with: perccli64 /c0/e64/s13 start locate ` root@dumpsdata1007:~# perccli64 /c0/e64/s13 start locate CLI Version = 00... 
[20:13:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P26043 and previous config saved to /var/cache/conftool/dbconfig/20220421-201313-ladsgroup.json [20:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:11] !log reimaging gitlab-runner2001.codfw.wmnet one more time to confirm things work from scratch now T297659 [20:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:16] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [20:14:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26044 and previous config saved to /var/cache/conftool/dbconfig/20220421-201455-ladsgroup.json [20:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P26045 and previous config saved to /var/cache/conftool/dbconfig/20220421-201659-ladsgroup.json [20:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T306560)', diff saved to https://phabricator.wikimedia.org/P26046 and previous config saved to /var/cache/conftool/dbconfig/20220421-201817-ladsgroup.json [20:18:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [20:18:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [20:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:24] T306560: Fix nullability of img_major_mime and oi_major_mime - 
https://phabricator.wikimedia.org/T306560 [20:18:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T306560)', diff saved to https://phabricator.wikimedia.org/P26047 and previous config saved to /var/cache/conftool/dbconfig/20220421-201825-ladsgroup.json [20:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T306560)', diff saved to https://phabricator.wikimedia.org/P26048 and previous config saved to /var/cache/conftool/dbconfig/20220421-202135-ladsgroup.json [20:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:36] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:24:24] (03PS5) 10Juan90264: Enable '$wgCopyUploadsDomains' to viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784717 [20:28:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P26049 and previous config saved to /var/cache/conftool/dbconfig/20220421-202818-ladsgroup.json [20:28:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [20:28:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [20:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, 
user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:28:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P26050 and previous config saved to /var/cache/conftool/dbconfig/20220421-202826-ladsgroup.json [20:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26051 and previous config saved to /var/cache/conftool/dbconfig/20220421-203003-ladsgroup.json [20:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P26052 and previous config saved to /var/cache/conftool/dbconfig/20220421-203204-ladsgroup.json [20:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:53] Hello brennen [20:36:17] Sorry for taking so long, I put two changes in Deployments [20:36:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P26053 and previous config saved to /var/cache/conftool/dbconfig/20220421-203640-ladsgroup.json [20:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:51] If it's still available, could you deploy it? 
[20:39:10] !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@bd28d80]: (no justification provided) [20:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:37] !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@bd28d80]: (no justification provided) (duration: 00m 27s) [20:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:51] !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@bd28d80]: (no justification provided) [20:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:58] !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@bd28d80]: (no justification provided) (duration: 00m 07s) [20:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:17] thcipriani ? [20:43:42] PROBLEM - Check size of conntrack table on gitlab-runner2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.72: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:44:20] ^ me..reimage in progress [20:44:28] PROBLEM - Check for large files in client bucket on gitlab-runner2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.72: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [20:44:38] PROBLEM - puppet last run on gitlab-runner2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.72: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:45:00] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab-runner2001.codfw.wmnet with reason: reimage [20:45:03] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab-runner2001.codfw.wmnet with reason: reimage [20:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26054 and previous config saved to /var/cache/conftool/dbconfig/20220421-204508-ladsgroup.json [20:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:45:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [20:45:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [20:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26055 and previous config saved to /var/cache/conftool/dbconfig/20220421-204532-ladsgroup.json [20:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:46] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Eevans) @Papaul does this notation you added refer to the row (and rack) location? For example: aqs2001: !!B6U35 ge-6/0/34!!, does mean row 'B'? 
[20:47:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P26056 and previous config saved to /var/cache/conftool/dbconfig/20220421-204709-ladsgroup.json
[20:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:52] mutante: Can you tell me if the backport will occur?
[20:48:10] Juan_90264: I don't know the answer, sorry
[20:48:12] jouncebot: now
[20:48:12] For the next 0 hour(s) and 11 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T2000)
[20:48:33] Okay
[20:49:00] Juan_90264: oooh.. I think it's because tomorrow is a WMF holiday
[20:49:04] so nobody will be working
[20:49:13] that likely means today is "like Friday"
[20:49:17] and no deploys on Friday
[20:49:26] thcipriani: right? ^
[20:50:50] !log re-enabled puppet and repooled cp2029
[20:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P26057 and previous config saved to /var/cache/conftool/dbconfig/20220421-205145-ladsgroup.json
[20:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:19] mutante: I understand, if thcipriani is online I wait to confirm this
[20:52:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P26058 and previous config saved to /var/cache/conftool/dbconfig/20220421-205256-ladsgroup.json
[20:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[20:53:04] RECOVERY - Check for large files in client bucket on gitlab-runner2001 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[20:54:10] RECOVERY - Check size of conntrack table on gitlab-runner2001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[20:54:55] mutante: Juan_90264 yes, often that is the case; however, I wasn't planning on doing that today. I wasn't around because I thought there weren't any patches in the window and walked away from the computer for a bit :)
[20:55:47] Juan_90264: this patch is failing CI so I won't deploy that one today https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/784722, also it's not a backport
[20:56:12] Juan_90264: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/784717 seems safe -- is there a bug it's attached to?
[20:59:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26059 and previous config saved to /var/cache/conftool/dbconfig/20220421-205906-ladsgroup.json
[20:59:07] (PS6) Juan90264: Enable '$wgCopyUploadsDomains' to viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/784717 (https://phabricator.wikimedia.org/T303577)
[20:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[20:59:28] thcipriani: Yes, https://phabricator.wikimedia.org/T303577
[20:59:42] Juan_90264: I think it's failing CI because the arrows are facing left? <= vs => ?
[21:00:36] (PS1) Urbanecm: GlobalUserSelectQueryBuilder: Do not fatal when no users are returned [extensions/CentralAuth] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/785207 (https://phabricator.wikimedia.org/T306535)
[21:01:08] mutante: I know they're doing it, but that's because of the writing. This writing is leaving the arrow backwards
[21:01:09] (CR) Thcipriani: [C: +2] Enable '$wgCopyUploadsDomains' to viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/784717 (https://phabricator.wikimedia.org/T303577) (owner: Juan90264)
[21:01:30] because it's a RTL language? I was wondering that
[21:01:58] (Merged) jenkins-bot: Enable '$wgCopyUploadsDomains' to viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/784717 (https://phabricator.wikimedia.org/T303577) (owner: Juan90264)
[21:02:34] mutante: Yes
[21:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:02:45] Perfect change merged!
[21:03:06] *Perfect merged!
[21:03:09] Juan_90264: live on mwdebug1002, check please :)
[21:03:29] Okay, I will test
[21:04:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[21:04:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[21:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P26060 and previous config saved to /var/cache/conftool/dbconfig/20220421-210414-ladsgroup.json
[21:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:04:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:04:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[21:04:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T306560)', diff saved to https://phabricator.wikimedia.org/P26061 and previous config saved to /var/cache/conftool/dbconfig/20220421-210650-ladsgroup.json
[21:06:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[21:06:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[21:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:56] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[21:06:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26062 and previous config saved to /var/cache/conftool/dbconfig/20220421-210658-ladsgroup.json
[21:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P26063 and previous config saved to /var/cache/conftool/dbconfig/20220421-210801-ladsgroup.json
[21:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:38] (PS1) Dzahn: gitlab::runner: ensure config dir is owned by non-privileged user [puppet] - https://gerrit.wikimedia.org/r/785198 (https://phabricator.wikimedia.org/T297659)
[21:09:17] (CR) jerkins-bot: [V: -1] gitlab::runner: ensure config dir is owned by non-privileged user [puppet] - https://gerrit.wikimedia.org/r/785198 (https://phabricator.wikimedia.org/T297659) (owner: Dzahn)
[21:09:39] thcipriani: Okay, everything seems to be ok, but I was missing activating "$wgCopyUploadsFromSpecialUpload"
[21:10:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26064 and previous config saved to /var/cache/conftool/dbconfig/20220421-211018-ladsgroup.json
[21:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:07] Juan_90264: do you need to make a change to your patch?
[21:13:01] thcipriani: Yes, because if there is no way to use this privilege
[21:13:24] (PS2) Dzahn: gitlab::runner: ensure config dir is owned by non-privileged user [puppet] - https://gerrit.wikimedia.org/r/785198 (https://phabricator.wikimedia.org/T297659)
[21:13:30] makes sense, thank you
[21:14:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26065 and previous config saved to /var/cache/conftool/dbconfig/20220421-211411-ladsgroup.json
[21:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:04] (PS1) Thcipriani: Revert "Enable '$wgCopyUploadsDomains' to viwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/785200
[21:15:27] (CR) Thcipriani: [C: +2] Revert "Enable '$wgCopyUploadsDomains' to viwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/785200 (owner: Thcipriani)
[21:15:57] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/34947/" [puppet] - https://gerrit.wikimedia.org/r/785198 (https://phabricator.wikimedia.org/T297659) (owner: Dzahn)
[21:16:17] (Merged) jenkins-bot: Revert "Enable '$wgCopyUploadsDomains' to viwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/785200 (owner: Thcipriani)
[21:18:49] Okay reverted
[21:19:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:19:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P26066 and previous config saved to /var/cache/conftool/dbconfig/20220421-212022-ladsgroup.json
[21:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:29] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[21:20:44] thcipriani: I'll have to create another change including what was missing, right? (I ask because I never had to do this)
[21:21:56] RECOVERY - puppet last run on gitlab-runner2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:22:23] Juan_90264: yes, sorry, should have said—please make a new patch set for that change. Let's schedule it for a different backport window as we're 20 minutes over on this window already. Your second patch shouldn't be merged in this window---you should get code review from someone who works on that extension.
[21:22:29] ^ yay, got it working (so that a gitlab-runner runs as non-root on bullseye)
[21:22:44] neat! kudos mutante
[21:23:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P26067 and previous config saved to /var/cache/conftool/dbconfig/20220421-212306-ladsgroup.json
[21:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:12] nice mutante
[21:23:20] ;) ty, credit to J.elto as well
[21:24:55] thcipriani: Okay, if I still have time I can create another change to resolve soon. And then I see the problem of the second change
[21:25:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P26068 and previous config saved to /var/cache/conftool/dbconfig/20220421-212523-ladsgroup.json
[21:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:48] except.. it fails to "unregister" the existing runner ..
[21:26:03] Juan_90264: sounds good, and we'll have to deploy that change another day :)
[21:26:08] "status=only http or https scheme supported" hmmm
[21:29:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26069 and previous config saved to /var/cache/conftool/dbconfig/20220421-212916-ladsgroup.json
[21:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:44] thcipriani: Perfect, change created: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/785208
[21:35:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P26070 and previous config saved to /var/cache/conftool/dbconfig/20220421-213529-ladsgroup.json
[21:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:37] (CR) Sharvaniharan: "Please review when you get a chance :-)" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/783874 (https://phabricator.wikimedia.org/T306385) (owner: 10Sharvaniharan) [21:38:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P26071 and previous config saved to /var/cache/conftool/dbconfig/20220421-213811-ladsgroup.json [21:38:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [21:38:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [21:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:38:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P26072 and previous config saved to /var/cache/conftool/dbconfig/20220421-213819-ladsgroup.json [21:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:45] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) perccli64 /c0 show all also shows a physical disk list, we'll want to run to confirm it sees the disk gone when removed. 
[21:40:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P26073 and previous config saved to /var/cache/conftool/dbconfig/20220421-214027-ladsgroup.json [21:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P26074 and previous config saved to /var/cache/conftool/dbconfig/20220421-214035-ladsgroup.json [21:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:39] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab-runner1001.eqiad.wmnet with reason: reimage [21:40:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab-runner1001.eqiad.wmnet with reason: reimage [21:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:36] thcipriani: It's still available? If not, no problem [21:42:22] (03CR) 10Cwhite: "Manually applied to grafana-next.wm.o for testing. @phedenskog, does it function as you expect?" 
[puppet] - 10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [21:42:35] !log shutting down and reimaging gitlab-runner1001 T297659 [21:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:41] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [21:44:18] (03PS3) 10Dzahn: site: use appserver in codfw C3, cleanup duplicate insetup definition [puppet] - 10https://gerrit.wikimedia.org/r/785147 (https://phabricator.wikimedia.org/T290192) (owner: 10Jelto) [21:44:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26075 and previous config saved to /var/cache/conftool/dbconfig/20220421-214422-ladsgroup.json [21:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:44:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [21:44:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [21:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26076 and previous config saved to /var/cache/conftool/dbconfig/20220421-214445-ladsgroup.json [21:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[21:45:14] (03CR) 10Dzahn: "thank you! the root issue is that we have no workflow that ensures a follow-up task is created after dcops is done procuring" [puppet] - 10https://gerrit.wikimedia.org/r/785147 (https://phabricator.wikimedia.org/T290192) (owner: 10Jelto) [21:46:00] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:46:54] PROBLEM - Check systemd state on gitlab-runner2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service,docker-resource-monitor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:50:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P26077 and previous config saved to /var/cache/conftool/dbconfig/20220421-215034-ladsgroup.json [21:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:53:29] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Papaul) @Eevans yes B is row B , 6 is the rack number and U35 is the position of the server in the rack (row B rack 6 position 35) 
[21:55:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P26078 and previous config saved to /var/cache/conftool/dbconfig/20220421-215532-ladsgroup.json [21:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26079 and previous config saved to /var/cache/conftool/dbconfig/20220421-215540-ladsgroup.json [21:55:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [21:55:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [21:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:45] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [21:55:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26080 and previous config saved to /var/cache/conftool/dbconfig/20220421-215547-ladsgroup.json [21:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26081 and previous config saved to /var/cache/conftool/dbconfig/20220421-215807-ladsgroup.json [21:58:09] 10SRE, 10Wikimedia-Site-requests, 10Chinese-Sites: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Stang) Boldly move forward 
since viwiki deployed this feature ~ [21:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:37] !log gitlab-runner2001 - installing apparmor ('apparmor' is the user utilities package and was NOT installed, libapparmor1 WAS installed), this caused bug https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1808456.html after upgrading gitlab-runner to bullseye because bullseye comes with libapparmor1 by default as opposed to before T297659 [22:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:44] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [22:02:01] !log gitlab-runner2001 - systemctl start docker-resource-monitor ; systemctl start docker-gc [22:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:26] RECOVERY - Check systemd state on gitlab-runner2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P26082 and previous config saved to /var/cache/conftool/dbconfig/20220421-220539-ladsgroup.json [22:05:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [22:05:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [22:05:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [22:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:45] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, 
user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:05:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [22:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P26083 and previous config saved to /var/cache/conftool/dbconfig/20220421-220552-ladsgroup.json [22:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P26084 and previous config saved to /var/cache/conftool/dbconfig/20220421-221037-ladsgroup.json [22:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P26085 and previous config saved to /var/cache/conftool/dbconfig/20220421-221312-ladsgroup.json [22:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:30] (03PS1) 10Dzahn: docker: ensure apparmor package is installed if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/785226 [22:15:03] (03PS2) 10Dzahn: docker: ensure apparmor package is installed if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/785226 [22:15:14] (03PS3) 10Dzahn: docker: ensure apparmor package is 
installed if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/785226 [22:15:20] (03PS4) 10Dzahn: docker: ensure apparmor package is installed if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/785226 [22:16:46] (03CR) 10jerkins-bot: [V: 04-1] docker: ensure apparmor package is installed if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [22:17:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26086 and previous config saved to /var/cache/conftool/dbconfig/20220421-221728-ladsgroup.json [22:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:34] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:18:19] (03PS5) 10Dzahn: docker: ensure apparmor package is installed if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/785226 [22:21:17] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) a:05Cmjohnson→03RobH Update: Chris pulled the offline SSD and I confirmed OS saw it go away, then after 5 minutes put it back into place and the system detected it and started an automatic reb... 
[22:22:32] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/34948/" [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [22:24:38] (03PS1) 10Dzahn: gitlab::runner: if on buster, ensure apparmor package is installed [puppet] - 10https://gerrit.wikimedia.org/r/785228 (https://phabricator.wikimedia.org/T297659) [22:25:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P26087 and previous config saved to /var/cache/conftool/dbconfig/20220421-222534-ladsgroup.json [22:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:25:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P26088 and previous config saved to /var/cache/conftool/dbconfig/20220421-222542-ladsgroup.json [22:25:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [22:25:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [22:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P26089 and previous config saved to /var/cache/conftool/dbconfig/20220421-222550-ladsgroup.json [22:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P26090 and previous config saved to /var/cache/conftool/dbconfig/20220421-222817-ladsgroup.json [22:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:34] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/34949/gitlab-runner2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/785228 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [22:30:56] (03PS1) 10Stang: Enable "upload_by_url" feature on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785229 (https://phabricator.wikimedia.org/T142991) [22:32:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26091 and previous config saved to /var/cache/conftool/dbconfig/20220421-223233-ladsgroup.json [22:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:53] Returned [22:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:33:51] 10SRE, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Patch-For-Review: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Stang) 05Stalled→03Open a:03Stang [22:33:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P26092 and previous config saved to /var/cache/conftool/dbconfig/20220421-223357-ladsgroup.json 
[22:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:34:51] (03CR) 10Dzahn: "I noticed when running puppet for the first time on a new host there are errors because profile::systemd::timesyncd tries to ensure file /" [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [22:36:06] thcipriani: I'll leave it for another backport window, thanks for taking the time on the first change! (Which unfortunately was later reverted, to add what was missing) [22:36:33] (03CR) 10Dzahn: standard::ntp: move standard ntp to its own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/730852 (owner: 10Jbond) [22:37:39] Goodbye and good morning, good afternoon or good night! 
[22:40:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P26093 and previous config saved to /var/cache/conftool/dbconfig/20220421-224039-ladsgroup.json [22:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:52] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [22:42:07] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [22:43:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26094 and previous config saved to /var/cache/conftool/dbconfig/20220421-224322-ladsgroup.json [22:43:23] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) Todo: * test on other distros we use * get partman to work with this, as our existing recipes expect the flexbays to be the SDA virtual drive and the new controller always puts them at a higher ID... 
[22:43:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [22:43:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [22:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:28] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [22:43:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [22:43:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [22:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [22:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [22:44:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [22:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [22:44:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [22:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26095 and previous config saved to /var/cache/conftool/dbconfig/20220421-224437-ladsgroup.json [22:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26096 and previous config saved to /var/cache/conftool/dbconfig/20220421-224657-ladsgroup.json [22:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26097 and previous config saved to /var/cache/conftool/dbconfig/20220421-224738-ladsgroup.json [22:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P26098 and previous config saved to /var/cache/conftool/dbconfig/20220421-224902-ladsgroup.json [22:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:16] !log gitlab - deleting runner 'ubuntu..something' that has been offline for 2 months, not sure who made it [22:52:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:48] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Decom cloudservices200[2,3]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306669 (10Andrew) [22:54:03] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Decom cloudservices200[2,3]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306669 (10Andrew) a:05Papaul→03Andrew [22:55:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P26099 and previous config saved to /var/cache/conftool/dbconfig/20220421-225544-ladsgroup.json [22:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:02:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26100 and previous config saved to /var/cache/conftool/dbconfig/20220421-230202-ladsgroup.json [23:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26101 and previous config saved to /var/cache/conftool/dbconfig/20220421-230243-ladsgroup.json [23:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:49] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T298565 [23:03:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [23:03:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [23:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26102 and previous config saved to /var/cache/conftool/dbconfig/20220421-230307-ladsgroup.json [23:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P26103 and previous config saved to /var/cache/conftool/dbconfig/20220421-230408-ladsgroup.json [23:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P26104 and previous config saved to /var/cache/conftool/dbconfig/20220421-231049-ladsgroup.json [23:10:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:10:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:56] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, 
user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26105 and previous config saved to /var/cache/conftool/dbconfig/20220421-231707-ladsgroup.json [23:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P26106 and previous config saved to /var/cache/conftool/dbconfig/20220421-231913-ladsgroup.json [23:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [23:19:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:19:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [23:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [23:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with 
reason: Maintenance [23:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26107 and previous config saved to /var/cache/conftool/dbconfig/20220421-232153-ladsgroup.json [23:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [23:25:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [23:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26108 and previous config saved to /var/cache/conftool/dbconfig/20220421-233212-ladsgroup.json [23:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:17] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [23:36:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26109 and previous config saved to /var/cache/conftool/dbconfig/20220421-233658-ladsgroup.json [23:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26110 and previous config saved to /var/cache/conftool/dbconfig/20220421-235203-ladsgroup.json [23:52:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [23:58:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [23:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P26111 and previous config saved to /var/cache/conftool/dbconfig/20220421-235814-ladsgroup.json [23:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown