[00:00:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23650 and previous config saved to /var/cache/conftool/dbconfig/20220330-000011-ladsgroup.json [00:00:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [00:00:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [00:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:00:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23651 and previous config saved to /var/cache/conftool/dbconfig/20220330-000019-ladsgroup.json [00:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:01] I'll deploy and see what happens [00:01:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:33] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:02:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [00:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:27] !log catrope@deploy1002 Started scap: Update Kashmiri namespace names (T304790) [00:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:32] T304790: Update Namespace translations on Ks Wiki - https://phabricator.wikimedia.org/T304790 [00:02:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:21] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:13] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:53] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:00] !log catrope@deploy1002 Scap failed!: 6/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. [00:07:00] !log catrope@deploy1002 scap failed: RuntimeError Scap failed!: 6/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. 
(duration: 04m 32s) [00:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:01] (PS1) Catrope: Revert "Revert "Revert "End migration mode""" [skins/Vector] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774987 [00:09:07] (CR) Catrope: [C: +2] Revert "Revert "Revert "End migration mode""" [skins/Vector] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774987 (owner: Catrope) [00:09:32] (Abandoned) Catrope: Revert "Revert "Revert "End migration mode""" [skins/Vector] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774987 (owner: Catrope) [00:09:53] !log catrope@deploy1002 Started scap: Update Kashmiri namespace names (T304790) [00:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:58] T304790: Update Namespace translations on Ks Wiki - https://phabricator.wikimedia.org/T304790 [00:10:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23652 and previous config saved to /var/cache/conftool/dbconfig/20220330-001010-ladsgroup.json [00:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:22] !log catrope@deploy1002 Scap failed!: 8/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. [00:10:22] !log catrope@deploy1002 scap failed: RuntimeError Scap failed!: 8/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. 
(duration: 00m 28s) [00:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:30] !log catrope@deploy1002 Started scap: Update Kashmiri namespace names (T304790) [00:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:58] !log catrope@deploy1002 Scap failed!: 9/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. [00:11:58] !log catrope@deploy1002 scap failed: RuntimeError Scap failed!: 9/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. (duration: 00m 28s) [00:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:04] !log catrope@deploy1002 Started scap: Update Kashmiri namespace names (T304790) [00:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:22] (Restored) Catrope: Revert "Revert "Revert "End migration mode""" [skins/Vector] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774987 (owner: Catrope) [00:13:28] (CR) Catrope: [C: +2] Revert "Revert "Revert "End migration mode""" [skins/Vector] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774987 (owner: Catrope) [00:15:12] Still not working.... 
[00:15:35] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 279 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:16:43] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:19:39] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 395 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:20:27] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 582 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:22:39] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:23:14] RoanKattouw: Does this type of change usually take time to work? [00:23:32] It's not done deploying yet, I had some issues [00:23:35] Should be almost done now [00:23:59] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:24:09] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:24:34] !log catrope@deploy1002 Finished scap: Update Kashmiri namespace names (T304790) (duration: 12m 29s) [00:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:41] T304790: Update Namespace translations on Ks Wiki - https://phabricator.wikimedia.org/T304790 [00:25:01] (BlazegraphJvmQuakeWarnGC) firing: (8) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [00:25:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23653 and previous config saved to /var/cache/conftool/dbconfig/20220330-002515-ladsgroup.json [00:25:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [00:25:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [00:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:25:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23654 and previous config saved to /var/cache/conftool/dbconfig/20220330-002523-ladsgroup.json [00:25:24] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:00] Juan_90264: OK now it's finished [00:27:31] Okay [00:27:35] And it seems to be working as far as I can tell? [00:27:50] But I can't read Arabic, so it's not that easy for me to verify [00:28:54] Now this is how it's working, the changes are working [00:29:28] Thanks RoanKattouw! [00:29:45] Thanks for bearing with me Juan_90264 ! Sorry it took so long [00:29:47] (Merged) jenkins-bot: Revert "Revert "Revert "End migration mode""" [skins/Vector] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774987 (owner: Catrope) [00:30:29] No problem, RoanKattouw [00:32:43] I'm glad I was able to solve this task (well, just need to check and deploy) [00:34:00] Now bye [00:38:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:38:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23655 and previous config saved to /var/cache/conftool/dbconfig/20220330-010034-ladsgroup.json [01:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[01:00:40] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:10:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23656 and previous config saved to /var/cache/conftool/dbconfig/20220330-011004-ladsgroup.json [01:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:15:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23657 and previous config saved to /var/cache/conftool/dbconfig/20220330-011539-ladsgroup.json [01:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23658 and previous config saved to /var/cache/conftool/dbconfig/20220330-012509-ladsgroup.json [01:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300775)', diff saved to https://phabricator.wikimedia.org/P23659 and previous config saved to /var/cache/conftool/dbconfig/20220330-012542-marostegui.json [01:25:45] (JobUnavailable) firing: Reduced availability for job trafficserver in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:50] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [01:30:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23660 and previous config saved to /var/cache/conftool/dbconfig/20220330-013044-ladsgroup.json [01:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:51] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23661 and previous config saved to /var/cache/conftool/dbconfig/20220330-014014-ladsgroup.json [01:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P23662 and previous config saved to /var/cache/conftool/dbconfig/20220330-014047-marostegui.json [01:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23663 and previous config saved to /var/cache/conftool/dbconfig/20220330-014549-ladsgroup.json [01:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:55] T298565: Fix mismatching field type of user table for columns 
user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:46:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [01:46:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [01:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23664 and previous config saved to /var/cache/conftool/dbconfig/20220330-014621-ladsgroup.json [01:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23665 and previous config saved to /var/cache/conftool/dbconfig/20220330-014829-ladsgroup.json [01:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23666 and previous config saved to /var/cache/conftool/dbconfig/20220330-015519-ladsgroup.json [01:55:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [01:55:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [01:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:26] T298565: 
Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:55:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23667 and previous config saved to /var/cache/conftool/dbconfig/20220330-015527-ladsgroup.json [01:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P23668 and previous config saved to /var/cache/conftool/dbconfig/20220330-015552-marostegui.json [01:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23669 and previous config saved to /var/cache/conftool/dbconfig/20220330-020334-ladsgroup.json [02:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:10:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300775)', diff saved to https://phabricator.wikimedia.org/P23670 and previous config saved to /var/cache/conftool/dbconfig/20220330-021058-marostegui.json [02:11:00] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1106.eqiad.wmnet with reason: Maintenance [02:11:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1106.eqiad.wmnet with reason: Maintenance [02:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [02:11:04] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [02:11:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [02:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T300775)', diff saved to https://phabricator.wikimedia.org/P23671 and previous config saved to /var/cache/conftool/dbconfig/20220330-021111-marostegui.json [02:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23672 and previous config saved to /var/cache/conftool/dbconfig/20220330-021839-ladsgroup.json [02:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:22:15] (PS1) Samwilson: Enable Realtime Preview on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/775012 (https://phabricator.wikimedia.org/T302506) 
[02:33:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23673 and previous config saved to /var/cache/conftool/dbconfig/20220330-023344-ladsgroup.json [02:33:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [02:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [02:33:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:33:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [02:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [02:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [02:34:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [02:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1139.eqiad.wmnet with reason: Maintenance [02:34:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [02:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [02:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [02:34:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [02:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [02:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23674 and previous config saved to /var/cache/conftool/dbconfig/20220330-023426-ladsgroup.json [02:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23675 and previous config saved to 
/var/cache/conftool/dbconfig/20220330-023634-ladsgroup.json [02:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23676 and previous config saved to /var/cache/conftool/dbconfig/20220330-024055-ladsgroup.json [02:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:00] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:51:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20220330-025139-ladsgroup.json [02:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P23677 and previous config saved to /var/cache/conftool/dbconfig/20220330-025600-ladsgroup.json [02:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23678 and previous config saved to /var/cache/conftool/dbconfig/20220330-030649-ladsgroup.json [03:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:45] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:11:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P23679 and previous config saved to /var/cache/conftool/dbconfig/20220330-031105-ladsgroup.json [03:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23680 and previous config saved to /var/cache/conftool/dbconfig/20220330-032154-ladsgroup.json [03:21:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [03:21:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [03:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:00] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:22:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23681 and previous config saved to /var/cache/conftool/dbconfig/20220330-032201-ladsgroup.json [03:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23682 and previous config saved 
to /var/cache/conftool/dbconfig/20220330-032610-ladsgroup.json [03:26:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [03:26:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [03:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23683 and previous config saved to /var/cache/conftool/dbconfig/20220330-032617-ladsgroup.json [03:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23684 and previous config saved to /var/cache/conftool/dbconfig/20220330-033920-ladsgroup.json [03:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:50:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23685 and previous config saved to /var/cache/conftool/dbconfig/20220330-035013-ladsgroup.json [03:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:19] T298565: Fix mismatching field type of user table for columns 
user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:54:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P23686 and previous config saved to /var/cache/conftool/dbconfig/20220330-035425-ladsgroup.json [03:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23687 and previous config saved to /var/cache/conftool/dbconfig/20220330-040518-ladsgroup.json [04:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P23688 and previous config saved to /var/cache/conftool/dbconfig/20220330-040930-ladsgroup.json [04:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23689 and previous config saved to /var/cache/conftool/dbconfig/20220330-042023-ladsgroup.json [04:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:24:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23690 and previous config saved to /var/cache/conftool/dbconfig/20220330-042435-ladsgroup.json [04:24:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [04:24:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [04:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:24:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:24:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23691 and previous config saved to /var/cache/conftool/dbconfig/20220330-042443-ladsgroup.json [04:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:16] (BlazegraphJvmQuakeWarnGC) firing: (8) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [04:35:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23692 and previous config saved to /var/cache/conftool/dbconfig/20220330-043528-ladsgroup.json [04:35:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [04:35:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [04:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:35:34] T298565: Fix mismatching field type of user table for columns 
user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:35:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23693 and previous config saved to /var/cache/conftool/dbconfig/20220330-043536-ladsgroup.json [04:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23694 and previous config saved to /var/cache/conftool/dbconfig/20220330-043744-ladsgroup.json [04:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23695 and previous config saved to /var/cache/conftool/dbconfig/20220330-043758-ladsgroup.json [04:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23696 and previous config saved to /var/cache/conftool/dbconfig/20220330-045249-ladsgroup.json [04:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P23697 and previous config saved to /var/cache/conftool/dbconfig/20220330-045303-ladsgroup.json [04:53:07] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance [04:57:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance [04:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T297189)', diff saved to https://phabricator.wikimedia.org/P23698 and previous config saved to /var/cache/conftool/dbconfig/20220330-045747-marostegui.json [04:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:54] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [05:04:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179 for downgrade', diff saved to https://phabricator.wikimedia.org/P23699 and previous config saved to /var/cache/conftool/dbconfig/20220330-050406-root.json [05:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23700 and previous config saved to /var/cache/conftool/dbconfig/20220330-050754-ladsgroup.json [05:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P23701 and previous config saved to /var/cache/conftool/dbconfig/20220330-050808-ladsgroup.json [05:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:13] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1179 (re)pooling @ 1%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23702 and previous config saved to /var/cache/conftool/dbconfig/20220330-051012-root.json [05:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:08] (PS1) Marostegui: db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/775193 (https://phabricator.wikimedia.org/T304933) [05:14:29] (CR) Marostegui: [C: +2] db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/775193 (https://phabricator.wikimedia.org/T304933) (owner: Marostegui) [05:15:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1160 for reboot', diff saved to https://phabricator.wikimedia.org/P23703 and previous config saved to /var/cache/conftool/dbconfig/20220330-051524-root.json [05:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:17] (PS1) Marostegui: Revert "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/774989 [05:21:55] (CR) Marostegui: [C: +2] Revert "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/774989 (owner: Marostegui) [05:22:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 5%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P23704 and previous config saved to /var/cache/conftool/dbconfig/20220330-052241-root.json [05:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23705 and previous config saved to /var/cache/conftool/dbconfig/20220330-052259-ladsgroup.json [05:23:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:23:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:04] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:23:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [05:23:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [05:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23706 and previous config saved to /var/cache/conftool/dbconfig/20220330-052312-ladsgroup.json [05:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23707 and previous config saved to /var/cache/conftool/dbconfig/20220330-052320-ladsgroup.json [05:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [05:23:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [05:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23708 and previous config saved to /var/cache/conftool/dbconfig/20220330-052344-ladsgroup.json [05:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 5%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23709 and previous config saved to /var/cache/conftool/dbconfig/20220330-052516-root.json [05:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23710 and previous config saved to /var/cache/conftool/dbconfig/20220330-052525-ladsgroup.json [05:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:45] (JobUnavailable) firing: Reduced availability for job trafficserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:28:12] (PS1) Marostegui: site.pp: Add db1161 to s5 [puppet] - https://gerrit.wikimedia.org/r/775194 [05:28:55] (Abandoned) Marostegui: site.pp: Add db1161 to s5 [puppet] - https://gerrit.wikimedia.org/r/775194 (owner: Marostegui) [05:30:53] (PS1) Marostegui: mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/775195 (https://phabricator.wikimedia.org/T303798) [05:31:23] (PS1) Marostegui: wmnet: Update s5-master CNAME [dns] -
https://gerrit.wikimedia.org/r/775196 (https://phabricator.wikimedia.org/T303798) [05:31:54] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/775195 (https://phabricator.wikimedia.org/T303798) (owner: Marostegui) [05:32:09] (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/775196 (https://phabricator.wikimedia.org/T303798) (owner: Marostegui) [05:36:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23711 and previous config saved to /var/cache/conftool/dbconfig/20220330-053640-ladsgroup.json [05:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:49] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: After kernel upgrade', diff saved to https://phabricator.wikimedia.org/P23712 and previous config saved to /var/cache/conftool/dbconfig/20220330-053745-root.json [05:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23713 and previous config saved to /var/cache/conftool/dbconfig/20220330-054021-root.json [05:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23714 and previous config saved to
/var/cache/conftool/dbconfig/20220330-054032-ladsgroup.json [05:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance [05:45:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance [05:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T298557)', diff saved to https://phabricator.wikimedia.org/P23715 and previous config saved to /var/cache/conftool/dbconfig/20220330-054548-marostegui.json [05:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:56] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [05:50:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P23716 and previous config saved to /var/cache/conftool/dbconfig/20220330-055045-root.json [05:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P23717 and previous config saved to /var/cache/conftool/dbconfig/20220330-055145-ladsgroup.json [05:51:48] !log dbmaint s6@eqiad T297189 [05:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:54] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [05:55:26] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'db1179 (re)pooling @ 25%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23718 and previous config saved to /var/cache/conftool/dbconfig/20220330-055525-root.json [05:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23719 and previous config saved to /var/cache/conftool/dbconfig/20220330-055537-ladsgroup.json [05:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23720 and previous config saved to /var/cache/conftool/dbconfig/20220330-060548-root.json [06:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P23721 and previous config saved to /var/cache/conftool/dbconfig/20220330-060650-ladsgroup.json [06:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:44] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2004.codfw.wmnet [06:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:05] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:10:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23722 and previous config saved to /var/cache/conftool/dbconfig/20220330-061029-root.json [06:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:44] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23723 and previous config saved to /var/cache/conftool/dbconfig/20220330-061042-ladsgroup.json [06:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:10:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:10:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:50] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:10:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23724 and previous config saved to /var/cache/conftool/dbconfig/20220330-061051-ladsgroup.json [06:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:09] !log restart rsyslogd on ml-serve1001 [06:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23725 and previous config 
saved to /var/cache/conftool/dbconfig/20220330-061259-ladsgroup.json [06:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2004.codfw.wmnet [06:15:22] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:47] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2005.codfw.wmnet [06:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P23726 and previous config saved to /var/cache/conftool/dbconfig/20220330-062052-root.json [06:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298565)', diff saved to https://phabricator.wikimedia.org/P23727 and previous config saved to /var/cache/conftool/dbconfig/20220330-062155-ladsgroup.json [06:21:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [06:21:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [06:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:01] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email 
on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:22:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23728 and previous config saved to /var/cache/conftool/dbconfig/20220330-062203-ladsgroup.json [06:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23729 and previous config saved to /var/cache/conftool/dbconfig/20220330-062533-root.json [06:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23730 and previous config saved to /var/cache/conftool/dbconfig/20220330-062804-ladsgroup.json [06:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2005.codfw.wmnet [06:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki-history-drop-snapshot.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2006.codfw.wmnet [06:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:40] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet [06:34:42] !log elukey@cumin1001 END (FAIL) - 
Cookbook sre.hosts.reboot-single (exit_code=99) for host ml-serve1001.eqiad.wmnet [06:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:02] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet [06:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23731 and previous config saved to /var/cache/conftool/dbconfig/20220330-063522-ladsgroup.json [06:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:35:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23732 and previous config saved to /var/cache/conftool/dbconfig/20220330-063556-root.json [06:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:01] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 100 probes of 672 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:39:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:39:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with 
reason: Maintenance [06:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2006.codfw.wmnet [06:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23733 and previous config saved to /var/cache/conftool/dbconfig/20220330-064037-root.json [06:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1001.eqiad.wmnet [06:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23734 and previous config saved to /var/cache/conftool/dbconfig/20220330-064309-ladsgroup.json [06:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:15] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 61 probes of 672 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:46:45] (CR) Majavah: httpbb: follow-up to 'fix status code checks for CodeReview redirects' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774981 (https://phabricator.wikimedia.org/T205361) (owner: Dzahn) [06:48:46] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2007.codfw.wmnet [06:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:35]
!log updated scap to 4.5.0 on all hosts - T304134 [06:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:40] T304134: Deploy Scap version 4.5.0 - https://phabricator.wikimedia.org/T304134 [06:50:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P23735 and previous config saved to /var/cache/conftool/dbconfig/20220330-065027-ladsgroup.json [06:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23736 and previous config saved to /var/cache/conftool/dbconfig/20220330-065100-root.json [06:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2007.codfw.wmnet [06:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23737 and previous config saved to /var/cache/conftool/dbconfig/20220330-065814-ladsgroup.json [06:58:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [06:58:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [06:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T298565 [06:58:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23738 and previous config saved to /var/cache/conftool/dbconfig/20220330-065822-ladsgroup.json [06:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:55] SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (Joe) p:Triage→High a:Joe→None >>! In T303857#7811250, @herron wrote: > Rem... [07:00:05] Amir1, awight, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220330T0700). [07:00:05] samwilson: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:52] Amir1, awight, Urbanecm, or taavi: I'm here.
[07:01:18] I'm here but would prefer that someone else would deploy [07:03:20] (03CR) 10Majavah: [C: 03+2] Enable Realtime Preview on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775012 (https://phabricator.wikimedia.org/T302506) (owner: 10Samwilson) [07:03:24] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2008.codfw.wmnet [07:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:20] (03Merged) 10jenkins-bot: Enable Realtime Preview on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775012 (https://phabricator.wikimedia.org/T302506) (owner: 10Samwilson) [07:04:43] samwilson: pulled to mwdebug1001, please test [07:05:15] taavi: testing now, thanks [07:05:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P23739 and previous config saved to /var/cache/conftool/dbconfig/20220330-070532-ladsgroup.json [07:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:45] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:05:53] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10Joe) [07:06:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P23740 and previous config saved to /var/cache/conftool/dbconfig/20220330-070604-root.json [07:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:37] !log restart rsyslog on ml-serve1002 [07:06:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:56] taavi: everything looks good [07:07:03] thanks, syncing [07:08:06] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:775012|Enable Realtime Preview on testwiki (T302506)]] (duration: 00m 56s) [07:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:11] T302506: Deploy to test wiki for user testing purposes - https://phabricator.wikimedia.org/T302506 [07:08:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:26] deployed! [07:08:37] !log UTC morning deploys done [07:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:09:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:15] taavi: thank you! 
:-) [07:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2008.codfw.wmnet [07:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:25] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2009.codfw.wmnet [07:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:52] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2009.codfw.wmnet [07:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:58] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl1001.eqiad.wmnet [07:15:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:16:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:16:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:16:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T297189)', diff saved to https://phabricator.wikimedia.org/P23741 and previous config saved to /var/cache/conftool/dbconfig/20220330-071650-marostegui.json [07:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:57] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [07:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:36] (03PS1) 10Filippo Giunchedi: prometheus: pin trafficserver/varnish jobs to class not cluster [puppet] - 10https://gerrit.wikimedia.org/r/775251 [07:20:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P23742 and previous config saved to /var/cache/conftool/dbconfig/20220330-072037-ladsgroup.json [07:20:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:20:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [07:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:20:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23743 and previous 
config saved to /var/cache/conftool/dbconfig/20220330-072045-ladsgroup.json [07:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:41] (03CR) 10MVernon: [C: 03+2] swift::ring: deploy by tarball not individual files [puppet] - 10https://gerrit.wikimedia.org/r/769943 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [07:22:05] (03CR) 10Filippo Giunchedi: [C: 03+2] logging: bump alerts logs retention [puppet] - 10https://gerrit.wikimedia.org/r/774364 (https://phabricator.wikimedia.org/T304924) (owner: 10Filippo Giunchedi) [07:24:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl1001.eqiad.wmnet [07:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298557)', diff saved to https://phabricator.wikimedia.org/P23744 and previous config saved to /var/cache/conftool/dbconfig/20220330-072613-marostegui.json [07:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:19] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [07:26:19] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl1002.eqiad.wmnet [07:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:47] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2001.codfw.wmnet [07:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:29] !log updating libapache2-mod-auth-cas on bullseye hosts [07:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:09] !log 
elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl1002.eqiad.wmnet [07:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:16] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/pcc-worker1001/34619/" [puppet] - 10https://gerrit.wikimedia.org/r/775251 (owner: 10Filippo Giunchedi) [07:33:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl2001.codfw.wmnet [07:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:33] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1001.eqiad.wmnet [07:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:50] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2002.codfw.wmnet [07:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:28] (03PS1) 10Marostegui: filtered_tables.txt: Remove ft_title and ft_namespace [puppet] - 10https://gerrit.wikimedia.org/r/775253 (https://phabricator.wikimedia.org/T297189) [07:39:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl2002.codfw.wmnet [07:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:51] (03PS1) 10DCausse: wdqs: tune jvmquake settings [puppet] - 10https://gerrit.wikimedia.org/r/775254 [07:41:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P23745 and previous config saved to /var/cache/conftool/dbconfig/20220330-074118-marostegui.json [07:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1001.eqiad.wmnet [07:42:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:45] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1002.eqiad.wmnet [07:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:54] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter2003.codfw.wmnet [07:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:51] (03CR) 10Jakob: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 (owner: 10Jakob) [07:46:42] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade Fastnetmon to 1.2.0 - https://phabricator.wikimedia.org/T271228 (10ayounsi) It's back! https://github.com/pavel-odintsov/fastnetmon/releases/tag/v1.2.0 :) [07:48:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter2003.codfw.wmnet [07:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1002.eqiad.wmnet [07:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:41] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter2004.codfw.wmnet [07:50:44] (03CR) 10Ayounsi: [C: 03+1] ipmi: add remove_boot_override, improve force_pxe (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/774926 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans) [07:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T297189)', diff saved to https://phabricator.wikimedia.org/P23746 and previous config saved to /var/cache/conftool/dbconfig/20220330-075303-marostegui.json [07:53:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:09] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [07:54:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter2004.codfw.wmnet [07:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:49] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter1003.eqiad.wmnet [07:54:52] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.reimage: fix message for downtime (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/774882 (owner: 10Volans) [07:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:31] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1003.eqiad.wmnet [07:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:56] ah T304134 got us the latest scap version :] [07:55:56] T304134: Deploy Scap version 4.5.0 - https://phabricator.wikimedia.org/T304134 [07:56:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P23747 and previous config saved to /var/cache/conftool/dbconfig/20220330-075623-marostegui.json [07:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:35] I am going to promote 1.39.0-wmf.5 to group1 wikis [07:57:36] hashar: could you try the new integrated deploy-promote? :) [07:57:47] sure [07:57:50] what is the command? 
[07:58:19] I was following the instructions and going to hit: `~/release/bin/deploy-promote group1` [07:58:24] it behaves exactly the same as the old `tools/release` deploy promote, but it's part of scap now [07:58:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23748 and previous config saved to /var/cache/conftool/dbconfig/20220330-075838-ladsgroup.json [07:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:45] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:58:47] I don't see it in `scap --help` [07:59:03] oh [07:59:05] here it is [07:59:12] `scap deploy-promote` [07:59:26] do you want to pair on it? I would love to see it in action [07:59:26] I am going to grab a coffee first cause clearly I am not fully awake yet despite waking up at 5am :D [07:59:42] oh man... [07:59:44] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.reimage: call Ipmi.remove_boot_override [cookbooks] - 10https://gerrit.wikimedia.org/r/774927 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans) [08:00:05] hashar and jeena: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220330T0800). 
[08:00:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host orespoolcounter1003.eqiad.wmnet [08:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:27] sure [08:01:42] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/774882 (owner: 10Volans) [08:02:04] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host orespoolcounter1004.eqiad.wmnet [08:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1003.eqiad.wmnet [08:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:16] !log depool cp2032 for reimage - T290005 [08:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:21] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:04:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23749 and previous config saved to /var/cache/conftool/dbconfig/20220330-080447-ladsgroup.json [08:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:05:18] (03PS2) 10MMandere: site: Reimage cp2032 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773555 (https://phabricator.wikimedia.org/T290005) [08:05:36] Jaime and I are pairing the group1 promotion [08:07:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host orespoolcounter1004.eqiad.wmnet [08:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:48] (03CR) 10MMandere: [C: 03+2] site: Reimage cp2032 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773555 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [08:08:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P23750 and previous config saved to /var/cache/conftool/dbconfig/20220330-080808-marostegui.json [08:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:18] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.5 [08:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:19] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.5 (duration: 01m 00s) [08:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:31] success! 
[08:10:57] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.reimage: fix message for downtime (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/774882 (owner: 10Volans) [08:11:15] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp2032.codfw.wmnet with OS buster [08:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298557)', diff saved to https://phabricator.wikimedia.org/P23751 and previous config saved to /var/cache/conftool/dbconfig/20220330-081128-marostegui.json [08:11:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:11:30] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2032.codfw.wmnet with OS buster [08:11:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:33] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [08:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23752 and previous config saved to /var/cache/conftool/dbconfig/20220330-081343-ladsgroup.json [08:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:15] (03PS8) 10Ayounsi: Added optional ability to enable 
uRPF filtering on arbitary CR ints [homer/public] - 10https://gerrit.wikimedia.org/r/702446 (https://phabricator.wikimedia.org/T285461) (owner: 10Cathal Mooney) [08:14:17] (03PS5) 10Ayounsi: Apply strict uRPF to the cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) [08:14:52] (03CR) 10Alexandros Kosiaris: [C: 03+1] profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [08:14:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] httpd: Globally enable wmfjson [puppet] - 10https://gerrit.wikimedia.org/r/572702 (owner: 10Alexandros Kosiaris) [08:18:13] (03CR) 10Ayounsi: [C: 03+2] Added optional ability to enable uRPF filtering on arbitary CR ints [homer/public] - 10https://gerrit.wikimedia.org/r/702446 (https://phabricator.wikimedia.org/T285461) (owner: 10Cathal Mooney) [08:18:47] (03Merged) 10jenkins-bot: Added optional ability to enable uRPF filtering on arbitary CR ints [homer/public] - 10https://gerrit.wikimedia.org/r/702446 (https://phabricator.wikimedia.org/T285461) (owner: 10Cathal Mooney) [08:19:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P23753 and previous config saved to /var/cache/conftool/dbconfig/20220330-081952-ladsgroup.json [08:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:50] !log temporarily apply log only RPF filter on eqiad analytics-a [08:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:50] (03PS2) 10Filippo Giunchedi: prometheus: pin ats-tls targets to class not cluster [puppet] - 10https://gerrit.wikimedia.org/r/775251 [08:23:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P23754 and previous config saved to 
/var/cache/conftool/dbconfig/20220330-082314-marostegui.json [08:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:16] (BlazegraphJvmQuakeWarnGC) firing: (8) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [08:28:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23755 and previous config saved to /var/cache/conftool/dbconfig/20220330-082848-ladsgroup.json [08:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:16] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2032.codfw.wmnet with reason: host reimage [08:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:39] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2032.codfw.wmnet with reason: host reimage [08:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P23756 and previous config saved to /var/cache/conftool/dbconfig/20220330-083458-ladsgroup.json [08:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T297189)', diff saved to https://phabricator.wikimedia.org/P23757 and previous config saved to /var/cache/conftool/dbconfig/20220330-083819-marostegui.json [08:38:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:38:22] !log marostegui@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:38:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3001.wikimedia.org [08:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:25] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [08:38:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T297189)', diff saved to https://phabricator.wikimedia.org/P23758 and previous config saved to /var/cache/conftool/dbconfig/20220330-083826-marostegui.json [08:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4001.wikimedia.org [08:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3001.wikimedia.org [08:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23759 and previous config saved to /var/cache/conftool/dbconfig/20220330-084353-ladsgroup.json [08:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:58] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf 
wikis - https://phabricator.wikimedia.org/T298565 [08:44:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:44:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23760 and previous config saved to /var/cache/conftool/dbconfig/20220330-084425-ladsgroup.json [08:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:30] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:14] (03PS2) 10Volans: ipmi: add remove_boot_override, improve force_pxe [software/spicerack] - 10https://gerrit.wikimedia.org/r/774926 (https://phabricator.wikimedia.org/T304434) [08:46:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23761 and previous config saved to /var/cache/conftool/dbconfig/20220330-084633-ladsgroup.json [08:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4001.wikimedia.org [08:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:58] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/774926 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans) [08:48:06] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: fix 
message for downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/774882 (owner: 10Volans) [08:50:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P23762 and previous config saved to /var/cache/conftool/dbconfig/20220330-085003-ladsgroup.json [08:50:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:50:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:09] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:50:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23763 and previous config saved to /var/cache/conftool/dbconfig/20220330-085010-ladsgroup.json [08:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:23] wow that’s a lot of mismatching fields [08:51:02] (03Merged) 10jenkins-bot: sre.hosts.reimage: fix message for downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/774882 (owner: 10Volans) [08:52:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5001.wikimedia.org [08:52:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6001.wikimedia.org [08:52:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:08] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2032.codfw.wmnet with OS buster [08:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:16] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2032.codfw.wmnet with OS buster com... [08:56:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5001.wikimedia.org [08:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6001.wikimedia.org [08:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:05] 10SRE-swift-storage, 10Commons, 10affects-Kiwix-and-openZIM: JPEG image is reported with the wrong mime-type application/octet-stream - https://phabricator.wikimedia.org/T298011 (10TheDJ) For some reasons the file was uploaded to the swift storage engine with incorrect mime type I guess... A shell user for... 
[09:01:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23764 and previous config saved to /var/cache/conftool/dbconfig/20220330-090138-ladsgroup.json [09:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:50] 10SRE-swift-storage, 10Commons, 10Wikimedia-Site-requests, 10affects-Kiwix-and-openZIM: JPEG image is reported with the wrong mime-type application/octet-stream - https://phabricator.wikimedia.org/T298011 (10TheDJ) [09:05:49] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10Joe) a:03Joe Unassigned by mistake, apologies. @dancy can you confirm my understanding... [09:08:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1003.wikimedia.org [09:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:30] !log pool cp2032 with HAProxy as TLS termination layer - T290005 [09:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:35] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:12:03] (03PS1) 10Marostegui: mysql/upgrade.py: Allow buffer pool dumps [cookbooks] - 10https://gerrit.wikimedia.org/r/775260 (https://phabricator.wikimedia.org/T303498) [09:12:22] 10SRE-OnFire, 10DBA, 10Patch-For-Review, 10Sustainability (Incident Followup): Investigate if stopping mysql with buffer_pool dump between 10.4 versions is safe - https://phabricator.wikimedia.org/T303498 (10Marostegui) I haven't been able to replicate the crashes/see more errors when upgrading, so I am co... 
[09:12:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1003.wikimedia.org [09:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:20] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2003.wikimedia.org [09:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23765 and previous config saved to /var/cache/conftool/dbconfig/20220330-091643-ladsgroup.json [09:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:01] (03PS1) 10Giuseppe Lavagetto: deployment_server: fix handling of docker group [puppet] - 10https://gerrit.wikimedia.org/r/775261 [09:17:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2003.wikimedia.org [09:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:11] (03CR) 10Ayounsi: [C: 03+1] ipmi: add remove_boot_override, improve force_pxe (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/774926 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans) [09:23:23] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34620/console" [puppet] - 10https://gerrit.wikimedia.org/r/775261 (owner: 10Giuseppe Lavagetto) [09:24:56] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1004.eqiad.wmnet [09:24:58] !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ores1004.eqiad.wmnet [09:24:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:18] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1004.eqiad.wmnet [09:25:20] !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ores1004.eqiad.wmnet [09:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:45] (JobUnavailable) firing: Reduced availability for job trafficserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:26:20] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1004.eqiad.wmnet [09:26:22] !log klausman@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ores1004.eqiad.wmnet [09:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:39] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1004.eqiad.wmnet [09:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:23] !log ganeti1025:~$ sudo sysctl -w sysctl net.ipv6.conf.analytics.accept_ra=0 - T305034 [09:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:28] T305034: Ganeti hosts use analytics vlan as v6 getaway - https://phabricator.wikimedia.org/T305034 [09:31:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23766 and previous config saved to /var/cache/conftool/dbconfig/20220330-093148-ladsgroup.json [09:31:50] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [09:31:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [09:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:31:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23767 and previous config saved to /var/cache/conftool/dbconfig/20220330-093156-ladsgroup.json [09:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:07] (03CR) 10Volans: "addressed comment" [software/spicerack] - 10https://gerrit.wikimedia.org/r/774926 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans) [09:32:07] !log depool cp2030 for reimage - T290005 [09:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:12] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:33:16] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] deployment_server: fix handling of docker group [puppet] - 10https://gerrit.wikimedia.org/r/775261 (owner: 10Giuseppe Lavagetto) [09:33:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23768 and previous config saved to 
/var/cache/conftool/dbconfig/20220330-093324-ladsgroup.json [09:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23769 and previous config saved to /var/cache/conftool/dbconfig/20220330-093403-ladsgroup.json [09:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:05] (03PS2) 10MMandere: site: Reimage cp2030 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773556 (https://phabricator.wikimedia.org/T290005) [09:35:18] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1004.eqiad.wmnet [09:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:32] (03PS3) 10Volans: ipmi: add remove_boot_override, improve force_pxe [software/spicerack] - 10https://gerrit.wikimedia.org/r/774926 (https://phabricator.wikimedia.org/T304434) [09:35:56] (03CR) 10MMandere: [C: 03+2] site: Reimage cp2030 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773556 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [09:38:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:38:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:41] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp2030.codfw.wmnet with OS buster [09:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:49] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as 
a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2030.codfw.wmnet with OS buster [09:41:01] (03CR) 10Phedenskog: grafana: provision JSON datasource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [09:41:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T297189)', diff saved to https://phabricator.wikimedia.org/P23770 and previous config saved to /var/cache/conftool/dbconfig/20220330-094146-marostegui.json [09:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:51] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [09:43:31] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1005.eqiad.wmnet [09:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1001.wikimedia.org [09:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23771 and previous config saved to /var/cache/conftool/dbconfig/20220330-094829-ladsgroup.json [09:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1001.wikimedia.org [09:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23772 and previous config saved to 
/var/cache/conftool/dbconfig/20220330-094908-ladsgroup.json [09:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:22] (JobUnavailable) firing: (2) Reduced availability for job trafficserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:51:04] (03PS1) 10Muehlenhoff: Failover IDP after reboot [dns] - 10https://gerrit.wikimedia.org/r/775262 [09:51:08] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1005.eqiad.wmnet [09:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:05] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1006.eqiad.wmnet [09:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P23773 and previous config saved to /var/cache/conftool/dbconfig/20220330-095651-marostegui.json [09:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:36] (03CR) 10JMeybohm: [C: 03+1] profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [09:58:36] (03PS1) 10Jelto: gitlab: reduce backup_keep_time to save disk space [puppet] - 10https://gerrit.wikimedia.org/r/775265 (https://phabricator.wikimedia.org/T274463) [09:59:15] (03PS45) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [09:59:24] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (034 comments) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [09:59:39] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2030.codfw.wmnet with reason: host reimage [09:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:22] (JobUnavailable) firing: (2) Reduced availability for job trafficserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:01:12] !log cumin1001:~$ sudo cumin 'ganeti[1005-1028].eqiad.wmnet' 'sysctl -w net.ipv6.conf.analytics.accept_ra=0' - T305034 [10:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:17] T305034: Ganeti hosts use analytics vlan as v6 getaway - https://phabricator.wikimedia.org/T305034 [10:01:20] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34621/console" [puppet] - 10https://gerrit.wikimedia.org/r/775265 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [10:01:32] (03CR) 10jerkins-bot: [V: 04-1] Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [10:03:04] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2030.codfw.wmnet with reason: host reimage [10:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P23774 and previous config saved to /var/cache/conftool/dbconfig/20220330-100333-ladsgroup.json [10:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:03:55] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "I manually verified everything and would +2 this, but can't in this codebase." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773966 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [10:04:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23775 and previous config saved to /var/cache/conftool/dbconfig/20220330-100413-ladsgroup.json [10:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:22] (JobUnavailable) firing: (2) Reduced availability for job trafficserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:06:36] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1006.eqiad.wmnet [10:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300775)', diff saved to https://phabricator.wikimedia.org/P23776 and previous config saved to /var/cache/conftool/dbconfig/20220330-100654-marostegui.json [10:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:00] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [10:08:28] "Request from 88.97.96.89 via cp3052.esams.wmnet, ATS/8.0.8 [10:08:28] Error: 411, Content Length Required at 2022-03-30 10:07:16 GMT" [10:10:22] (JobUnavailable) firing: (2) Reduced availability for job trafficserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:10:24] (03CR) 
10Jcrespo: "typo" [puppet] - 10https://gerrit.wikimedia.org/r/775265 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [10:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:11:34] 10SRE, 10Infrastructure-Foundations, 10netops: Ganeti hosts use analytics vlan as v6 getaway - https://phabricator.wikimedia.org/T305034 (10ayounsi) p:05Medium→03Low a:03MoritzMuehlenhoff After chatting with Moritz I pushed a manual fix and confirmed that the route was gone after the expiring timer. T... [10:11:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P23777 and previous config saved to /var/cache/conftool/dbconfig/20220330-101156-marostegui.json [10:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:01] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [10:13:05] (03CR) 10Jcrespo: "No input here other than please test Bacula backups, too. 
My biggest worry is if frequent dumps are done, bacula could always be copying a" [puppet] - 10https://gerrit.wikimedia.org/r/775265 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [10:13:27] (03PS1) 10Giuseppe Lavagetto: external_clouds_vendors: sort ip ranges [puppet] - 10https://gerrit.wikimedia.org/r/775271 [10:14:29] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/775271 (owner: 10Giuseppe Lavagetto) [10:14:49] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1007.eqiad.wmnet [10:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23778 and previous config saved to /var/cache/conftool/dbconfig/20220330-101839-ladsgroup.json [10:18:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [10:18:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [10:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:18:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23779 and previous config saved to /var/cache/conftool/dbconfig/20220330-101847-ladsgroup.json [10:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:57] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23780 and previous config saved to /var/cache/conftool/dbconfig/20220330-101918-ladsgroup.json [10:19:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:19:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:19:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23781 and previous config saved to /var/cache/conftool/dbconfig/20220330-101931-ladsgroup.json [10:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23782 and previous config saved to /var/cache/conftool/dbconfig/20220330-102138-ladsgroup.json [10:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:22:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P23783 and previous config saved to /var/cache/conftool/dbconfig/20220330-102200-marostegui.json [10:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:49] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1007.eqiad.wmnet [10:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:42] (03PS46) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [10:26:42] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2030.codfw.wmnet with OS buster [10:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:50] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2030.codfw.wmnet with OS buster com... 
[10:27:00] (03PS18) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [10:27:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T297189)', diff saved to https://phabricator.wikimedia.org/P23784 and previous config saved to /var/cache/conftool/dbconfig/20220330-102701-marostegui.json [10:27:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:27:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:07] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [10:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:19] (03PS19) 10Elukey: Add istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [10:27:45] (03PS20) 10Elukey: Add istio-cni plugin configs to ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/773185 (https://phabricator.wikimedia.org/T297612) [10:34:40] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34622/console" [puppet] - 10https://gerrit.wikimedia.org/r/773185 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [10:34:51] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1008.eqiad.wmnet [10:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23785 and previous config saved to 
/var/cache/conftool/dbconfig/20220330-103644-ladsgroup.json [10:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P23786 and previous config saved to /var/cache/conftool/dbconfig/20220330-103705-marostegui.json [10:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:41] !log pool cp2030 with HAProxy as TLS termination layer - T290005 [10:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:48] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:40:46] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1008.eqiad.wmnet [10:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] external_clouds_vendors: sort ip ranges [puppet] - 10https://gerrit.wikimedia.org/r/775271 (owner: 10Giuseppe Lavagetto) [10:48:47] (03PS1) 10JMeybohm: Add controller_sync_error_count metric [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/775277 (https://phabricator.wikimedia.org/T304092) [10:50:02] (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP after reboot [dns] - 10https://gerrit.wikimedia.org/r/775262 (owner: 10Muehlenhoff) [10:51:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23787 and previous config saved to /var/cache/conftool/dbconfig/20220330-105149-ladsgroup.json [10:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:01] !log installing glibc updates from Bullseye 11.3 point release [10:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:10] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300775)', diff saved to https://phabricator.wikimedia.org/P23788 and previous config saved to /var/cache/conftool/dbconfig/20220330-105210-marostegui.json [10:52:13] (03PS1) 10Vivian Rook: upgrade codfw1dev to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/775278 (https://phabricator.wikimedia.org/T304694) [10:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:15] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [10:52:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [10:52:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [10:52:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: Maintenance [10:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: Maintenance [10:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:40] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1009.eqiad.wmnet [10:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:39] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [10:59:06] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host ores1009.eqiad.wmnet [10:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:45] (03CR) 10David Caro: [C: 03+1] upgrade codfw1dev to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/775278 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [11:04:55] (03CR) 10Vivian Rook: [C: 03+2] upgrade codfw1dev to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/775278 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [11:05:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [11:05:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [11:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance [11:06:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance [11:06:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 12 hosts with reason: Maintenance [11:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 12 hosts with reason: Maintenance [11:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23789 and previous config saved to 
/var/cache/conftool/dbconfig/20220330-110654-ladsgroup.json [11:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:06:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:00] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:07:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23790 and previous config saved to /var/cache/conftool/dbconfig/20220330-110701-ladsgroup.json [11:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Apply strict uRPF to the cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [11:10:45] PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:11:07] <_joe_> good grief [11:11:14] sigh [11:11:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "no service comes to mind that could be using some source address trick that could broke as a result of a rpfilter" 
[homer/public] - https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) (owner: Ayounsi) [11:11:31] Deja vu [11:11:46] * volans here [11:12:06] <_joe_> akosiaris: doing a roll restart in eqiad [11:12:10] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: sync [11:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:17] * jayme around [11:12:25] RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:12:27] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: sync [11:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:45] so, this is actually being worked on https://phabricator.wikimedia.org/T291707#7813403 We are waiting for mvolz to merge and deploy https://gerrit.wikimedia.org/r/774848 [11:12:46] <_joe_> I love applying the "kick it until it runs" method with this thing [11:13:15] and then we will have our readiness probe. We'll add a bit more capacity and hopefully never see this again.
[11:13:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23791 and previous config saved to /var/cache/conftool/dbconfig/20220330-111316-ladsgroup.json [11:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:21] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:15:52] !log depool cp2028 for reimage - T290005 [11:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:57] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:16:02] (CR) Ayounsi: [C: +2] Apply strict uRPF to the cloud-hosts vlan [homer/public] - https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) (owner: Ayounsi) [11:16:48] (Merged) jenkins-bot: Apply strict uRPF to the cloud-hosts vlan [homer/public] - https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) (owner: Ayounsi) [11:17:40] (PS2) MMandere: site: Reimage cp2028 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773557 (https://phabricator.wikimedia.org/T290005) [11:19:09] !log apply urpf strict filter to eqiad cloud-hosts vlan - T285461 [11:19:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23792 and previous config saved to /var/cache/conftool/dbconfig/20220330-111911-ladsgroup.json [11:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:14] T285461: Review filtering for cloud-hosts on CR routers eqiad -
https://phabricator.wikimedia.org/T285461 [11:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:19:29] (CR) MMandere: [C: +2] site: Reimage cp2028 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/773557 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [11:23:25] (PS1) Ayounsi: Enable urpf strict on codfw cloud-hosts [homer/public] - https://gerrit.wikimedia.org/r/775279 (https://phabricator.wikimedia.org/T285461) [11:24:27] (CR) Arturo Borrero Gonzalez: [C: +1] Enable urpf strict on codfw cloud-hosts [homer/public] - https://gerrit.wikimedia.org/r/775279 (https://phabricator.wikimedia.org/T285461) (owner: Ayounsi) [11:24:36] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp2028.codfw.wmnet with OS buster [11:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:45] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2028.codfw.wmnet with OS buster [11:25:23] (CR) Ayounsi: [C: +2] Enable urpf strict on codfw cloud-hosts [homer/public] - https://gerrit.wikimedia.org/r/775279 (https://phabricator.wikimedia.org/T285461) (owner: Ayounsi) [11:26:01] (Merged) jenkins-bot: Enable urpf strict on codfw cloud-hosts [homer/public] - https://gerrit.wikimedia.org/r/775279 (https://phabricator.wikimedia.org/T285461) (owner: Ayounsi) [11:27:15] (PS8) Klausman: hiera: Add ML staging k8s role
[puppet] - https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [11:27:20] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:28:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23793 and previous config saved to /var/cache/conftool/dbconfig/20220330-112821-ladsgroup.json [11:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:22] (JobUnavailable) firing: (2) Reduced availability for job trafficserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:30:36] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Review filtering for cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 (ayounsi) Open→Resolved a:ayounsi All done here!
[11:30:49] !log updating libapache2-mod-auth-cas on buster hosts [11:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23794 and previous config saved to /var/cache/conftool/dbconfig/20220330-113416-ladsgroup.json [11:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:41] (PS1) Ayounsi: analytics1-a-eqiad: replace firewall filter with strict uRPF [homer/public] - https://gerrit.wikimedia.org/r/775280 (https://phabricator.wikimedia.org/T298087) [11:35:43] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for ipmiseld [puppet] - https://gerrit.wikimedia.org/r/775281 (https://phabricator.wikimedia.org/T135991) [11:35:58] (PS2) Muehlenhoff: Enable profile::auto_restarts::service for ipmiseld [puppet] - https://gerrit.wikimedia.org/r/775281 (https://phabricator.wikimedia.org/T135991) [11:37:53] (CR) Ladsgroup: [C: +1] filtered_tables.txt: Remove ft_title and ft_namespace [puppet] - https://gerrit.wikimedia.org/r/775253 (https://phabricator.wikimedia.org/T297189) (owner: Marostegui) [11:38:01] (CR) Marostegui: [C: +2] filtered_tables.txt: Remove ft_title and ft_namespace [puppet] - https://gerrit.wikimedia.org/r/775253 (https://phabricator.wikimedia.org/T297189) (owner: Marostegui) [11:39:38] SRE, Infrastructure-Foundations, netops: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (cmooney) [11:40:10] (CR) Ayounsi: "Example diff on cr1-eqiad:" [homer/public] - https://gerrit.wikimedia.org/r/775280 (https://phabricator.wikimedia.org/T298087) (owner: Ayounsi) [11:42:58] (CR) Filippo Giunchedi: [C: +1] Enable profile::auto_restarts::service for ipmiseld [puppet] - https://gerrit.wikimedia.org/r/775281
(https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [11:43:02] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2028.codfw.wmnet with reason: host reimage [11:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23795 and previous config saved to /var/cache/conftool/dbconfig/20220330-114326-ladsgroup.json [11:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:01] (CR) Ladsgroup: [C: +1] wmnet: Update s5-master CNAME [dns] - https://gerrit.wikimedia.org/r/775196 (https://phabricator.wikimedia.org/T303798) (owner: Marostegui) [11:44:48] (CR) Ladsgroup: [C: +1] mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/775195 (https://phabricator.wikimedia.org/T303798) (owner: Marostegui) [11:45:45] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2028.codfw.wmnet with reason: host reimage [11:45:46] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for ulogd2 [puppet] - https://gerrit.wikimedia.org/r/775282 (https://phabricator.wikimedia.org/T135991) [11:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:19] SRE-tools, DBA, Infrastructure-Foundations, Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Ladsgroup) I'm not following what you mean by USAGE 😅 Can you elaborate?
[11:47:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [11:47:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [11:47:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 6 hosts with reason: Maintenance [11:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 6 hosts with reason: Maintenance [11:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:36] (03CR) 10Jaime Nuche: [C: 03+1] scap: make rsync use new compress algorithm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774824 (https://phabricator.wikimedia.org/T252540) (owner: 10Hashar) [11:48:36] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) A script usage to show when the script is executed without... [11:49:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23796 and previous config saved to /var/cache/conftool/dbconfig/20220330-114921-ladsgroup.json [11:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:25] ACKNOWLEDGEMENT - MD RAID on cp2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.24. 
Check system logs on 10.192.0.24 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T305047 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:49:29] 10SRE, 10ops-codfw: Degraded RAID on cp2028 - https://phabricator.wikimedia.org/T305047 (10ops-monitoring-bot) [11:50:09] (03CR) 10Btullis: "I have done some work to verify the IP addresses that are contained within this CR." [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [11:58:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23797 and previous config saved to /var/cache/conftool/dbconfig/20220330-115831-ladsgroup.json [11:58:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:58:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:58:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23798 and previous config saved to /var/cache/conftool/dbconfig/20220330-115839-ladsgroup.json [11:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:00] (03PS2) 10Ladsgroup: dbtools: Add master_finder.py [software] - 10https://gerrit.wikimedia.org/r/774585 (https://phabricator.wikimedia.org/T281249) [12:00:25] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) I see. Done now. Can you take a look? [12:02:30] (03CR) 10Marostegui: [C: 03+1] "<3" [software] - 10https://gerrit.wikimedia.org/r/774585 (https://phabricator.wikimedia.org/T281249) (owner: 10Ladsgroup) [12:03:40] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) Btw confirmed it works fine when the master is dead: ` Orde... [12:04:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23799 and previous config saved to /var/cache/conftool/dbconfig/20220330-120426-ladsgroup.json [12:04:28] (03CR) 10Ladsgroup: [C: 03+2] "\o/" [software] - 10https://gerrit.wikimedia.org/r/774585 (https://phabricator.wikimedia.org/T281249) (owner: 10Ladsgroup) [12:04:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:04:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:04:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:32] T298565: Fix mismatching field type of 
user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:04:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23800 and previous config saved to /var/cache/conftool/dbconfig/20220330-120439-ladsgroup.json [12:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:56] (03Merged) 10jenkins-bot: dbtools: Add master_finder.py [software] - 10https://gerrit.wikimedia.org/r/774585 (https://phabricator.wikimedia.org/T281249) (owner: 10Ladsgroup) [12:05:08] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2028.codfw.wmnet with OS buster [12:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:17] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2028.codfw.wmnet with OS buster com... 
[12:06:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23801 and previous config saved to /var/cache/conftool/dbconfig/20220330-120646-ladsgroup.json [12:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:56] SRE-tools, DBA, Infrastructure-Foundations, Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Ladsgroup) Open→Resolved Let's call this done. I'll pick up {T1... [12:07:29] (CR) Ladsgroup: [C: +2] mysql/upgrade.py: Allow buffer pool dumps [cookbooks] - https://gerrit.wikimedia.org/r/775260 (https://phabricator.wikimedia.org/T303498) (owner: Marostegui) [12:07:35] SRE-tools, DBA, Infrastructure-Foundations, Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Marostegui) Many thanks for working on this!
<3 [12:10:12] (Merged) jenkins-bot: mysql/upgrade.py: Allow buffer pool dumps [cookbooks] - https://gerrit.wikimedia.org/r/775260 (https://phabricator.wikimedia.org/T303498) (owner: Marostegui) [12:12:37] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate if stopping mysql with buffer_pool dump between 10.4 versions is safe - https://phabricator.wikimedia.org/T303498 (Marostegui) Open→Resolved Script merged [12:12:39] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:13:01] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:14:10] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for prometheus-blackbox-exporter [puppet] - https://gerrit.wikimedia.org/r/775288 (https://phabricator.wikimedia.org/T135991) [12:16:53] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:20:00] SRE, Generated Data Platform, Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (WDoranWMF) [12:21:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23802 and previous config saved to /var/cache/conftool/dbconfig/20220330-122151-ladsgroup.json [12:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:17] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:27:24] !log pool cp2028 with HAProxy as TLS termination layer - T290005 [12:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:29] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:29:51] (PS1) Zabe: Revert "OATHUserRepository: Stop handling legacy single-key" [extensions/OATHAuth] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/774996 (https://phabricator.wikimedia.org/T305029) [12:30:32] hashar: you wanna deploy the revert?
[12:32:08] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:32:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T297189)', diff saved to https://phabricator.wikimedia.org/P23804 and previous config saved to /var/cache/conftool/dbconfig/20220330-123249-marostegui.json [12:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:58] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [12:35:19] (03CR) 10Phuedx: [C: 03+1] Remove wgWMEIPAddressCopyActionEnabled from Beta and production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774904 (https://phabricator.wikimedia.org/T296469) (owner: 10Tchanders) [12:36:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23805 and previous config saved to /var/cache/conftool/dbconfig/20220330-123656-ladsgroup.json [12:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance [12:39:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance [12:39:23] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:39:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298557)', diff saved to https://phabricator.wikimedia.org/P23806 and previous config saved to /var/cache/conftool/dbconfig/20220330-123931-marostegui.json [12:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:52] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:41:28] !log start of templatelinks backfill on s3 (T299424) [12:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:34] T299424: Run maintenance script backfilling tl_title_id - https://phabricator.wikimedia.org/T299424 [12:48:07] zabe: I will deploy the fix later today [12:48:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] Move miscweb back to state monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/774916 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:49:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23807 and previous config saved to /var/cache/conftool/dbconfig/20220330-124908-ladsgroup.json [12:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:15] T298565: Fix 
mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:50:28] (03CR) 10JMeybohm: [C: 03+2] Move miscweb back to state monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/774916 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:51:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] Move miscweb back to state production [puppet] - 10https://gerrit.wikimedia.org/r/774917 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:51:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove LVS for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:52:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23808 and previous config saved to /var/cache/conftool/dbconfig/20220330-125201-ladsgroup.json [12:52:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:52:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:52:07] (03PS1) 10Ladsgroup: Enable videojs on all of DIP wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775294 (https://phabricator.wikimedia.org/T248418) [12:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:52:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:52:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [12:52:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [12:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [12:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [12:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23809 and previous config saved to /var/cache/conftool/dbconfig/20220330-125239-ladsgroup.json [12:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:48] (03PS1) 
10Muehlenhoff: Move Prometheus Apache setup to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/775296 [12:54:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23810 and previous config saved to /var/cache/conftool/dbconfig/20220330-125447-ladsgroup.json [12:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:54:54] (03CR) 10Ottomata: "Couple of Qs on the network policy stuff, but a quick glance overall the rules seem to match the intention 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [12:56:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/775296 (owner: 10Muehlenhoff) [12:57:51] ok [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220330T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. 
[13:03:19] yay [13:04:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P23811 and previous config saved to /var/cache/conftool/dbconfig/20220330-130413-ladsgroup.json [13:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23812 and previous config saved to /var/cache/conftool/dbconfig/20220330-130952-ladsgroup.json [13:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:32] (03CR) 10BBlack: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [13:11:40] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/775298 (https://phabricator.wikimedia.org/T135991) [13:11:44] (03CR) 10BBlack: [C: 03+1] Add *.k8s-staging.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/763717 (https://phabricator.wikimedia.org/T300740) (owner: 10JMeybohm) [13:13:46] (03PS47) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [13:14:59] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd2001.codfw.wmnet [13:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd2001.codfw.wmnet [13:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P23813 and previous config saved to 
/var/cache/conftool/dbconfig/20220330-131918-ladsgroup.json [13:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:27] (03PS8) 10JMeybohm: Remove LVS for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) [13:20:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T297189)', diff saved to https://phabricator.wikimedia.org/P23814 and previous config saved to /var/cache/conftool/dbconfig/20220330-132033-marostegui.json [13:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:39] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [13:22:15] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd2002.codfw.wmnet [13:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:13] (03CR) 10JMeybohm: [C: 03+2] Remove LVS for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/770504 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [13:24:16] (03CR) 10Herron: [C: 03+1] Enable profile::auto_restarts::service for ipmiseld [puppet] - 10https://gerrit.wikimedia.org/r/775281 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:24:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23815 and previous config saved to /var/cache/conftool/dbconfig/20220330-132457-ladsgroup.json [13:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd2002.codfw.wmnet [13:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:57] PROBLEM - Host miscweb.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [13:28:05] this is me 
[13:28:40] seems like we did not avoid the page after all ? [13:28:50] it probably needed to run on icinga first [13:28:51] that should not have paged [13:29:01] PROBLEM - Host miscweb.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [13:29:02] here [13:29:08] oh, it did...great [13:29:11] (meaning puppet agent, on the icinga host) [13:29:12] sorry folks [13:29:16] I did [13:29:26] (run puppet on icinga) [13:29:27] jayme: no worries, I'll leave you to it [13:29:32] ack [13:29:32] we did try to avoid that page, sigh [13:29:50] I wonder why it paged though [13:30:04] monitoring_setup as a state should not be paging ... [13:30:06] the alert has priority=SPITE set [13:30:11] lol [13:30:27] but it did :) [13:30:28] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.58:4111]) https://wikitech.wikimedia.org/wiki/PyBal [13:30:34] (03PS1) 10BBlack: discovery: add drmrs IP [cookbooks] - 10https://gerrit.wikimedia.org/r/775301 [13:30:43] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Routinator [puppet] - 10https://gerrit.wikimedia.org/r/775302 (https://phabricator.wikimedia.org/T135991) [13:30:51] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd2003.codfw.wmnet [13:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:09] * jayme acked in VO [13:31:26] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/775301 (owner: 10BBlack) [13:31:51] (03CR) 10BBlack: [C: 03+2] discovery: add drmrs IP [cookbooks] - 10https://gerrit.wikimedia.org/r/775301 (owner: 10BBlack) [13:32:05] there's a sequence of puppet runs right, lvs hosts to update the exported resources then icinga host IIRC [13:32:37] (03CR) 10BBlack: [C: 03+1] "Good stuff, looks helpful, thank you!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/769438 (owner: 10Volans) [13:32:39] Hi there, I was wondering if I could get a beta-only backport in during this window. I didn't get it on the schedule in advance, so if I just need to go add it to the next window I can make, that's what I'll do. [13:32:46] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.58:4111]) https://wikitech.wikimedia.org/wiki/PyBal [13:32:46] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:33:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd2003.codfw.wmnet [13:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:45] !log restarting pybal on lvs1020 and lvs2010 [13:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:03] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/774948 is the change in question. I went to go add it, and then I realized there was a backport window happening now. 
[13:34:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23816 and previous config saved to /var/cache/conftool/dbconfig/20220330-133423-ladsgroup.json [13:34:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:34:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:34:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:34:32] (03Merged) 10jenkins-bot: discovery: add drmrs IP [cookbooks] - 10https://gerrit.wikimedia.org/r/775301 (owner: 10BBlack) [13:34:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23817 and previous config saved to /var/cache/conftool/dbconfig/20220330-133436-ladsgroup.json [13:34:40] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:34:41] JSherman: should be okay to do now, I think; do you want to self-serve or do you need someone to deploy? 
[13:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:08] That's good to hear; I need someone to deploy for me. [13:35:38] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.58:4111]) https://wikitech.wikimedia.org/wiki/PyBal [13:35:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P23818 and previous config saved to /var/cache/conftool/dbconfig/20220330-133538-marostegui.json [13:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:52] I'll add it into this backport window on the calendar so we have a record [13:35:52] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.58:4111]) https://wikitech.wikimedia.org/wiki/PyBal [13:36:41] !log restarting pybal on lvs1019 and lvs2009 [13:36:44] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:24] ok, I can deploy it then [13:37:31] thank you! 
[13:37:48] (03PS3) 10JMeybohm: Move miscweb from it's own LVS VIP to k8s-ingress-wikikube [dns] - 10https://gerrit.wikimedia.org/r/770506 (https://phabricator.wikimedia.org/T290966) [13:38:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:38:58] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:40:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23819 and previous config saved to /var/cache/conftool/dbconfig/20220330-134002-ladsgroup.json [13:40:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [13:40:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [13:40:07] (03CR) 10JMeybohm: [C: 03+2] Move miscweb from it's own LVS VIP to k8s-ingress-wikikube [dns] - 10https://gerrit.wikimedia.org/r/770506 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [13:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:40:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23820 and previous config saved to 
/var/cache/conftool/dbconfig/20220330-134010-ladsgroup.json [13:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:48] (03CR) 10Lucas Werkmeister (WMDE): Add surveys to enwiki on beta for QA (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774948 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [13:41:54] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:41:59] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [13:42:04] ok, now it’s in the calendar [13:43:32] JSherman: is it okay to deploy when one of the “soft-depends on” changes is still open? [13:43:37] (03CR) 10Volans: [C: 03+2] ipmi: add remove_boot_override, improve force_pxe [software/spicerack] - 10https://gerrit.wikimedia.org/r/774926 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans) [13:43:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:44:45] (03PS2) 10JMeybohm: Move miscweb back to state production [puppet] - 10https://gerrit.wikimedia.org/r/774917 (https://phabricator.wikimedia.org/T290966) [13:44:56] Yeah, it won't hurt anything. I expect the surveys to just not use the config elements for bits that aren't merged in yet. 
[13:45:21] ok [13:45:39] hm, but another thing [13:46:02] hashar: /srv/mediawiki-staging on deploy1002 is currently one commit ahead of upstream (group1 wikis to wmf.5) [13:46:11] is it okay if I merge a config change (beta-only) and rebase it? [13:46:17] (and then sync the -labs.php file just to be sure) [13:47:02] (03CR) 10Herron: [V: 03+2 C: 03+2] slo: Move most of the text panel content to a description field, so it can be overridden [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [13:47:03] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2001.codfw.wmnet [13:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23821 and previous config saved to /var/cache/conftool/dbconfig/20220330-134737-ladsgroup.json [13:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:48:08] (03PS2) 10Sbisson: Add Wikistories extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773212 (https://phabricator.wikimedia.org/T303004) [13:50:01] 10SRE, 10Performance-Team, 10Traffic, 10Performance-Team-publish, 10Sustainability (Incident Followup): Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894 (10Krinkle) [13:50:33] (03Merged) 10jenkins-bot: ipmi: add remove_boot_override, improve force_pxe [software/spicerack] - 10https://gerrit.wikimedia.org/r/774926 (https://phabricator.wikimedia.org/T304434) (owner: 10Volans) [13:50:35] hm, 
the group1 to wmf.5 bump was already logged https://sal.toolforge.org/log/eiHg2X8B8Fs0LHO5JKiq [13:50:44] and seems to be in effect too, e.g. https://www.wikidata.org/wiki/Special:Version shows wmf.5 [13:50:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P23822 and previous config saved to /var/cache/conftool/dbconfig/20220330-135044-marostegui.json [13:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:51:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:10] but it’s not in operations/mediawiki-config.git yet [13:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:12] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:51:15] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:22] Lucas_WMDE: If the rebase gets messy for my change, I could rework it on my end. Also, looks like that other soft-depends just got merged. 
[13:51:55] JSherman: it’s not a problem with your change specifically, but I think I’ll decline deploying it now, sorry [13:51:58] (03CR) 10Elukey: [V: 03+1 C: 03+2] Add istio-cni plugin configs to ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/773185 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [13:52:03] well, +2ed anyway [13:52:04] the deployment server is in an unexpected state and I don’t want to risk messing it up [13:52:16] hopefully it’ll all be resolved in time for the next window [13:52:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2001.codfw.wmnet [13:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:39] Lucas_WMDE: Understood! I'll just move it to the next window; this was an opportunistic request anyhow. Thanks for looking at it! [13:52:42] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2002.codfw.wmnet [13:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:46] ok! [13:54:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Not deployed during today’s UTC afternoon backport+config window due to unexpected Git state on deploy1002, but LGTM and should be good to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774948 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [13:55:45] !log stopping orchestrator for backend move T301315 [13:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:51] T301315: Move orchestrator from db2093 to db1115 - https://phabricator.wikimedia.org/T301315 [13:56:33] (03CR) 10Kormat: [V: 03+1 C: 03+2] orchestrator: Switch to db1115 as backend. 
[puppet] - 10https://gerrit.wikimedia.org/r/774485 (https://phabricator.wikimedia.org/T301315) (owner: 10Kormat) [13:59:24] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [13:59:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2002.codfw.wmnet [13:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:36] jayme: the icinga config issue is because of Could not find any host matching 'miscweb.svc.eqiad.wmnet' [14:00:43] references by some nagios_service [14:01:06] like check_https_lvs_on_port!miscweb.discovery.wmnet!4111!/ [14:01:13] hmm...yeah. I was staring at it for a moment now [14:01:44] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2003.codfw.wmnet [14:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P23823 and previous config saved to /var/cache/conftool/dbconfig/20220330-140242-ladsgroup.json [14:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:49] volans: is that generated from the lvs stanza? 
[14:03:06] this is the change I've applied (for context) https://gerrit.wikimedia.org/r/c/operations/puppet/+/770504/8 [14:05:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T297189)', diff saved to https://phabricator.wikimedia.org/P23824 and previous config saved to /var/cache/conftool/dbconfig/20220330-140549-marostegui.json [14:05:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:05:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:54] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [14:05:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T297189)', diff saved to https://phabricator.wikimedia.org/P23825 and previous config saved to /var/cache/conftool/dbconfig/20220330-140556-marostegui.json [14:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:02] jayme: yes, line 1325 [14:06:05] just below your changes [14:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:14] (03CR) 10Muehlenhoff: [C: 03+2] coal: use Python 3, add cachelib dependency [puppet] - 10https://gerrit.wikimedia.org/r/774512 (https://phabricator.wikimedia.org/T301638) (owner: 10Dave Pifke) [14:07:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2003.codfw.wmnet [14:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:25] (03PS2) 10Volans: sre.cdn.roll-restart-varnish: add a new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/769438 [14:09:15] volans: I think I don't 
get it. Why would removing the lvs stanza remove a host from (from icinga config only - I guess)? [14:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:10:59] Lucas_WMDE: sorry I am busy this afternoon. Looks like our new way to promote `scap deploy-promote` does not fetch from gerrit after the wikiversions.json change has been merged :) [14:11:09] jnuche: ^ a tiny bug in `scap deploy-promote` [14:11:11] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db2093.codfw.wmnet with OS bullseye [14:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:32] isn’t it a push that’s missing? [14:11:44] I don’t see the wmf.1 commit in my local clone of the repo either [14:11:47] yeah apparently [14:11:58] (03PS1) 10Hashar: group1 wikis to 1.39.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775306 [14:12:19] we have a new version of scap which is now providing a subcommand to update the wikiversions.json [14:12:26] it is missing the push [14:12:29] hi, yes, we need a new release, the deploy-promote scap version currently deployed is not ready for prime time yet [14:12:31] jayme: I'm not familiar with teh puppet abstraction around service::catalog [14:12:40] I think it comes from modules/service/manifests/monitor.pp [14:12:43] or something went wrong in the code. Maybe cause on the first attempt I did not have an ssh-agent so the push would have failed [14:12:59] that uses get_services_for('monitoring') [14:13:34] (03CR) 10Hashar: [C: 03+2] "Jaime and I generated that change this morning and did the deploy. `scap deploy-promote` forgot to send it back to Gerrit." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/775306 (owner: 10Hashar) [14:13:44] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Papaul) @RobH I took a quick look at this yesterday, no luck. Since it is a new product i will recommend getting Dell help maybe this will save us time. [14:13:45] Lucas_WMDE: sorry for the back port config mess :\ [14:13:51] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2004.codfw.wmnet [14:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:13] no problem, I think the missed change wasn’t urgent [14:14:13] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775306 (owner: 10Hashar) [14:14:15] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Papaul) a:05Papaul→03RobH [14:15:32] volans: Maybe I'm unable to ask the right questions here. Let me try again: What does icinga mean by "Could not find any host matching"? DNS wise everything seems fine. So I guess it's a config object in icinga that is missing? 
[14:15:33] (03CR) 10Volans: [C: 03+2] sre.cdn.roll-restart-varnish: add a new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/769438 (owner: 10Volans) [14:15:50] !log deploy1002: `git fetch && git rebase` to catchup with `group1 wikis to 1.39.0-wmf.5` commit which did not get send to Gerrit but got deployed earlier today [14:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:55] fixed [14:15:56] jayme: ah sorry, that I can answer [14:16:32] there are 2 errors (you can run sudo icinga -v /etc/icinga/icinga.cfg) [14:16:35] Could not find any host matching 'miscweb.svc.eqiad.wmnet' (config file '/etc/nagios/nagios_service.cfg', starting on line 23347) [14:16:38] Could not expand hostgroups and/or hosts specified in service (config file '/etc/nagios/nagios_service.cfg', starting on line 23347) [14:17:12] the first one is because a check is defined as belonging to a host with hostname 'miscweb.svc.eqiad.wmnet', but there is no host defined with that hostname apparently [14:17:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P23826 and previous config saved to /var/cache/conftool/dbconfig/20220330-141747-ladsgroup.json [14:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:49] (03Merged) 10jenkins-bot: sre.cdn.roll-restart-varnish: add a new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/769438 (owner: 10Volans) [14:19:16] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2004.codfw.wmnet [14:19:19] !log installing remaining tiff security updates [14:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:43] volans: ah, okay - thanks. 
So it is actually missing a piece in /etc/nagios/nagios_host.cfg [14:20:09] yes [14:20:28] that I think it's usually defined in modules/service/manifests/monitor.pp [14:20:35] but not 100% sure [14:20:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:02] ack - looking. thanks! [14:21:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:21:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:47] "NOTE: We skip creating hosts for non LVS based services, but rather assume they are created via other means" .. yeah :) [14:22:57] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2005.codfw.wmnet [14:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:13] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1001.eqiad.wmnet [14:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:17] 10SRE, 10ops-codfw: Dell swiches testiong: Setup mgmt for two servers for testing - https://phabricator.wikimedia.org/T305070 (10Papaul) [14:25:33] 10SRE, 10ops-codfw: Dell swiches testiong: Setup mgmt for two servers for testing - https://phabricator.wikimedia.org/T305070 (10Papaul) p:05Triage→03Medium a:05ayounsi→03Papaul [14:26:13] jayme: lol, that's it! 
[14:26:21] probably, yes :D [14:26:43] I'm inclined to add "lvs: {}" but I'm also very afraid [14:27:44] yeah I guess I didn't really understand the first patch [14:27:47] there is a lot of magic derived from service::catalog, I'm not familiar with all of it [14:27:53] why are we keeping the lvs service def at all? [14:28:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298557)', diff saved to https://phabricator.wikimedia.org/P23827 and previous config saved to /var/cache/conftool/dbconfig/20220330-142823-marostegui.json [14:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:29] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [14:28:34] bblack: the definition in service.yaml you mean? [14:28:43] (CR) Ottomata: Add helm charts and a helmfile configuration for datahub (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis) [14:28:59] I would actually like to keep the monitoring/probes part as well as the dnsdisc part [14:29:28] but ... there's no service there, right? 
[14:29:41] that's what I'm getting lost on [14:29:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2005.codfw.wmnet [14:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:53] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2006.codfw.wmnet [14:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:08] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2093.codfw.wmnet with reason: host reimage [14:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1001.eqiad.wmnet [14:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:51] there is. But that service is reachable via the LVS of k8s-ingress-wikikube (which is an envoy proxy forwarding traffic to miscweb inside the k8s clusters if SNI is miscweb.discovers.w) [14:31:34] on a different port though, right? [14:31:37] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1002.eqiad.wmnet [14:31:41] so at the LVS level, it's still separate [14:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:48] please ignore the port 4111 - I definitely messed up. That needs to be 30443 [14:32:01] ok, that explains my confusion! [14:32:08] sorry for that [14:32:11] so... 
it's the same service, at the LVS layer [14:32:18] yes [14:32:22] it's an L4 balancer, it only knows IPs and ports really [14:32:32] same ip's same port [14:32:49] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2093.codfw.wmnet with reason: host reimage [14:32:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23828 and previous config saved to /var/cache/conftool/dbconfig/20220330-143252-ladsgroup.json [14:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [14:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [14:32:58] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:32:59] also means they can't be failed over between DCs for discovery independently, either [14:32:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [14:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [14:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:25] (03PS1) 10Giuseppe Lavagetto: varnish: update comment in dynamic actions [puppet] - 10https://gerrit.wikimedia.org/r/775315 [14:33:27] (03PS1) 10Giuseppe Lavagetto: requestctl::client: install preview files for actions [puppet] - 10https://gerrit.wikimedia.org/r/775316 [14:33:34] (03CR) 10Ayounsi: [C: 03+1] Enable profile::auto_restarts::service for Routinator [puppet] - 10https://gerrit.wikimedia.org/r/775302 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:33:35] so all that's really left is monitoring some extra SNI/Host values? [14:34:25] bblack: AIUI if k8s-ingress-wikikube fails over to a DC, miscweb has to follow. But if miscweb fails over k8s-ingress-wikikube does not have to follow [14:34:32] (03CR) 10CDanis: [C: 03+1] varnish: update comment in dynamic actions [puppet] - 10https://gerrit.wikimedia.org/r/775315 (owner: 10Giuseppe Lavagetto) [14:34:42] just by pooling/depooling via dnsdisc [14:35:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2006.codfw.wmnet [14:35:09] well, it doesn't really match the model, though. Nothing's going to enforce the rule that miscweb has to follow. [14:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1002.eqiad.wmnet [14:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:14] bblack: thats right. 
That is something that needs to be taken care of by cookbooks, for example [14:36:49] the idea is that we don't want to lose the functionality of being able to decide which DC to pool for services that are running behind Ingress [14:36:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2001.wikimedia.org [14:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:23] jayme: let's move to a less-noisy channel, this is getting complicated :) [14:37:33] +1 [14:40:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23829 and previous config saved to /var/cache/conftool/dbconfig/20220330-144023-ladsgroup.json [14:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:30] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:43:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P23830 and previous config saved to /var/cache/conftool/dbconfig/20220330-144328-marostegui.json [14:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2001.wikimedia.org [14:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:46] PROBLEM - Check systemd state on idp2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:49] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for kpropd [puppet] - 
10https://gerrit.wikimedia.org/r/775318 (https://phabricator.wikimedia.org/T135991) [14:47:43] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1001.eqiad.wmnet [14:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:43] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2007.codfw.wmnet [14:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:56] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2093.codfw.wmnet with OS bullseye [14:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pybal-test2001.codfw.wmnet [14:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:10] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for kpropd [puppet] - 10https://gerrit.wikimedia.org/r/775318 (https://phabricator.wikimedia.org/T135991) [14:55:11] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1001.eqiad.wmnet [14:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23831 and previous config saved to /var/cache/conftool/dbconfig/20220330-145529-ladsgroup.json [14:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:58] RECOVERY - Check systemd state on idp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2007.codfw.wmnet [14:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:08] (03PS1) 
10JMeybohm: Remove monitoring from kubernetes miscweb for now [puppet] - 10https://gerrit.wikimedia.org/r/775319 (https://phabricator.wikimedia.org/T290966) [14:56:22] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2008.codfw.wmnet [14:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2001.codfw.wmnet [14:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P23833 and previous config saved to /var/cache/conftool/dbconfig/20220330-145833-marostegui.json [14:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:22] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1002.eqiad.wmnet [14:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:13] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34625/console" [puppet] - 10https://gerrit.wikimedia.org/r/775319 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:01:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2008.codfw.wmnet [15:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:25] (03PS1) 10Steven Sun: Revert Simplified Chinese logo of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775320 (https://phabricator.wikimedia.org/T276694) [15:05:20] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1002.eqiad.wmnet [15:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:11] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1175 (T297189)', diff saved to https://phabricator.wikimedia.org/P23834 and previous config saved to /var/cache/conftool/dbconfig/20220330-150611-marostegui.json [15:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:16] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [15:06:33] 10SRE-OnFire, 10Discovery-Search, 10Wikidata, 10wdwb-tech, and 2 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10dcausse) [15:06:47] (03PS1) 10MMandere: site: Reimage cp3056 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775321 (https://phabricator.wikimedia.org/T290005) [15:06:52] (03PS1) 10MMandere: site: Reimage cp4029 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775322 (https://phabricator.wikimedia.org/T290005) [15:06:54] (03CR) 10Volans: [C: 03+1] "LGTM, AFAIK there shouldn't be other unwanted consequences" [puppet] - 10https://gerrit.wikimedia.org/r/775319 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:06:57] (03PS1) 10MMandere: site: Reimage cp3057 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775323 (https://phabricator.wikimedia.org/T290005) [15:07:02] (03PS1) 10MMandere: site: Reimage cp4023 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775324 (https://phabricator.wikimedia.org/T290005) [15:07:04] (03PS1) 10MMandere: site: Reimage cp5009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775325 [15:07:06] (03PS1) 10MMandere: site: Reimage cp6016 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775326 (https://phabricator.wikimedia.org/T290005) [15:07:08] (03PS1) 10MMandere: site: Reimage cp5003 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775327 (https://phabricator.wikimedia.org/T290005) [15:07:11] (03PS1) 10MMandere: 
site: Reimage cp6008 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775328 (https://phabricator.wikimedia.org/T290005) [15:07:20] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Remove monitoring from kubernetes miscweb for now [puppet] - 10https://gerrit.wikimedia.org/r/775319 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:09:18] (03CR) 10jerkins-bot: [V: 04-1] site: Reimage cp5009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775325 (owner: 10MMandere) [15:09:33] 10SRE-OnFire, 10Discovery-Search, 10Wikidata, 10wdwb-tech, and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10ItamarWMDE) [15:10:01] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1003.eqiad.wmnet [15:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23835 and previous config saved to /var/cache/conftool/dbconfig/20220330-151034-ladsgroup.json [15:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:12:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/775318 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:13:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298557)', diff saved to 
https://phabricator.wikimedia.org/P23836 and previous config saved to /var/cache/conftool/dbconfig/20220330-151338-marostegui.json [15:13:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance [15:13:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance [15:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:45] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [15:13:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298557)', diff saved to https://phabricator.wikimedia.org/P23837 and previous config saved to /var/cache/conftool/dbconfig/20220330-151346-marostegui.json [15:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:25] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [15:15:50] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1003.eqiad.wmnet [15:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:43] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores2009.codfw.wmnet [15:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:04] (03PS1) 10Kormat: mariadb: Use ROW binlog_format for db_inventory. 
[puppet] - 10https://gerrit.wikimedia.org/r/775330 (https://phabricator.wikimedia.org/T301315) [15:17:32] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-client1001.eqiad.wmnet [15:17:33] (03PS2) 10Kormat: mariadb: Use ROW binlog_format for db_inventory. [puppet] - 10https://gerrit.wikimedia.org/r/775330 (https://phabricator.wikimedia.org/T301315) [15:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:08] (03PS2) 10MMandere: site: Reimage cp5009 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775325 (https://phabricator.wikimedia.org/T290005) [15:18:18] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [15:18:38] (03CR) 10Herron: admin: add tsepothoabala to deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [15:19:26] 10SRE, 10ops-codfw, 10Traffic: Degraded RAID on cp2028 - https://phabricator.wikimedia.org/T305047 (10herron) p:05Triage→03High [15:20:05] yay, icinga back happy, thanks jayme! 
[15:20:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-client1001.eqiad.wmnet [15:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P23838 and previous config saved to /var/cache/conftool/dbconfig/20220330-152116-marostegui.json [15:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores2009.codfw.wmnet [15:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:47] (03PS1) 10Volans: cookbooks.sre: SREBatchRunnerBase early assignment [cookbooks] - 10https://gerrit.wikimedia.org/r/775332 [15:23:49] (03PS1) 10Volans: sre.cdn.roll-restart-varnish: typo on CLI argument [cookbooks] - 10https://gerrit.wikimedia.org/r/775333 [15:25:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23839 and previous config saved to /var/cache/conftool/dbconfig/20220330-152539-ladsgroup.json [15:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:48] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:26:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [15:26:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [15:26:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23840 and previous config saved to /var/cache/conftool/dbconfig/20220330-152613-ladsgroup.json [15:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:20] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ores1001.eqiad.wmnet [15:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:41] 10SRE, 10Generated Data Platform, 10serviceops, 10Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10herron) p:05Triage→03Medium [15:28:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23841 and previous config saved to /var/cache/conftool/dbconfig/20220330-152821-ladsgroup.json [15:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:19] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable profile::auto_restarts::service for prometheus-blackbox-exporter [puppet] - 10https://gerrit.wikimedia.org/r/775288 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:29:34] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable profile::auto_restarts::service for Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/775298 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:30:45] (JobUnavailable) firing: (2) Reduced availability for job trafficserver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:30:45] 
(03CR) 10Jforrester: "The "override" used the official SVG version from Commons, https://commons.wikimedia.org/wiki/File:Wikipedia-logo-v2-zh-hans.svg – can you" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775320 (https://phabricator.wikimedia.org/T276694) (owner: 10Steven Sun) [15:31:08] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:31:59] (03CR) 10Filippo Giunchedi: "LGTM, could you change prometheus::pop too ?" [puppet] - 10https://gerrit.wikimedia.org/r/775296 (owner: 10Muehlenhoff) [15:32:18] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ores1001.eqiad.wmnet [15:32:20] 10SRE, 10Growth-Team, 10Notifications, 10Wikimedia-production-error: Failed to fetch API response from {wiki}. Error code {code} - https://phabricator.wikimedia.org/T304927 (10herron) 05Open→03Resolved a:03herron Thanks for the report @kostajh yes this has been addressed and an acknowledgement has be... 
[15:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P23842 and previous config saved to /var/cache/conftool/dbconfig/20220330-153621-marostegui.json [15:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:40] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10dancy) [15:42:14] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10dancy) I have confirmed that being in the `deployment` group will allow sudo to `www-data`... [15:43:01] !log jelto@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM gitlab-runner1001.eqiad.wmnet [15:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23843 and previous config saved to /var/cache/conftool/dbconfig/20220330-154326-ladsgroup.json [15:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:14] 10SRE, 10ops-codfw, 10Traffic: Degraded RAID on cp2028 - https://phabricator.wikimedia.org/T305047 (10MMandere) 05Open→03Invalid The problem later resolved on Icinga as the check succeeded, after the reimage of the instance was complete. 
[15:45:22] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:43] (03CR) 10BBlack: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/775251 (owner: 10Filippo Giunchedi) [15:45:46] ^ thats me + a.rnold and expected [15:46:24] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: pin ats-tls targets to class not cluster [puppet] - 10https://gerrit.wikimedia.org/r/775251 (owner: 10Filippo Giunchedi) [15:46:51] !log jelto@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM gitlab-runner1001.eqiad.wmnet [15:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:43] !log jelto@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM gitlab-runner2001.codfw.wmnet [15:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:51:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T297189)', diff saved to https://phabricator.wikimedia.org/P23844 and previous config saved to /var/cache/conftool/dbconfig/20220330-155126-marostegui.json [15:51:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1112.eqiad.wmnet with reason: Maintenance [15:51:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1112.eqiad.wmnet with reason: Maintenance [15:51:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:51:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:51:35] !log jelto@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM gitlab-runner2001.codfw.wmnet [15:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:36] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [15:51:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T297189)', diff saved to https://phabricator.wikimedia.org/P23845 and previous config saved to /var/cache/conftool/dbconfig/20220330-155139-marostegui.json [15:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:52:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:25] !log jelto@cumin1001 START - 
Cookbook sre.ganeti.reboot-vm for VM gitlab1001.wikimedia.org [15:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:36] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:56:43] ^ "expected" [15:58:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23847 and previous config saved to /var/cache/conftool/dbconfig/20220330-155832-ladsgroup.json [15:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:48] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [16:00:22] (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:04:04] !log jelto@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM gitlab1001.wikimedia.org [16:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:12] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:20] 10SRE-OnFire, 10Discovery-Search, 10Wikidata, 10wdwb-tech, and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Lydia_Pintscher) p:05Low→03High Changing the priority to high based on today's discussion in the query service sync. This is be... 
[16:05:22] (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:05:54] 10SRE-OnFire, 10Discovery-Search, 10Wikidata, 10wdwb-tech, and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Lydia_Pintscher) [16:07:26] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Pybal, and 2 others: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10Lydia_Pintscher) [16:07:32] 10SRE-OnFire, 10Discovery-Search, 10Wikidata, 10wdwb-tech, and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Lydia_Pintscher) [16:10:22] (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:53] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (10TomekSikora.Monsoon) 05Invalid→03Open I have received this information from a member of your team: 1. It doesn't need an ssh access or ssh keys unless you're going to need private data 2.... 
[16:13:16] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23848 and previous config saved to /var/cache/conftool/dbconfig/20220330-161337-ladsgroup.json [16:13:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:13:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [16:13:45] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:13:47] (03PS1) 10Elukey: Apply the istio sidecar/mesh settings to the ml-serve configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/775343 (https://phabricator.wikimedia.org/T297612) [16:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:49] (03PS1) 10Elukey: knative-serving: refactor support for egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/775344 (https://phabricator.wikimedia.org/T297612) [16:13:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [16:13:51] (03PS1) 10Elukey: Move ml-serve pod configs to Istio proxy sidecars [deployment-charts] - 10https://gerrit.wikimedia.org/r/775345 
(https://phabricator.wikimedia.org/T297612) [16:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:13:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:14:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:14:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:14:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23849 and previous config saved to /var/cache/conftool/dbconfig/20220330-161418-ladsgroup.json [16:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23850 and previous config saved to /var/cache/conftool/dbconfig/20220330-161626-ladsgroup.json [16:16:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-druid1001.eqiad.wmnet [16:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:16] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (cmooney) [16:21:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-druid1001.eqiad.wmnet [16:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:36] (CR) Klausman: [C: +1] Apply the istio sidecar/mesh settings to the ml-serve configs [deployment-charts] - https://gerrit.wikimedia.org/r/775343 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [16:24:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet [16:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:09] SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502
(RhinosF1) You've filled out the wrong form for LDAP access [16:26:27] SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (RhinosF1) Please see https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?title=Grant%20Access%20to%20%3CINSERT%20LDAP%20GROUP%3E%20for%20%3CINSERT%20USERNAME%3E&description=*%20The%2... [16:26:45] SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (RhinosF1) Open→Stalled [16:28:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet [16:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:29] SRE, ops-codfw: Dell switches testing: Setup mgmt for two servers for testing - https://phabricator.wikimedia.org/T305070 (Aklapper) [16:30:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:30:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23852 and previous config saved to /var/cache/conftool/dbconfig/20220330-163132-ladsgroup.json [16:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T297189)', diff saved to https://phabricator.wikimedia.org/P23853 and previous config saved to /var/cache/conftool/dbconfig/20220330-163217-marostegui.json [16:32:22] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:23] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [16:40:27] (CR) Volans: [C: +2] "Self-merging to unblock the testing of a new cookbook, I'll amend any post-merge comment in a follow up patch." [cookbooks] - https://gerrit.wikimedia.org/r/775332 (owner: Volans) [16:40:49] (CR) Volans: [C: +2] "Trivial, self-merging to unblock the testing of the new cookbook, I'll amend any post-merge comment in a follow up patch." [cookbooks] - https://gerrit.wikimedia.org/r/775333 (owner: Volans) [16:43:18] (Merged) jenkins-bot: cookbooks.sre: SREBatchRunnerBase early assignment [cookbooks] - https://gerrit.wikimedia.org/r/775332 (owner: Volans) [16:44:10] (Merged) jenkins-bot: sre.cdn.roll-restart-varnish: typo on CLI argument [cookbooks] - https://gerrit.wikimedia.org/r/775333 (owner: Volans) [16:46:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23854 and previous config saved to /var/cache/conftool/dbconfig/20220330-164637-ladsgroup.json [16:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P23855 and previous config saved to /var/cache/conftool/dbconfig/20220330-164722-marostegui.json [16:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:35] SRE, Infrastructure-Foundations, Traffic, netops, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (BBlack) For esams failover testing: we're planning to attempt this on Thursday. The idea is to merge the oustanding patches and then depool esa...
[16:52:27] !log sudo systemctl reload icinga.service on alert1001 [16:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:40] !log "Manually decommissioning xe-0/0/1 on lsw1-e2-eqiad before reimage of ms-be1069 from scratch, attempt to replicate ARP error seen previously while running debug." [16:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:13] (PS1) Jcrespo: Add functionality to "archiving" older status of a file [software/mediabackups] - https://gerrit.wikimedia.org/r/775354 [16:59:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298557)', diff saved to https://phabricator.wikimedia.org/P23856 and previous config saved to /var/cache/conftool/dbconfig/20220330-165903-marostegui.json [16:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:10] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [17:01:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23857 and previous config saved to /var/cache/conftool/dbconfig/20220330-170142-ladsgroup.json [17:01:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [17:01:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [17:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:50] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:01:51] !log ladsgroup@cumin1001
dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23858 and previous config saved to /var/cache/conftool/dbconfig/20220330-170150-ladsgroup.json [17:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:01] (PS1) Volans: cookbooks.sre: SREBatchRunnerBase fix typo [cookbooks] - https://gerrit.wikimedia.org/r/775355 [17:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P23859 and previous config saved to /var/cache/conftool/dbconfig/20220330-170227-marostegui.json [17:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:07:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:52] (CR) Volans: [C: +2] "Self merging to unblock testing, trivial bug."
[cookbooks] - https://gerrit.wikimedia.org/r/775355 (owner: Volans) [17:11:54] (Merged) jenkins-bot: cookbooks.sre: SREBatchRunnerBase fix typo [cookbooks] - https://gerrit.wikimedia.org/r/775355 (owner: Volans) [17:13:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23861 and previous config saved to /var/cache/conftool/dbconfig/20220330-171259-ladsgroup.json [17:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:09] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:14:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P23862 and previous config saved to /var/cache/conftool/dbconfig/20220330-171408-marostegui.json [17:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T297189)', diff saved to https://phabricator.wikimedia.org/P23864 and previous config saved to /var/cache/conftool/dbconfig/20220330-171732-marostegui.json [17:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:39] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [17:18:19] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) Emailed our Dell team with our issues, will update as they respond. > Dell Team, > > We're currently attempting to get the new raid controllers for function for us so we can unblock and order a...
[17:18:28] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) In progress→Open [17:21:55] (CR) Ahmon Dancy: [C: +1] scap: make rsync use new compress algorithm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774824 (https://phabricator.wikimedia.org/T252540) (owner: Hashar) [17:24:12] PROBLEM - Host ms-be1069 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23865 and previous config saved to /var/cache/conftool/dbconfig/20220330-172804-ladsgroup.json [17:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P23866 and previous config saved to /var/cache/conftool/dbconfig/20220330-172913-marostegui.json [17:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:09] (PS1) RLazarus: external_clouds_vendors: Add Linode [puppet] - https://gerrit.wikimedia.org/r/775360 (https://phabricator.wikimedia.org/T270391) [17:38:57] * addshore goes to debug something on mwdebug2001 [17:43:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23867 and previous config saved to /var/cache/conftool/dbconfig/20220330-174309-ladsgroup.json [17:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298557)', diff saved to https://phabricator.wikimedia.org/P23868 and previous config saved to /var/cache/conftool/dbconfig/20220330-174418-marostegui.json [17:44:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on
db1142.eqiad.wmnet with reason: Maintenance [17:44:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance [17:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:24] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [17:44:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298557)', diff saved to https://phabricator.wikimedia.org/P23869 and previous config saved to /var/cache/conftool/dbconfig/20220330-174426-marostegui.json [17:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:46:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:46:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [17:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [17:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:13] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1069.eqiad.wmnet with OS stretch [17:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:18] 
SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host ms-be1069.eqiad.wmnet with OS stretch [17:55:27] Amir1: im guessing your doing something to make testwiki readonly? :D (If so any chance of a poke when done?) [17:55:44] I don't [17:55:55] and if testwiki is read-only it probably means all of s3 is read-only [17:56:09] https://usercontent.irccloud-cdn.com/file/wyDSJHSh/image.png [17:56:12] harmmm [17:56:18] *tries another page* [17:56:42] oh its gone now! [17:56:58] aah [17:57:00] sorry for the noise, thought it might have been something to do with your depools above [17:58:08] oh im an idiot, i was pointing at mwdebug2001 [17:58:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23870 and previous config saved to /var/cache/conftool/dbconfig/20220330-175814-ladsgroup.json [17:58:15] actually in logs, enwiki is read only a lot [17:58:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:58:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:21] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:58:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23871 and previous config saved to
/var/cache/conftool/dbconfig/20220330-175822-ladsgroup.json [17:58:24] like 100k times [17:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:47] oh it's all codfw [17:58:49] whatever [17:59:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23872 and previous config saved to /var/cache/conftool/dbconfig/20220330-175930-ladsgroup.json [17:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:37] addshore: these days we basically depool/repool a host all the time, look at SAL, we basically made it useless [17:59:46] xD [18:00:05] hashar and jeena: #bothumor I � Unicode. All rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220330T1800). [18:00:05] hashar and jeena: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220330T1800). 
[18:00:19] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage [18:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:38] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet [18:00:40] !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host zookeeper-test1002.eqiad.wmnet [18:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:18] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet [18:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:42] (CR) Volans: [C: +1] "I didn't test it but the change looks sane to me." [puppet] - https://gerrit.wikimedia.org/r/775360 (https://phabricator.wikimedia.org/T270391) (owner: RLazarus) [18:03:49] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1069.eqiad.wmnet with reason: host reimage [18:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:44] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zookeeper-test1002.eqiad.wmnet [18:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:11:07] !log razzi@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka test-eqiad cluster: Reboot kafka nodes [18:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:13] (PS1) Andrew Bogott: openstack::serverpackages::wallaby::bullseye: install python3-eventlet from bpo [puppet] - https://gerrit.wikimedia.org/r/775365 (https://phabricator.wikimedia.org/T304694) [18:14:25] (PS1) Volans: sre.cdn.roll-restart-varnish: fix typo and SAL log [cookbooks] - https://gerrit.wikimedia.org/r/775366 [18:14:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23873 and previous config saved to /var/cache/conftool/dbconfig/20220330-181435-ladsgroup.json [18:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:13] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:19:29] (CR) Volans: [C: +2] "Self merging to unblock testing, trivial bug/improvement." [cookbooks] - https://gerrit.wikimedia.org/r/775366 (owner: Volans) [18:19:49] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:18] (Merged) jenkins-bot: sre.cdn.roll-restart-varnish: fix typo and SAL log [cookbooks] - https://gerrit.wikimedia.org/r/775366 (owner: Volans) [18:22:25] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:23:25] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:25:26] !log ladsgroup@cumin1001 START - Cookbook
sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [18:25:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [18:25:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23874 and previous config saved to /var/cache/conftool/dbconfig/20220330-182537-ladsgroup.json [18:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:48] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:25:53] (PS1) Sergio Gimeno: Post-edit dialog: check for presence of preferences.topicFilters [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/775370 (https://phabricator.wikimedia.org/T305057) [18:27:09] (PS1) Sergio Gimeno: Newcomer tasks: always align button and text to the right [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/775371
(https://phabricator.wikimedia.org/T301825) [18:29:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23875 and previous config saved to /var/cache/conftool/dbconfig/20220330-182940-ladsgroup.json [18:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:27] (PS1) Ottomata: Finalize WikipediaPortal eventlogging event platform migration [puppet] - https://gerrit.wikimedia.org/r/775374 (https://phabricator.wikimedia.org/T282012) [18:32:55] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:38:24] (CR) Ottomata: [C: +2] Finalize WikipediaPortal eventlogging event platform migration [puppet] - https://gerrit.wikimedia.org/r/775374 (https://phabricator.wikimedia.org/T282012) (owner: Ottomata) [18:38:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23876 and previous config saved to /var/cache/conftool/dbconfig/20220330-183832-ladsgroup.json [18:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:44:03] PROBLEM - Check systemd state on kafka-test1006 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23877 and previous config saved to
/var/cache/conftool/dbconfig/20220330-184445-ladsgroup.json [18:44:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:44:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:44:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [18:44:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [18:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23878 and previous config saved to /var/cache/conftool/dbconfig/20220330-184458-ladsgroup.json [18:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:26] (CR) jerkins-bot: [V: -1] Post-edit dialog: check for presence of preferences.topicFilters [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - https://gerrit.wikimedia.org/r/775370 (https://phabricator.wikimedia.org/T305057) (owner: Sergio Gimeno) [18:53:37] !log ladsgroup@cumin1001 dbctl
commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P23879 and previous config saved to /var/cache/conftool/dbconfig/20220330-185337-ladsgroup.json [18:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:15] (PS2) Vivian Rook: openstack::serverpackages::wallaby::bullseye: install python3-eventlet from bpo-nochange [puppet] - https://gerrit.wikimedia.org/r/775365 (https://phabricator.wikimedia.org/T304694) (owner: Andrew Bogott) [18:56:07] (CR) jerkins-bot: [V: -1] openstack::serverpackages::wallaby::bullseye: install python3-eventlet from bpo-nochange [puppet] - https://gerrit.wikimedia.org/r/775365 (https://phabricator.wikimedia.org/T304694) (owner: Andrew Bogott) [18:57:23] (CR) Ssingh: [C: +1] site: Reimage cp5009 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/775325 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [18:57:48] (CR) Ssingh: [C: +1] site: Reimage cp4023 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/775324 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [18:58:12] (CR) Ssingh: [C: +1] site: Reimage cp3057 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/775323 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [18:58:36] (CR) Ssingh: [C: +1] site: Reimage cp4029 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/775322 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [18:58:47] (CR) Ssingh: [C: +1] site: Reimage cp3056 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/775321 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [18:59:00] (CR) Ssingh: [C: +1] site: Reimage cp6016 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/775326 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [18:59:12] (CR) Ssingh: [C:
03+1] site: Reimage cp5003 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775327 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [18:59:27] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp6008 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/775328 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [19:04:44] (03PS3) 10Vivian Rook: openstack::serverpackages::wallaby::bullseye: python3-eventlet from nochange [puppet] - 10https://gerrit.wikimedia.org/r/775365 (https://phabricator.wikimedia.org/T304694) (owner: 10Andrew Bogott) [19:07:59] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:08:03] (03PS1) 10Cwhite: logstash: add pipeline diagnostics [puppet] - 10https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) [19:08:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P23880 and previous config saved to /var/cache/conftool/dbconfig/20220330-190842-ladsgroup.json [19:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:13] 10SRE, 10LDAP-Access-Requests: Requesting access to LDAP group NDA for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10herron) [19:17:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298557)', diff saved to https://phabricator.wikimedia.org/P23881 and previous config saved to /var/cache/conftool/dbconfig/20220330-191713-marostegui.json [19:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:20] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [19:19:18] 10SRE, 10LDAP-Access-Requests: Requesting access to LDAP group NDA for TomekSikora.Monsoon - 
https://phabricator.wikimedia.org/T304502 (10herron) 05Stalled→03Open >>! In T304502#7819023, @TomekSikora.Monsoon wrote: > I have received this information from a member of your team: > 1. It doesn't need an ssh a... [19:19:33] (03CR) 10Vivian Rook: [C: 03+1] openstack::serverpackages::wallaby::bullseye: python3-eventlet from nochange [puppet] - 10https://gerrit.wikimedia.org/r/775365 (https://phabricator.wikimedia.org/T304694) (owner: 10Andrew Bogott) [19:21:40] 10SRE, 10LDAP-Access-Requests: Requesting access to LDAP group NDA for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10herron) [19:23:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P23882 and previous config saved to /var/cache/conftool/dbconfig/20220330-192347-ladsgroup.json [19:23:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [19:23:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [19:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:23:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23883 and previous config saved to /var/cache/conftool/dbconfig/20220330-192355-ladsgroup.json [19:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:02] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.5) - 10https://gerrit.wikimedia.org/r/775370 (https://phabricator.wikimedia.org/T305057) (owner: 10Sergio Gimeno) [19:32:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P23884 and previous config saved to /var/cache/conftool/dbconfig/20220330-193218-marostegui.json [19:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:57] PROBLEM - Check systemd state on kafka-test1009 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:06] (03CR) 10Herron: [C: 03+1] Enable profile::auto_restarts::service for prometheus-blackbox-exporter [puppet] - 10https://gerrit.wikimedia.org/r/775288 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [19:45:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23885 and previous config saved to /var/cache/conftool/dbconfig/20220330-194512-ladsgroup.json [19:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:47:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P23886 and previous config saved to /var/cache/conftool/dbconfig/20220330-194723-marostegui.json [19:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:50:25] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:56:55] !log razzi@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka test-eqiad cluster: Reboot kafka nodes [19:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:41] (03CR) 10Andrew Bogott: [C: 03+1] openstack::serverpackages::wallaby::bullseye: python3-eventlet from nochange [puppet] - 10https://gerrit.wikimedia.org/r/775365 (https://phabricator.wikimedia.org/T304694) (owner: 10Andrew Bogott) [20:00:04] RoanKattouw and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220330T2000). [20:00:05] JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23887 and previous config saved to /var/cache/conftool/dbconfig/20220330-200017-ladsgroup.json [20:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:01] I can deploy [20:01:18] I'm here! 
[20:01:43] (03CR) 10Catrope: [C: 03+2] Add surveys to enwiki on beta for QA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774948 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [20:02:24] (03Merged) 10jenkins-bot: Add surveys to enwiki on beta for QA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774948 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [20:02:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298557)', diff saved to https://phabricator.wikimedia.org/P23888 and previous config saved to /var/cache/conftool/dbconfig/20220330-200229-marostegui.json [20:02:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance [20:02:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance [20:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:35] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [20:02:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298557)', diff saved to https://phabricator.wikimedia.org/P23889 and previous config saved to /var/cache/conftool/dbconfig/20220330-200236-marostegui.json [20:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:59] JSherman: Alright, it's merged, and since it's a beta labs-only patch, it'll be deployed automatically by a job that runs every 10 mins [20:04:14] Awesome; for future reference, is that merge a thing I should have self serviced on my own time, or was it still best to wait for a backport window? 
[20:05:10] I'm not 100% sure, but I think it's probably OK to merge that yourself as long as you then also "git pull" it on the deployment server [20:05:46] (and if you don't have ssh access to the deployment server, you probably won't / shouldn't have +2 rights in the config repo either) [20:05:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:05:59] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift-object-reconstructor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:06:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:06:50] RoanKattouw: yeah, I'll need to go through deployment training first then. It's on my list to do in the not so distant future. Thanks! 
[20:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:05] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:10:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23890 and previous config saved to /var/cache/conftool/dbconfig/20220330-201006-ladsgroup.json [20:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:10:45] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:12:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:13:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23891 and previous config saved to /var/cache/conftool/dbconfig/20220330-201522-ladsgroup.json [20:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:43] RoanKattouw: I'm still not seeing my surveys on beta. How can I check if the deployment job ran? [20:25:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P23892 and previous config saved to /var/cache/conftool/dbconfig/20220330-202511-ladsgroup.json [20:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:55] RoanKattouw can I add a patch for the current deployment window? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/773966 is PHPCS cleanup [20:30:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23893 and previous config saved to /var/cache/conftool/dbconfig/20220330-203028-ladsgroup.json [20:30:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [20:30:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [20:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:30:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23894 and previous config saved to /var/cache/conftool/dbconfig/20220330-203035-ladsgroup.json [20:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23895 and previous config saved to /var/cache/conftool/dbconfig/20220330-203243-ladsgroup.json [20:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:11] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:40:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P23896 and previous config saved to /var/cache/conftool/dbconfig/20220330-204016-ladsgroup.json [20:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:53] urbanecm: I see you are also on the calendar for this window; any chance you could help me? [20:40:54] > I'm still not seeing my surveys on beta. How can I check if the deployment job ran? [20:41:23] hello JSherman, how can i help? [20:42:20] JSherman: if you can link me the patch, i can check it [20:42:29] i don't think you can check it yourself, unless you have shell access to beta [20:42:38] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/774948 [20:43:04] DannyS712: Yes that's fine, eating lunch now but I'll take care of it when I'm back [20:43:11] urbanecm: I do not. [20:43:48] in that case, let me check [20:45:28] JSherman: i see your patch at the beta MW servers [20:45:44] okay, that means that the config isn't working the way I expect [20:45:49] yes [20:45:52] unfortunately :( [20:46:03] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:45] So, we should probably revert? I can go try to id the problem later. [20:47:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23897 and previous config saved to /var/cache/conftool/dbconfig/20220330-204748-ladsgroup.json [20:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:56] JSherman: i think we can do both. 
it's not breaking beta or anything [20:48:09] i can also quickly check if i can find the cause quickly [20:49:00] urbanecm: that would be awesome.  The contents of each survey config are the same as what I have in my local docker. Really the only changes are that they are inside the enwiki config and have named keys for each survey [20:51:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [20:51:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [20:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23898 and previous config saved to /var/cache/conftool/dbconfig/20220330-205521-ladsgroup.json [20:55:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [20:55:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [20:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:55:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23899 and previous config saved to /var/cache/conftool/dbconfig/20220330-205529-ladsgroup.json [20:55:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:50] JSherman: unfortunately, i don't see anything :( [21:02:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23900 and previous config saved to /var/cache/conftool/dbconfig/20220330-210253-ladsgroup.json [21:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:49] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [21:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:54] urbanecm: I just updated my local config to use those same named keys, and they still work in my local environment [21:03:54] I also verified that the existing survey on beta is working: [21:03:55] https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Selenium_Echo_mention_test_0.22687354168980223&quicksurvey=internal-gdi-safety-survey [21:03:55] I also verified that the new ones are definitely not: [21:03:56] https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Selenium_Echo_mention_test_0.22687354168980223&quicksurvey=T294363-1 [21:03:56] T294363: Link to additional information and resources in the thank you message - https://phabricator.wikimedia.org/T294363 [21:04:52] I have a team member who has deployed way more of these than I have, so I'll ask him for a recheck on my work to see if I've created something that ends up with a problem from the config array merging or something. 
[21:07:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [21:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:31] (03CR) 10Vivian Rook: [C: 03+2] openstack::serverpackages::wallaby::bullseye: python3-eventlet from nochange [puppet] - 10https://gerrit.wikimedia.org/r/775365 (https://phabricator.wikimedia.org/T304694) (owner: 10Andrew Bogott) [21:09:02] urbanecm: thanks for all your help on this! [21:09:46] JSherman: if that helps i dumped the config key at beta and I don't see the new surveys added there [21:10:02] (by dumped i mean viewed it in shell.php) [21:10:08] I can send you the output if you'd like. [21:10:21] gotcha. Yeah, that would be helpful! [21:14:39] urbanecm: I've got to run, but that output would be useful to look at when I come back to this tomorrow! [21:17:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23902 and previous config saved to /var/cache/conftool/dbconfig/20220330-211758-ladsgroup.json [21:18:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [21:18:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [21:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:18:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23903 and previous config saved to 
/var/cache/conftool/dbconfig/20220330-211806-ladsgroup.json [21:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:26] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [21:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:05] JSherman: here you o https://www.irccloud.com/pastebin/84N0IaGP/output.txt [21:19:06] *go [21:21:46] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0) [21:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:51] PROBLEM - Host wdqs2007 is DOWN: PING CRITICAL - Packet loss = 100% [21:22:29] RECOVERY - Host wdqs2007 is UP: PING OK - Packet loss = 0%, RTA = 32.74 ms [21:22:55] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:38:24] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [21:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:15:02] (03PS1) 10Andrew Bogott: Update git repo to correspond to the actual running files [wikitech-static] - 10https://gerrit.wikimedia.org/r/775396 [22:15:48] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [22:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to 
https://phabricator.wikimedia.org/P23904 and previous config saved to /var/cache/conftool/dbconfig/20220330-221820-ladsgroup.json [22:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:20:11] (03CR) 10RLazarus: [C: 03+1] Ensure the data in kubernetes secrets is ordered by key [deployment-charts] - 10https://gerrit.wikimedia.org/r/774528 (owner: 10JMeybohm) [22:20:31] (03PS1) 10Andrew Bogott: import-wikitech.sh: nukeNS.php --ns 8 before import [wikitech-static] - 10https://gerrit.wikimedia.org/r/775397 [22:22:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298557)', diff saved to https://phabricator.wikimedia.org/P23905 and previous config saved to /var/cache/conftool/dbconfig/20220330-222240-marostegui.json [22:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:48] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [22:23:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23906 and previous config saved to /var/cache/conftool/dbconfig/20220330-222351-ladsgroup.json [22:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:56] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:33:26] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23907 and previous config saved to /var/cache/conftool/dbconfig/20220330-223325-ladsgroup.json [22:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P23908 and previous config saved to /var/cache/conftool/dbconfig/20220330-223745-marostegui.json [22:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23909 and previous config saved to /var/cache/conftool/dbconfig/20220330-223856-ladsgroup.json [22:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:35] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:48:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23910 and previous config saved to /var/cache/conftool/dbconfig/20220330-224831-ladsgroup.json [22:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P23911 and previous config saved to /var/cache/conftool/dbconfig/20220330-225250-marostegui.json [22:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P23912 and previous config saved to 
/var/cache/conftool/dbconfig/20220330-225401-ladsgroup.json [22:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:54] (03CR) 10Krinkle: [C: 03+1] "LGTM. Seems trivial enough." [wikitech-static] - 10https://gerrit.wikimedia.org/r/775396 (owner: 10Andrew Bogott) [22:58:36] (03PS2) 10Krinkle: import-wikitech.sh: nukeNS.php --ns 8 before import [wikitech-static] - 10https://gerrit.wikimedia.org/r/775397 (owner: 10Andrew Bogott) [22:58:39] (03CR) 10Krinkle: [C: 03+1] import-wikitech.sh: nukeNS.php --ns 8 before import [wikitech-static] - 10https://gerrit.wikimedia.org/r/775397 (owner: 10Andrew Bogott) [23:02:40] (03CR) 10Dzahn: icinga/lists: fix double quoted mailman monitoring check commands (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774540 (https://phabricator.wikimedia.org/T304323) (owner: 10Dzahn) [23:03:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23913 and previous config saved to /var/cache/conftool/dbconfig/20220330-230336-ladsgroup.json [23:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:04:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [23:04:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [23:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to 
https://phabricator.wikimedia.org/P23914 and previous config saved to /var/cache/conftool/dbconfig/20220330-230408-ladsgroup.json [23:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:28] (03CR) 10Dzahn: httpbb: follow-up to 'fix status code checks for CodeReview redirects' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774981 (https://phabricator.wikimedia.org/T205361) (owner: 10Dzahn) [23:06:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23915 and previous config saved to /var/cache/conftool/dbconfig/20220330-230615-ladsgroup.json [23:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298557)', diff saved to https://phabricator.wikimedia.org/P23916 and previous config saved to /var/cache/conftool/dbconfig/20220330-230755-marostegui.json [23:07:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [23:07:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [23:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:01] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [23:08:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298557)', diff saved to https://phabricator.wikimedia.org/P23917 and previous config saved to /var/cache/conftool/dbconfig/20220330-230803-marostegui.json [23:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23918 and previous config saved to /var/cache/conftool/dbconfig/20220330-230905-ladsgroup.json [23:09:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [23:09:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [23:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:09:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23919 and previous config saved to /var/cache/conftool/dbconfig/20220330-230914-ladsgroup.json [23:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23920 and previous config saved to /var/cache/conftool/dbconfig/20220330-232120-ladsgroup.json [23:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', 
diff saved to https://phabricator.wikimedia.org/P23921 and previous config saved to /var/cache/conftool/dbconfig/20220330-233625-ladsgroup.json [23:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:05] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:39:47] (CR) RLazarus: [C: +2] Add missing Build-Depends entry [software/httpbb] - https://gerrit.wikimedia.org/r/761442 (owner: RLazarus) [23:40:58] (Merged) jenkins-bot: Add missing Build-Depends entry [software/httpbb] - https://gerrit.wikimedia.org/r/761442 (owner: RLazarus) [23:51:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23922 and previous config saved to /var/cache/conftool/dbconfig/20220330-235131-ladsgroup.json [23:51:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [23:51:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [23:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:51:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [23:51:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [23:51:41] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23923 and previous config saved to /var/cache/conftool/dbconfig/20220330-235143-ladsgroup.json [23:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P23924 and previous config saved to /var/cache/conftool/dbconfig/20220330-235311-ladsgroup.json [23:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23925 and previous config saved to /var/cache/conftool/dbconfig/20220330-235351-ladsgroup.json [23:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log