[00:07:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26112 and previous config saved to /var/cache/conftool/dbconfig/20220422-000708-ladsgroup.json [00:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:07:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [00:07:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [00:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26113 and previous config saved to /var/cache/conftool/dbconfig/20220422-000732-ladsgroup.json [00:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P26114 and previous config saved to /var/cache/conftool/dbconfig/20220422-001418-ladsgroup.json [00:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, 
user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:21:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26115 and previous config saved to /var/cache/conftool/dbconfig/20220422-002129-ladsgroup.json [00:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:34] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:27:05] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P26116 and previous config saved to /var/cache/conftool/dbconfig/20220422-002924-ladsgroup.json [00:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:59] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Eevans) >>! In T305568#7872880, @Papaul wrote: > @Eevans yes B is row B , 6 is the rack number and U35 is the position of the server in the rack (row B rack 6 posi... 
[00:36:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26117 and previous config saved to /var/cache/conftool/dbconfig/20220422-003634-ladsgroup.json [00:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P26118 and previous config saved to /var/cache/conftool/dbconfig/20220422-004429-ladsgroup.json [00:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:33] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26119 and previous config saved to /var/cache/conftool/dbconfig/20220422-005140-ladsgroup.json [00:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:00] (03PS1) 10Gergő Tisza: GrowthExperiments: Do not use 'facebook' in campaign pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785245 (https://phabricator.wikimedia.org/T303785) [00:59:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P26120 and previous config saved to /var/cache/conftool/dbconfig/20220422-005934-ladsgroup.json [00:59:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [00:59:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [00:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:39] T298565: Fix 
mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P26121 and previous config saved to /var/cache/conftool/dbconfig/20220422-005942-ladsgroup.json [00:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:03:41] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Papaul) @Eevans since we have only for rows in codfw do think doing [AC] and [BD] with each row having 3 servers in a rack will work or not please advice I will be... 
[01:06:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26122 and previous config saved to /var/cache/conftool/dbconfig/20220422-010645-ladsgroup.json [01:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [01:06:49] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:06:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [01:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [01:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [01:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:45] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10Papaul) @lmata please see below for list requested on cumin: - sudo cookbook sre.hosts.provision - sudo cookbook sre.hosts.reimage - sudo coo... 
[01:40:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [01:46:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [01:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:48:49] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Eevans) >>! In T305568#7873122, @Papaul wrote: > @Eevans since we have only for rows in codfw do think doing [AC] and [BD] with each row having 3 servers in a rack... 
[01:58:31] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:59:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P26123 and previous config saved to /var/cache/conftool/dbconfig/20220422-015957-ladsgroup.json [02:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:13:45] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Papaul) @Eevans I do not have any space issue in codfw for now, so I think [A,D], [B], [C] should work without a problem. Now what i will like for you to give me... [02:15:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P26124 and previous config saved to /var/cache/conftool/dbconfig/20220422-021502-ladsgroup.json [02:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:41] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:24:15] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10AlexisJazz) 502 again. 
[02:25:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [02:25:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [02:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26125 and previous config saved to /var/cache/conftool/dbconfig/20220422-022544-ladsgroup.json [02:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:49] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:30:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P26126 and previous config saved to /var/cache/conftool/dbconfig/20220422-023007-ladsgroup.json [02:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:45:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P26127 and previous config saved to /var/cache/conftool/dbconfig/20220422-024512-ladsgroup.json [02:45:16] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [02:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:45:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [02:45:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [02:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [02:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:47] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:52:12] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:40] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:18:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26129 and previous config saved to /var/cache/conftool/dbconfig/20220422-031801-ladsgroup.json [03:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:18:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:22:10] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:26:18] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.138 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:33:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P26130 and previous config saved to /var/cache/conftool/dbconfig/20220422-033306-ladsgroup.json [03:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:30] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:38:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [03:38:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with 
reason: Maintenance [03:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:55] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10wiki_willy) Thanks @Papaul. Access for John Clark to run these commands is all approved on my end as well. Thanks, Willy [03:48:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P26131 and previous config saved to /var/cache/conftool/dbconfig/20220422-034811-ladsgroup.json [03:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:03:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26132 and previous config saved to /var/cache/conftool/dbconfig/20220422-040316-ladsgroup.json [04:03:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [04:03:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [04:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 
[04:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26133 and previous config saved to /var/cache/conftool/dbconfig/20220422-040325-ladsgroup.json [04:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:41] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Vyhoanganhkiet) a:03Vyhoanganhkiet [04:22:48] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:23:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [04:23:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [04:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:28:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47965 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:47:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26134 and previous config saved to 
/var/cache/conftool/dbconfig/20220422-044730-ladsgroup.json [04:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:47:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:02:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P26135 and previous config saved to /var/cache/conftool/dbconfig/20220422-050235-ladsgroup.json [05:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:07:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [05:07:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [05:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P26136 and previous config saved to /var/cache/conftool/dbconfig/20220422-050802-ladsgroup.json [05:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, 
user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:17:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P26137 and previous config saved to /var/cache/conftool/dbconfig/20220422-051740-ladsgroup.json [05:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26138 and previous config saved to /var/cache/conftool/dbconfig/20220422-053246-ladsgroup.json [05:32:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [05:32:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [05:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:44] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:05:15] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10AlexisJazz) a:05Vyhoanganhkiet→03None [06:08:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P26139 and previous config saved to /var/cache/conftool/dbconfig/20220422-060816-ladsgroup.json [06:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:12:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [06:12:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [06:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26140 and previous config saved to 
/var/cache/conftool/dbconfig/20220422-061304-ladsgroup.json [06:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P26141 and previous config saved to /var/cache/conftool/dbconfig/20220422-062322-ladsgroup.json [06:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:38:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P26142 and previous config saved to /var/cache/conftool/dbconfig/20220422-063827-ladsgroup.json [06:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P26143 and previous config saved to /var/cache/conftool/dbconfig/20220422-065332-ladsgroup.json [06:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:59:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26144 and previous config saved to /var/cache/conftool/dbconfig/20220422-065957-ladsgroup.json 
[07:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220422T0700) [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:15:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26145 and previous config saved to /var/cache/conftool/dbconfig/20220422-071502-ladsgroup.json [07:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:01] (03PS1) 10Ayounsi: replace_device: actually save the cable modification [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/785272 (https://phabricator.wikimedia.org/T259166) [07:26:32] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:30:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26146 and previous config saved to /var/cache/conftool/dbconfig/20220422-073007-ladsgroup.json [07:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:26] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 
69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:33:28] (PS1) Ayounsi: Remove support for legacy ELS junos syntax [homer/public] - https://gerrit.wikimedia.org/r/785273 [07:35:08] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:45:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26147 and previous config saved to /var/cache/conftool/dbconfig/20220422-074512-ladsgroup.json [07:45:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:45:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26148 and previous config saved to /var/cache/conftool/dbconfig/20220422-074520-ladsgroup.json [07:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:09] (PS1) Ayounsi: Management routers: move ssh port to 2222
[homer/public] - https://gerrit.wikimedia.org/r/785274 (https://phabricator.wikimedia.org/T277438) [07:52:05] (CR) Ayounsi: "Diff:" [homer/public] - https://gerrit.wikimedia.org/r/785274 (https://phabricator.wikimedia.org/T277438) (owner: Ayounsi) [07:59:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26149 and previous config saved to /var/cache/conftool/dbconfig/20220422-075903-ladsgroup.json [07:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:02:13] (CR) Phedenskog: grafana: provision JSON datasource (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: Phedenskog) [08:14:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P26150 and previous config saved to /var/cache/conftool/dbconfig/20220422-081408-ladsgroup.json [08:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:27] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:27:15] SRE-tools, Infrastructure-Foundations: Manage DHCP of Ganeti VMs from Netbox - https://phabricator.wikimedia.org/T297133 (Volans)
[08:28:33] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Volans) >>! In T306654#7873125, @Papaul wrote: > on apt.wikimedia.org > - sudo puppet agent This should be replaced by `run-puppet-agent` instea... [08:29:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P26151 and previous config saved to /var/cache/conftool/dbconfig/20220422-082913-ladsgroup.json [08:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:17] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Volans) [08:34:44] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Volans) I've updated the task description according to T306654#7873125. As for the `puppet-merge` on the puppetmasters, does the `datacenter-ops`...
[08:35:52] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): WIP: request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Volans) [08:36:19] (CR) Volans: [C: +1] "LGTM, lol" [software/netbox-extras] - https://gerrit.wikimedia.org/r/785272 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi) [08:44:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26152 and previous config saved to /var/cache/conftool/dbconfig/20220422-084418-ladsgroup.json [08:44:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:44:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:44:24] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298565)', diff
saved to https://phabricator.wikimedia.org/P26153 and previous config saved to /var/cache/conftool/dbconfig/20220422-084431-ladsgroup.json [08:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:03:26] PROBLEM - traffic_server backend process restarted on cp2032 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2032&var-layer=backend [09:18:37] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Volans) p:Triage→Medium [09:25:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26154 and previous config saved to /var/cache/conftool/dbconfig/20220422-092503-ladsgroup.json [09:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:09] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:32:56] PROBLEM - Varnish frontend child restarted on cp2032 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp2032&var-datasource=codfw+prometheus/ops [09:40:09] !log ladsgroup@cumin1001
dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P26155 and previous config saved to /var/cache/conftool/dbconfig/20220422-094008-ladsgroup.json [09:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:13] (PS1) Reedy: Unbreak Transcoding [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/785209 (https://phabricator.wikimedia.org/T306697) [09:44:45] (CR) Reedy: [C: +2] Unbreak Transcoding [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/785209 (https://phabricator.wikimedia.org/T306697) (owner: Reedy) [09:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:54:25] (PS2) Reedy: Unbreak Transcoding [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/785209 (https://phabricator.wikimedia.org/T306697) [09:54:43] (CR) Reedy: [C: +2] Unbreak Transcoding [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/785209 (https://phabricator.wikimedia.org/T306697) (owner: Reedy) [09:55:07] (PS3) Reedy: Unbreak Transcoding [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/785209 (https://phabricator.wikimedia.org/T306697) [09:55:11] much fail [09:55:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to
https://phabricator.wikimedia.org/P26156 and previous config saved to /var/cache/conftool/dbconfig/20220422-095513-ladsgroup.json [09:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:20] (CR) Reedy: [C: +2] Unbreak Transcoding [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/785209 (https://phabricator.wikimedia.org/T306697) (owner: Reedy) [10:10:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26157 and previous config saved to /var/cache/conftool/dbconfig/20220422-101018-ladsgroup.json [10:10:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:10:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:24] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26158 and previous config saved to /var/cache/conftool/dbconfig/20220422-101026-ladsgroup.json [10:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:39] (PS1) Cathal Mooney: Move elements from CR BGP policy and group config to separate files [homer/public] -
https://gerrit.wikimedia.org/r/785284 [10:15:08] (Merged) jenkins-bot: Unbreak Transcoding [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/785209 (https://phabricator.wikimedia.org/T306697) (owner: Reedy) [10:16:10] (PS2) Cathal Mooney: Move elements from CR BGP policy and group config to separate files [homer/public] - https://gerrit.wikimedia.org/r/785284 [10:17:05] !log reedy@deploy1002 Synchronized php-1.39.0-wmf.8/extensions/TimedMediaHandler/: T306697 (duration: 00m 50s) [10:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:09] T306697: Videos scalers cannot create jobs: Failed creating job from description - https://phabricator.wikimedia.org/T306697 [10:22:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:22:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:22:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:47:34] PROBLEM - SSH on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:47:48] PROBLEM - Swift https frontend on ms-fe1012 is
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:48:18] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:01:58] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 4.281 second response time https://wikitech.wikimedia.org/wiki/Swift [11:03:26] RECOVERY - SSH on ms-fe1012 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:03:40] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Swift [11:10:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26159 and previous config saved to /var/cache/conftool/dbconfig/20220422-111041-ladsgroup.json [11:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:25:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26160 and previous config saved to /var/cache/conftool/dbconfig/20220422-112546-ladsgroup.json [11:25:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:16] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:40:11] (CR) Ayounsi: "At first glance the concept looks good to me! I didn't look in details yet, but I don't want to block you in case I don't have time for a " [homer/public] - https://gerrit.wikimedia.org/r/785284 (owner: Cathal Mooney) [11:40:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26161 and previous config saved to /var/cache/conftool/dbconfig/20220422-114051-ladsgroup.json [11:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26162 and previous config saved to /var/cache/conftool/dbconfig/20220422-115556-ladsgroup.json [11:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:01] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:56:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:56:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason:
Maintenance [11:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26163 and previous config saved to /var/cache/conftool/dbconfig/20220422-115626-ladsgroup.json [11:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:09:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26164 and previous config saved to /var/cache/conftool/dbconfig/20220422-120924-ladsgroup.json [12:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:29] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:24:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P26165 and previous config saved to /var/cache/conftool/dbconfig/20220422-122429-ladsgroup.json [12:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:39:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P26166 and previous config saved to /var/cache/conftool/dbconfig/20220422-123934-ladsgroup.json [12:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26167 and previous config saved to /var/cache/conftool/dbconfig/20220422-125439-ladsgroup.json [12:54:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:54:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:45] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26168 and previous config saved to /var/cache/conftool/dbconfig/20220422-125447-ladsgroup.json [12:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown 
[13:08:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26169 and previous config saved to /var/cache/conftool/dbconfig/20220422-130810-ladsgroup.json [13:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:19:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:19:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:21:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P26170 and previous config saved to /var/cache/conftool/dbconfig/20220422-132315-ladsgroup.json [13:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to 
https://phabricator.wikimedia.org/P26171 and previous config saved to /var/cache/conftool/dbconfig/20220422-133820-ladsgroup.json [13:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:28] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:42:30] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:53:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26172 and previous config saved to /var/cache/conftool/dbconfig/20220422-135326-ladsgroup.json [13:53:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [13:53:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [13:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:31] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, 
user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26173 and previous config saved to /var/cache/conftool/dbconfig/20220422-135334-ladsgroup.json [13:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:23] !log removing all old user_email_token_expires rows in zhwiki [14:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:34:00] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:38:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26174 and previous config saved to /var/cache/conftool/dbconfig/20220422-143846-ladsgroup.json [14:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:50:29] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [14:50:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [14:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26175 and previous config saved to /var/cache/conftool/dbconfig/20220422-145351-ladsgroup.json [14:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:08:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:08:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:35] (PS1) Krinkle: multiversion: Simplify code and improve
documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/785308 [15:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T306560)', diff saved to https://phabricator.wikimedia.org/P26176 and previous config saved to /var/cache/conftool/dbconfig/20220422-150836-ladsgroup.json [15:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:41] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [15:08:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26177 and previous config saved to /var/cache/conftool/dbconfig/20220422-150856-ladsgroup.json [15:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T306560)', diff saved to https://phabricator.wikimedia.org/P26178 and previous config saved to /var/cache/conftool/dbconfig/20220422-151053-ladsgroup.json [15:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26179 and previous config saved to /var/cache/conftool/dbconfig/20220422-152401-ladsgroup.json [15:24:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:24:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated,
user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26180 and previous config saved to /var/cache/conftool/dbconfig/20220422-152559-ladsgroup.json [15:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:41:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26181 and previous config saved to /var/cache/conftool/dbconfig/20220422-154104-ladsgroup.json [15:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:58] !log cleaning up all of old email tokens in s2 [15:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T306560)', diff saved to https://phabricator.wikimedia.org/P26182 and previous config saved to /var/cache/conftool/dbconfig/20220422-155609-ladsgroup.json [15:56:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [15:56:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [15:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:14] T306560: Fix 
nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [15:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26183 and previous config saved to /var/cache/conftool/dbconfig/20220422-155617-ladsgroup.json [15:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26184 and previous config saved to /var/cache/conftool/dbconfig/20220422-155835-ladsgroup.json [15:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:03:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:03:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26185 and previous config saved to /var/cache/conftool/dbconfig/20220422-160342-ladsgroup.json [16:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:47] T298565: Fix mismatching field 
type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:06:20] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26186 and previous config saved to /var/cache/conftool/dbconfig/20220422-161340-ladsgroup.json [16:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:33] (CR) Paladox: tlsproxy::localssl: allow setting keepalive_requests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/570612 (https://phabricator.wikimedia.org/T241145) (owner: Ema) [16:28:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26187 and previous config saved to /var/cache/conftool/dbconfig/20220422-162845-ladsgroup.json [16:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26188 and previous config saved to /var/cache/conftool/dbconfig/20220422-164350-ladsgroup.json [16:43:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:43:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:56] T306560: Fix
nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [16:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26189 and previous config saved to /var/cache/conftool/dbconfig/20220422-164359-ladsgroup.json [16:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26190 and previous config saved to /var/cache/conftool/dbconfig/20220422-164507-ladsgroup.json [16:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:47:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26191 and previous config saved to /var/cache/conftool/dbconfig/20220422-164717-ladsgroup.json [16:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P26192 and previous config saved to /var/cache/conftool/dbconfig/20220422-170012-ladsgroup.json [17:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:56] (PS2) Krinkle: static: Remove `/static/current` symlink
[mediawiki-config] - https://gerrit.wikimedia.org/r/779944 (https://phabricator.wikimedia.org/T302465) [17:01:03] (CR) Krinkle: [C: +2] static: Remove `/static/current` symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/779944 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle) [17:01:28] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:43] (Merged) jenkins-bot: static: Remove `/static/current` symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/779944 (https://phabricator.wikimedia.org/T302465) (owner: Krinkle) [17:02:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P26193 and previous config saved to /var/cache/conftool/dbconfig/20220422-170222-ladsgroup.json [17:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:04:41] !log krinkle@deploy1002 Synchronized static/: I5cf2340b3b0358 (duration: 00m 58s) [17:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:05:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:00] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:10] (CR) Urbanecm: Increase AbuseFilter's emergency disable threshold for fawiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) (owner: Huji) [17:14:20] (PS5) Krinkle: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) [17:15:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P26194 and previous config saved to /var/cache/conftool/dbconfig/20220422-171517-ladsgroup.json [17:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P26195 and previous config saved to /var/cache/conftool/dbconfig/20220422-171727-ladsgroup.json [17:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:34] (CR) Cwhite: [C: +1] thanos: aggregate exporter 'up' metrics [puppet] - https://gerrit.wikimedia.org/r/784635 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi) [17:19:58] (CR) Cwhite: [C: +1] prometheus: remove per-exporter up checks [puppet] - https://gerrit.wikimedia.org/r/784636 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi) [17:30:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26196 and previous config saved to /var/cache/conftool/dbconfig/20220422-173022-ladsgroup.json [17:30:24] !log
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [17:30:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [17:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26197 and previous config saved to /var/cache/conftool/dbconfig/20220422-173031-ladsgroup.json [17:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26198 and previous config saved to /var/cache/conftool/dbconfig/20220422-173234-ladsgroup.json [17:32:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [17:32:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [17:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:38] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:42] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26199 and previous config saved to /var/cache/conftool/dbconfig/20220422-173242-ladsgroup.json [17:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:21:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26200 and previous config saved to /var/cache/conftool/dbconfig/20220422-182116-ladsgroup.json [18:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:27:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:32:55] (NodeTextfileStale) firing: 
(3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:32:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26201 and previous config saved to /var/cache/conftool/dbconfig/20220422-183256-ladsgroup.json [18:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:03] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:36:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P26202 and previous config saved to /var/cache/conftool/dbconfig/20220422-183621-ladsgroup.json [18:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:48:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26203 and previous config saved to /var/cache/conftool/dbconfig/20220422-184801-ladsgroup.json [18:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P26204 and previous config saved to /var/cache/conftool/dbconfig/20220422-185126-ladsgroup.json [18:51:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:12] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:03:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26205 and previous config saved to /var/cache/conftool/dbconfig/20220422-190306-ladsgroup.json [19:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:06:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26206 and previous config saved to /var/cache/conftool/dbconfig/20220422-190632-ladsgroup.json [19:06:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [19:06:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [19:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, 
user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:18:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26207 and previous config saved to /var/cache/conftool/dbconfig/20220422-191812-ladsgroup.json [19:18:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [19:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [19:18:17] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T306560)', diff saved to https://phabricator.wikimedia.org/P26208 and previous config saved to /var/cache/conftool/dbconfig/20220422-191820-ladsgroup.json [19:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T306560)', diff saved to https://phabricator.wikimedia.org/P26209 and previous config saved to /var/cache/conftool/dbconfig/20220422-191935-ladsgroup.json [19:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:27:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:32:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:37:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:40:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:50:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:50:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [19:50:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [19:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [19:50:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [19:50:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [19:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [19:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:45] (JobUnavailable) firing: (6) Reduced availability for job gerrit-metrics in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:51:20] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [19:51:42] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [19:54:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:55:45] (JobUnavailable) firing: (8) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:59:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:03:10] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 42302 bytes in 0.113 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [20:04:18] (ProbeDown) resolved: (2) Service 
jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:04:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:04:48] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:04:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:04:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:04] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.040 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [20:05:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [20:05:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [20:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:12] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [20:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [20:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [20:05:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [20:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:45] (JobUnavailable) firing: (8) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [20:06:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [20:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:04] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01363 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:06:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T306560)', diff saved to https://phabricator.wikimedia.org/P26210 and previous config saved to 
/var/cache/conftool/dbconfig/20220422-200605-ladsgroup.json [20:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:10] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:07:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:10:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T306560)', diff saved to https://phabricator.wikimedia.org/P26211 and previous config saved to /var/cache/conftool/dbconfig/20220422-201023-ladsgroup.json [20:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:17:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[20:18:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:19:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P26212 and previous config saved to /var/cache/conftool/dbconfig/20220422-202528-ladsgroup.json [20:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [20:26:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [20:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:28:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [20:28:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1136.eqiad.wmnet with reason: Maintenance [20:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26213 and previous config saved to /var/cache/conftool/dbconfig/20220422-202903-ladsgroup.json [20:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:29:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:29:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:33:12] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004717 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:34:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:39:06] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:39:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:40:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P26214 and previous config saved to /var/cache/conftool/dbconfig/20220422-204033-ladsgroup.json [20:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:06] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:41:14] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.552 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:42:12] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:44:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:44:24] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.881 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:45:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: 
Maintenance [20:45:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [20:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26215 and previous config saved to /var/cache/conftool/dbconfig/20220422-204547-ladsgroup.json [20:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:47:50] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.999 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:49:02] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:50:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26216 and previous config saved to /var/cache/conftool/dbconfig/20220422-205053-ladsgroup.json [20:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:59] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:51:12] RECOVERY - PHP7 rendering 
on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:54:36] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:55:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T306560)', diff saved to https://phabricator.wikimedia.org/P26217 and previous config saved to /var/cache/conftool/dbconfig/20220422-205538-ladsgroup.json [20:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:43] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:59:00] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.883 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:59:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:59:30] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:00:28] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:01:36] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:02:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:02:22] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:03:54] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.472 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:04:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.296 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:05:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47966 bytes in 0.129 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:05:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26218 and previous config saved to /var/cache/conftool/dbconfig/20220422-210559-ladsgroup.json [21:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:38] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:08:52] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:08:58] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:10:58] RECOVERY - 
PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:12:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:16:00] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:16:32] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:16:50] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:17:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:18:10] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.341 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:19:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:20:58] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.399 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:21:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26219 and previous config saved to /var/cache/conftool/dbconfig/20220422-212104-ladsgroup.json [21:21:06] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:24:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:27:58] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:28:06] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:29:04] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:29:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:30:20] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.899 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:31:08] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:31:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:06] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:36:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26220 and previous config saved to /var/cache/conftool/dbconfig/20220422-213609-ladsgroup.json [21:36:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:36:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:14] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.855 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:36:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, 
user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26221 and previous config saved to /var/cache/conftool/dbconfig/20220422-213617-ladsgroup.json [21:36:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26222 and previous config saved to /var/cache/conftool/dbconfig/20220422-213648-ladsgroup.json [21:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:14] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.843 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:39:15] (PS1) Stang: Add tothemoon.ser.asu.edu to the wgCopyUploadsDomains allowlist of commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/785326 (https://phabricator.wikimedia.org/T306671) [21:40:18] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:41:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:41:58] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:42:30] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.095 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:43:42] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:46:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:50:46] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:51:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:51:20] (CR) Stang: "This patch said "to revise on 2020-10-12", so is it still needed for such restriction at present? 
Sorry that I don't have the access to th" [mediawiki-config] - https://gerrit.wikimedia.org/r/631930 (https://phabricator.wikimedia.org/T264489) (owner: Urbanecm) [21:51:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P26223 and previous config saved to /var/cache/conftool/dbconfig/20220422-215153-ladsgroup.json [21:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:19] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (Eevans) >>! In T305568#7873144, @Papaul wrote: > @Eevans I do not have any space issue in codfw for now, so I think [A,D], [B], [C] should work without a problem... [21:58:40] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:59:40] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:03:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:04:50] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:05:22] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:06:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to 
https://phabricator.wikimedia.org/P26224 and previous config saved to /var/cache/conftool/dbconfig/20220422-220658-ladsgroup.json [22:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:08:58] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:12:20] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.313 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:16:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:17:08] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:21:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:22:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26225 and previous config saved to /var/cache/conftool/dbconfig/20220422-222203-ladsgroup.json [22:22:05] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:22:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:56] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:27:36] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:28:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:28:42] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:33:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:34:20] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:36:30] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.342 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:36:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26226 and previous config saved to /var/cache/conftool/dbconfig/20220422-223631-ladsgroup.json [22:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:36:40] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:42:10] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:42:22] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:43:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:46:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:47:08] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:48:32] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:50:42] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:51:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26227 and previous config saved to /var/cache/conftool/dbconfig/20220422-225136-ladsgroup.json [22:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:38] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:55:32] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:56:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:57:29] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:57:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26228 and previous config saved to /var/cache/conftool/dbconfig/20220422-225735-ladsgroup.json [22:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:40] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:58:08] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:59:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:00:06] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.225 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:00:56] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:03:56] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:04:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:05:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:05:44] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [23:06:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26229 and previous config saved to /var/cache/conftool/dbconfig/20220422-230642-ladsgroup.json [23:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:22] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:09:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:12:30] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:14:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:16:22] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [23:18:30] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:19:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:20:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:21:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26230 and previous config saved to /var/cache/conftool/dbconfig/20220422-232147-ladsgroup.json [23:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:53] T298565: Fix 
mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:22:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [23:22:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [23:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26231 and previous config saved to /var/cache/conftool/dbconfig/20220422-232210-ladsgroup.json [23:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:12] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:16] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [23:24:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:25:48] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:28:04] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:30:48] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:32:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:32:42] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.851 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:33:22] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:37:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26232 and previous config saved to /var/cache/conftool/dbconfig/20220422-233829-ladsgroup.json [23:38:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:42:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:42:40] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:44:48] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:47:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:48:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:48:20] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [23:48:52] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[23:51:10] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.620 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:52:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:53:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P26233 and previous config saved to /var/cache/conftool/dbconfig/20220422-235334-ladsgroup.json [23:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:13] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @ayounsi we are at full capacity on both groups for cr1 and we have 2 links that we still need to connect 1- link to cr2-eqiad xe-4/3/0 2 - link to Hurricane Electric xe-4/3/1 on the other side cr... [23:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown