[00:05:28] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:08] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:24] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:13:58] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2064.codfw.wmnet with OS bullseye [00:27:40] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:27:56] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2064.codfw.wmnet with reason: host reimage [00:32:59] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2064.codfw.wmnet with reason: host reimage [00:42:54] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:14] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:46] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:47:28] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2064.codfw.wmnet with OS bullseye [00:51:57] (03PS1) 10RLazarus: shellbox: Bump replicas to 24 to support request spikes [deployment-charts] - 10https://gerrit.wikimedia.org/r/814267 (https://phabricator.wikimedia.org/T310557) [00:52:40] ^ self-merging on a Friday afternoon, but it's a no-op as explained in the commit message -- just cleaning up the diff between the charts repo and production [00:58:34] (03CR) 10RLazarus: [C: 03+2] shellbox: Bump replicas to 24 to support request spikes [deployment-charts] - 10https://gerrit.wikimedia.org/r/814267 (https://phabricator.wikimedia.org/T310557) (owner: 10RLazarus) [01:02:18] (03Merged) 10jenkins-bot: shellbox: Bump replicas to 24 to support request spikes [deployment-charts] - 10https://gerrit.wikimedia.org/r/814267 (https://phabricator.wikimedia.org/T310557) (owner: 10RLazarus) [01:05:51] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:06:01] ^ me, stand by [01:07:17] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:46] 👍 [01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:56:17] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:17:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:57:39] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:36:09] 10SRE-swift-storage: Spike in Swift errors - https://phabricator.wikimedia.org/T313102 (10tstarling) The Swift dashboard only shows errors reported by Swift, but it was not reporting any errors. It was not aware of its own brokenness. Nginx was seeing connection timeouts, and MediaWiki was seeing the 502s genera... [05:21:57] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 23.43 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [05:23:25] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 31.67 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [05:23:29] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 43.69 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [05:25:57] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [05:26:01] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 73.01 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [05:26:59] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 84.84 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [05:54:16] 10SRE-swift-storage: Spike in Swift errors - https://phabricator.wikimedia.org/T313102 (10tstarling) Maybe I should have done a root cause analysis instead of just restarting it. Timeline: * 15:44:20: Spike of "Inactive DC request HTTP error: 503 Service Unavailable" begins in ms-fe1010:server.log. 9111 log ent... [06:30:32] 10SRE-swift-storage: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (10tstarling) [06:31:30] had to check wiktionary to make sure I was using the word "wedge" correctly [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220716T0700) [07:31:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:36:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:55:25] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:15:43] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:05] (03CR) 10Daniel Kinzler: "Is this still needed, or did I63e816edd1a561bda6063f8558ccce88c113df3f fix the issue?" [puppet] - 10https://gerrit.wikimedia.org/r/814134 (https://phabricator.wikimedia.org/T313128) (owner: 10RhinosF1) [12:01:11] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:29:42] (03CR) 10RhinosF1: beta: [fix ci - until next week] add --skip-config-validation to update.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814134 (https://phabricator.wikimedia.org/T313128) (owner: 10RhinosF1) [13:00:43] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:33] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:43:33] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:43:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:44:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:44:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:44:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:44:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T312984)', diff saved to https://phabricator.wikimedia.org/P31186 and previous config saved to /var/cache/conftool/dbconfig/20220716-134429-ladsgroup.json [13:44:34] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [13:51:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312984)', diff saved to https://phabricator.wikimedia.org/P31187 and previous config saved to /var/cache/conftool/dbconfig/20220716-135129-ladsgroup.json [13:51:34] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [14:06:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31188 and previous config saved to /var/cache/conftool/dbconfig/20220716-140634-ladsgroup.json [14:10:57] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:21:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31189 and previous config saved to /var/cache/conftool/dbconfig/20220716-142140-ladsgroup.json [14:36:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312984)', diff saved to https://phabricator.wikimedia.org/P31190 and previous config saved to /var/cache/conftool/dbconfig/20220716-143645-ladsgroup.json [14:36:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [14:36:51] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [14:37:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [14:37:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31191 and previous config saved to /var/cache/conftool/dbconfig/20220716-143705-ladsgroup.json [14:51:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31192 and previous config saved to /var/cache/conftool/dbconfig/20220716-145111-ladsgroup.json [14:51:17] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [15:05:41] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:06:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31193 and previous config saved to /var/cache/conftool/dbconfig/20220716-150616-ladsgroup.json [15:21:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31194 and previous config saved to /var/cache/conftool/dbconfig/20220716-152122-ladsgroup.json [15:36:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31195 and previous config saved to /var/cache/conftool/dbconfig/20220716-153627-ladsgroup.json [15:36:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:36:32] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [15:36:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:36:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31196 and previous config saved to /var/cache/conftool/dbconfig/20220716-153647-ladsgroup.json [15:49:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31197 and previous config saved to /var/cache/conftool/dbconfig/20220716-154903-ladsgroup.json [15:49:09] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [16:04:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31198 and previous config saved to /var/cache/conftool/dbconfig/20220716-160408-ladsgroup.json [16:19:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31199 and previous config saved to /var/cache/conftool/dbconfig/20220716-161913-ladsgroup.json [16:34:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31200 and previous config saved to /var/cache/conftool/dbconfig/20220716-163418-ladsgroup.json [16:34:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:34:23] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [16:34:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:34:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T312984)', diff saved to https://phabricator.wikimedia.org/P31201 and previous config saved to /var/cache/conftool/dbconfig/20220716-163449-ladsgroup.json [16:47:57] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:52:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312984)', diff saved to https://phabricator.wikimedia.org/P31202 and previous config saved to /var/cache/conftool/dbconfig/20220716-165255-ladsgroup.json [16:53:00] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [16:53:57] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:08:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31203 and previous config saved to /var/cache/conftool/dbconfig/20220716-170800-ladsgroup.json [17:08:33] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:23:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31204 and previous config saved to /var/cache/conftool/dbconfig/20220716-172305-ladsgroup.json [17:38:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312984)', diff saved to https://phabricator.wikimedia.org/P31205 and previous config saved to /var/cache/conftool/dbconfig/20220716-173811-ladsgroup.json [17:38:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1171.eqiad.wmnet with reason: Maintenance [17:38:15] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [17:38:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1171.eqiad.wmnet with reason: Maintenance [17:49:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1136.eqiad.wmnet with reason: Maintenance [17:49:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1136.eqiad.wmnet with reason: Maintenance [17:49:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T312984)', diff saved to https://phabricator.wikimedia.org/P31207 and previous config saved to /var/cache/conftool/dbconfig/20220716-174959-ladsgroup.json [17:50:02] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [17:55:21] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:59:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T312984)', diff saved to https://phabricator.wikimedia.org/P31208 and previous config saved to /var/cache/conftool/dbconfig/20220716-175912-ladsgroup.json [17:59:17] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [18:14:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P31209 and previous config saved to /var/cache/conftool/dbconfig/20220716-181417-ladsgroup.json [18:29:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P31210 and previous config saved to /var/cache/conftool/dbconfig/20220716-182922-ladsgroup.json [18:44:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T312984)', diff saved to https://phabricator.wikimedia.org/P31211 and previous config saved to /var/cache/conftool/dbconfig/20220716-184428-ladsgroup.json [18:44:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1127.eqiad.wmnet with reason: Maintenance [18:44:32] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [18:44:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1127.eqiad.wmnet with reason: Maintenance [18:44:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T312984)', diff saved to https://phabricator.wikimedia.org/P31212 and previous config saved to /var/cache/conftool/dbconfig/20220716-184459-ladsgroup.json [19:22:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T312984)', diff saved to https://phabricator.wikimedia.org/P31213 and previous config saved to /var/cache/conftool/dbconfig/20220716-192248-ladsgroup.json [19:22:53] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [19:37:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P31214 and previous config saved to /var/cache/conftool/dbconfig/20220716-193753-ladsgroup.json [19:52:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P31215 and previous config saved to /var/cache/conftool/dbconfig/20220716-195258-ladsgroup.json [20:08:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T312984)', diff saved to https://phabricator.wikimedia.org/P31216 and previous config saved to /var/cache/conftool/dbconfig/20220716-200803-ladsgroup.json [20:08:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:08:09] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [20:08:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:20:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2121.codfw.wmnet with reason: Maintenance [20:20:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2121.codfw.wmnet with reason: Maintenance [20:20:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 10 hosts with reason: Maintenance [20:20:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 10 hosts with reason: Maintenance [20:32:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1101.eqiad.wmnet with reason: Maintenance [20:32:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1101.eqiad.wmnet with reason: Maintenance [20:32:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31217 and previous config saved to /var/cache/conftool/dbconfig/20220716-203238-ladsgroup.json [20:32:42] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [21:32:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31218 and previous config saved to /var/cache/conftool/dbconfig/20220716-213253-ladsgroup.json [21:32:58] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [21:36:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31219 and previous config saved to /var/cache/conftool/dbconfig/20220716-214758-ladsgroup.json [22:03:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31220 and previous config saved to /var/cache/conftool/dbconfig/20220716-220303-ladsgroup.json [22:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312984)', diff saved to https://phabricator.wikimedia.org/P31221 and previous config saved to /var/cache/conftool/dbconfig/20220716-221808-ladsgroup.json [22:18:14] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [23:52:09] (03PS1) 10Zabe: Pin cu_log actor migration to old schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814305 (https://phabricator.wikimedia.org/T233004)