[00:00:05] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:38] (03CR) 10Dzahn: "@Chad was there a reason to pick 05:39 or was it just to randomize the times. We are wondering years later at https://phabricator.wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: 10Chad)
[00:04:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33339 and previous config saved to /var/cache/conftool/dbconfig/20220827-000415-ladsgroup.json
[00:04:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[00:04:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[00:04:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T316186)', diff saved to https://phabricator.wikimedia.org/P33340 and previous config saved to /var/cache/conftool/dbconfig/20220827-000442-ladsgroup.json
[00:06:17] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:29] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:11] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T316186)', diff saved to https://phabricator.wikimedia.org/P33341 and previous config saved to /var/cache/conftool/dbconfig/20220827-001006-ladsgroup.json
[00:15:59] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:25:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P33342 and previous config saved to /var/cache/conftool/dbconfig/20220827-002513-ladsgroup.json
[00:38:32] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: end mailing list campaign in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno)
[00:40:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P33343 and previous config saved to /var/cache/conftool/dbconfig/20220827-004019-ladsgroup.json
[00:55:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T316186)', diff saved to https://phabricator.wikimedia.org/P33344 and previous config saved to /var/cache/conftool/dbconfig/20220827-005525-ladsgroup.json
[00:55:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[00:55:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[00:55:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[00:55:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[00:55:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T316186)', diff saved to https://phabricator.wikimedia.org/P33345 and previous config saved to /var/cache/conftool/dbconfig/20220827-005555-ladsgroup.json
[01:03:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T316186)', diff saved to https://phabricator.wikimedia.org/P33346 and previous config saved to /var/cache/conftool/dbconfig/20220827-010313-ladsgroup.json
[01:18:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P33347 and previous config saved to /var/cache/conftool/dbconfig/20220827-011819-ladsgroup.json
[01:27:19] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:33:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P33348 and previous config saved to /var/cache/conftool/dbconfig/20220827-013325-ladsgroup.json
[01:36:45] (JobUnavailable) firing: (2) Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T316186)', diff saved to https://phabricator.wikimedia.org/P33349 and previous config saved to /var/cache/conftool/dbconfig/20220827-014831-ladsgroup.json
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:47] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:59:55] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:19] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:16:55] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:17:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:17:35] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:21:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48534 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:21:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:22:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:27:19] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:29:01] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:58:07] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:51:01] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:38:07] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:51:05] 10SRE, 10Wikimedia-Etherpad: Upgrade etherpad.wikimedia.org to (more) recent Etherpad version with more rich end-user features - https://phabricator.wikimedia.org/T316421 (10Legoktm) cc'ing @dzahn and @akosiaris because they handled the last upgrade in February: T300568#7675998.
[06:54:29] 10SRE, 10Wikimedia-Etherpad: Upgrade etherpad.wikimedia.org to (more) recent Etherpad version with more rich end-user features - https://phabricator.wikimedia.org/T316421 (10Spinster) While there's some attention here, I'd like to bump {T136744} too - although it's an old ticket, I think it would be really gre...
[06:59:07] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220827T0700)
[07:46:09] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:50:51] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:24:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[08:24:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[08:29:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[08:29:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[08:29:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T316186)', diff saved to https://phabricator.wikimedia.org/P33350 and previous config saved to /var/cache/conftool/dbconfig/20220827-082924-ladsgroup.json
[09:05:57] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:29:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T316186)', diff saved to https://phabricator.wikimedia.org/P33351 and previous config saved to /var/cache/conftool/dbconfig/20220827-092940-ladsgroup.json
[09:38:11] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:44:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P33352 and previous config saved to /var/cache/conftool/dbconfig/20220827-094446-ladsgroup.json
[09:59:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P33353 and previous config saved to /var/cache/conftool/dbconfig/20220827-095953-ladsgroup.json
[10:01:01] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:15:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T316186)', diff saved to https://phabricator.wikimedia.org/P33354 and previous config saved to /var/cache/conftool/dbconfig/20220827-101459-ladsgroup.json
[10:15:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:15:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:15:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T316186)', diff saved to https://phabricator.wikimedia.org/P33355 and previous config saved to /var/cache/conftool/dbconfig/20220827-101523-ladsgroup.json
[10:29:49] RECOVERY - Check systemd state on dse-k8s-worker1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:23] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 2 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[10:33:51] PROBLEM - MariaDB Replica IO: staging on dbstore1005 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:34:17] PROBLEM - MariaDB read only staging on dbstore1005 is CRITICAL: Could not connect to localhost:3350 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:34:41] PROBLEM - MariaDB Replica SQL: staging on dbstore1005 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:35:19] PROBLEM - MariaDB Replica Lag: staging on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:35:21] PROBLEM - mysqld processes on dbstore1005 is CRITICAL: PROCS CRITICAL: 3 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:36:55] PROBLEM - Check systemd state on dse-k8s-worker1005 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:38:35] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:00:27] RECOVERY - mysqld processes on dbstore1005 is OK: PROCS OK: 4 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[11:01:19] RECOVERY - MariaDB Replica IO: staging on dbstore1005 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:01:49] RECOVERY - MariaDB read only staging on dbstore1005 is OK: Version 10.4.22-MariaDB, Uptime 81s, read_only: False, event_scheduler: True, 11.67 QPS, connection latency: 0.004845s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[11:02:09] RECOVERY - MariaDB Replica SQL: staging on dbstore1005 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:02:47] RECOVERY - MariaDB Replica Lag: staging on dbstore1005 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:15:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T316186)', diff saved to https://phabricator.wikimedia.org/P33356 and previous config saved to /var/cache/conftool/dbconfig/20220827-111540-ladsgroup.json
[11:25:15] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[11:30:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P33357 and previous config saved to /var/cache/conftool/dbconfig/20220827-113046-ladsgroup.json
[11:39:51] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:45:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P33358 and previous config saved to /var/cache/conftool/dbconfig/20220827-114552-ladsgroup.json
[11:58:16] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Draco_flavus) Editing on wikisource has some peculiarities., the most wikipedia-users are not familiar with. We work in two steps: 1. preparing t...
[12:00:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T316186)', diff saved to https://phabricator.wikimedia.org/P33359 and previous config saved to /var/cache/conftool/dbconfig/20220827-120059-ladsgroup.json
[12:01:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[12:01:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[12:05:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[12:06:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[12:11:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[12:11:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[12:11:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T316186)', diff saved to https://phabricator.wikimedia.org/P33360 and previous config saved to /var/cache/conftool/dbconfig/20220827-121121-ladsgroup.json
[12:19:35] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[12:56:33] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:11:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T316186)', diff saved to https://phabricator.wikimedia.org/P33361 and previous config saved to /var/cache/conftool/dbconfig/20220827-131136-ladsgroup.json
[13:26:41] (03PS1) 10Urbanecm: cswiki: fix extendedconfirmed permission for bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826955
[13:26:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P33362 and previous config saved to /var/cache/conftool/dbconfig/20220827-132643-ladsgroup.json
[13:41:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P33363 and previous config saved to /var/cache/conftool/dbconfig/20220827-134149-ladsgroup.json
[13:42:03] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[13:51:29] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[13:56:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T316186)', diff saved to https://phabricator.wikimedia.org/P33364 and previous config saved to /var/cache/conftool/dbconfig/20220827-135655-ladsgroup.json
[13:56:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[13:57:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[13:57:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T316186)', diff saved to https://phabricator.wikimedia.org/P33365 and previous config saved to /var/cache/conftool/dbconfig/20220827-135719-ladsgroup.json
[13:57:47] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:05:41] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[14:06:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T316186)', diff saved to https://phabricator.wikimedia.org/P33366 and previous config saved to /var/cache/conftool/dbconfig/20220827-140642-ladsgroup.json
[14:07:45] RECOVERY - Check systemd state on dse-k8s-worker1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:14:51] PROBLEM - Check systemd state on dse-k8s-worker1007 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P33367 and previous config saved to /var/cache/conftool/dbconfig/20220827-142148-ladsgroup.json
[14:22:44] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Minorax)
[14:24:01] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Minorax) >>! In T244567#7591500, @Minora...
[14:36:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P33368 and previous config saved to /var/cache/conftool/dbconfig/20220827-143654-ladsgroup.json
[14:43:03] 10SRE, 10Cloud-VPS, 10Epic, 10cloud-services-team (Kanban): CloudVPS: network architecture - https://phabricator.wikimedia.org/T209460 (10taavi)
[14:52:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T316186)', diff saved to https://phabricator.wikimedia.org/P33369 and previous config saved to /var/cache/conftool/dbconfig/20220827-145201-ladsgroup.json
[14:52:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[14:52:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[14:52:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T316186)', diff saved to https://phabricator.wikimedia.org/P33370 and previous config saved to /var/cache/conftool/dbconfig/20220827-145224-ladsgroup.json
[14:58:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T316186)', diff saved to https://phabricator.wikimedia.org/P33371 and previous config saved to /var/cache/conftool/dbconfig/20220827-145851-ladsgroup.json
[15:00:38] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/826957
[15:13:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P33372 and previous config saved to /var/cache/conftool/dbconfig/20220827-151357-ladsgroup.json
[15:29:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P33373 and previous config saved to /var/cache/conftool/dbconfig/20220827-152903-ladsgroup.json
[15:44:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T316186)', diff saved to https://phabricator.wikimedia.org/P33374 and previous config saved to /var/cache/conftool/dbconfig/20220827-154410-ladsgroup.json
[15:44:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[15:44:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[15:44:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[15:44:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[15:44:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T316186)', diff saved to https://phabricator.wikimedia.org/P33375 and previous config saved to /var/cache/conftool/dbconfig/20220827-154452-ladsgroup.json
[15:45:23] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:50:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T316186)', diff saved to https://phabricator.wikimedia.org/P33376 and previous config saved to /var/cache/conftool/dbconfig/20220827-155010-ladsgroup.json
[16:05:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P33377 and previous config saved to /var/cache/conftool/dbconfig/20220827-160516-ladsgroup.json
[16:08:15] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[16:09:03] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:14:15] (03PS1) 10Urbanecm: Revert "testwiki: Growth: Assign enrollasmentor to *" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826622 (https://phabricator.wikimedia.org/T310905)
[16:14:24] (03PS2) 10Urbanecm: Revert "testwiki: Growth: Assign enrollasmentor to *" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826622 (https://phabricator.wikimedia.org/T310905)
[16:15:09] (03PS1) 10Urbanecm: Revert "[beta] Temporarily allow everyone to enroll as mentor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826623 (https://phabricator.wikimedia.org/T310905)
[16:15:19] (03CR) 10CI reject: [V: 04-1] Revert "[beta] Temporarily allow everyone to enroll as mentor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826623 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm)
[16:17:04] (03PS2) 10Urbanecm: Revert "[beta] Temporarily allow everyone to enroll as mentor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826623 (https://phabricator.wikimedia.org/T310905)
[16:19:27] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno)
[16:20:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P33378 and previous config saved to /var/cache/conftool/dbconfig/20220827-162022-ladsgroup.json
[16:35:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T316186)', diff saved to https://phabricator.wikimedia.org/P33379 and previous config saved to /var/cache/conftool/dbconfig/20220827-163528-ladsgroup.json
[16:35:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[16:35:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[16:41:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[16:41:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[16:41:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T316186)', diff saved to https://phabricator.wikimedia.org/P33380 and previous config saved to /var/cache/conftool/dbconfig/20220827-164156-ladsgroup.json
[16:46:41] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:47:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T316186)', diff saved to https://phabricator.wikimedia.org/P33381 and previous config saved to /var/cache/conftool/dbconfig/20220827-164721-ladsgroup.json
[17:02:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P33382 and previous config saved to /var/cache/conftool/dbconfig/20220827-170227-ladsgroup.json
[17:08:57] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[17:11:47] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:17:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P33383 and previous config saved to /var/cache/conftool/dbconfig/20220827-171734-ladsgroup.json
[17:32:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T316186)', diff saved to https://phabricator.wikimedia.org/P33384 and previous config saved to /var/cache/conftool/dbconfig/20220827-173240-ladsgroup.json
[17:32:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[17:32:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[17:33:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T316186)', diff saved to https://phabricator.wikimedia.org/P33385 and previous config saved to /var/cache/conftool/dbconfig/20220827-173305-ladsgroup.json
[17:38:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T316186)', diff saved to https://phabricator.wikimedia.org/P33386 and previous config saved to /var/cache/conftool/dbconfig/20220827-173824-ladsgroup.json
[17:39:15] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P33387 and previous config saved to /var/cache/conftool/dbconfig/20220827-175330-ladsgroup.json
[18:08:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P33388 and previous config saved to /var/cache/conftool/dbconfig/20220827-180836-ladsgroup.json
[18:20:45] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:23:07] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:23:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T316186)', diff saved to https://phabricator.wikimedia.org/P33389 and previous config saved to /var/cache/conftool/dbconfig/20220827-182343-ladsgroup.json
[18:23:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[18:24:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[18:24:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T316186)', diff saved to https://phabricator.wikimedia.org/P33390 and previous config saved to /var/cache/conftool/dbconfig/20220827-182408-ladsgroup.json
[18:25:53] RECOVERY - Check systemd state on dse-k8s-worker1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:29:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T316186)', diff saved to https://phabricator.wikimedia.org/P33391 and previous config saved to /var/cache/conftool/dbconfig/20220827-182931-ladsgroup.json
[18:30:17] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[18:32:31] PROBLEM - Host cp5004 is DOWN: PING CRITICAL - Packet
loss = 100% [18:32:31] PROBLEM - Host cp5015 is DOWN: PING CRITICAL - Packet loss = 100% [18:32:57] RECOVERY - Host cp5004 is UP: PING WARNING - Packet loss = 77%, RTA = 310.58 ms [18:32:59] RECOVERY - Host cp5015 is UP: PING OK - Packet loss = 0%, RTA = 312.39 ms [18:33:01] PROBLEM - Check systemd state on dse-k8s-worker1006 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:01] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:35:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:36:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:37:59] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:38:29] RECOVERY - Check systemd state on dse-k8s-worker1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:45] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:41:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.309 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:42:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:42:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:44:29] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:44:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P33392 and previous config saved to /var/cache/conftool/dbconfig/20220827-184438-ladsgroup.json [18:45:35] PROBLEM - Check systemd state on dse-k8s-worker1007 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P33393 and previous config saved to /var/cache/conftool/dbconfig/20220827-185944-ladsgroup.json [19:04:29] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:14:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T316186)', diff saved to https://phabricator.wikimedia.org/P33394 and previous config saved to /var/cache/conftool/dbconfig/20220827-191450-ladsgroup.json [19:14:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [19:15:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [19:15:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T316186)', diff saved to https://phabricator.wikimedia.org/P33395 and previous config saved to /var/cache/conftool/dbconfig/20220827-191515-ladsgroup.json [19:20:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T316186)', diff saved to https://phabricator.wikimedia.org/P33396 and previous config saved to /var/cache/conftool/dbconfig/20220827-192040-ladsgroup.json [19:35:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P33397 and previous config saved to /var/cache/conftool/dbconfig/20220827-193546-ladsgroup.json [19:50:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P33398 and previous config saved to /var/cache/conftool/dbconfig/20220827-195053-ladsgroup.json [20:05:51] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:05:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T316186)', diff saved to https://phabricator.wikimedia.org/P33399 and previous config saved to 
/var/cache/conftool/dbconfig/20220827-200559-ladsgroup.json [20:06:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [20:06:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [20:06:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [20:06:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [20:06:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T316186)', diff saved to https://phabricator.wikimedia.org/P33400 and previous config saved to /var/cache/conftool/dbconfig/20220827-200639-ladsgroup.json [20:08:25] RECOVERY - Check systemd state on dse-k8s-worker1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T316186)', diff saved to https://phabricator.wikimedia.org/P33401 and previous config saved to /var/cache/conftool/dbconfig/20220827-201250-ladsgroup.json [20:15:31] PROBLEM - Check systemd state on dse-k8s-worker1007 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P33402 and previous config saved to /var/cache/conftool/dbconfig/20220827-202757-ladsgroup.json [20:38:13] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[20:43:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P33403 and previous config saved to /var/cache/conftool/dbconfig/20220827-204303-ladsgroup.json [20:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T316186)', diff saved to https://phabricator.wikimedia.org/P33404 and previous config saved to /var/cache/conftool/dbconfig/20220827-205809-ladsgroup.json [21:52:37] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:10:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [22:11:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [22:11:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T316186)', diff saved to https://phabricator.wikimedia.org/P33405 and previous config saved to /var/cache/conftool/dbconfig/20220827-221118-ladsgroup.json [22:16:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T316186)', diff saved to https://phabricator.wikimedia.org/P33406 and previous config saved to /var/cache/conftool/dbconfig/20220827-221631-ladsgroup.json [22:17:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2149.codfw.wmnet with reason: Sad disk [22:17:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2149.codfw.wmnet with reason: Sad disk [22:17:09] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:17:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149', 
diff saved to https://phabricator.wikimedia.org/P33407 and previous config saved to /var/cache/conftool/dbconfig/20220827-221749-ladsgroup.json [22:31:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P33408 and previous config saved to /var/cache/conftool/dbconfig/20220827-223137-ladsgroup.json [22:46:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P33409 and previous config saved to /var/cache/conftool/dbconfig/20220827-224644-ladsgroup.json [22:53:49] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:01:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T316186)', diff saved to https://phabricator.wikimedia.org/P33410 and previous config saved to /var/cache/conftool/dbconfig/20220827-230150-ladsgroup.json [23:01:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [23:02:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [23:02:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33411 and previous config saved to /var/cache/conftool/dbconfig/20220827-230214-ladsgroup.json [23:03:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33412 and previous config saved to /var/cache/conftool/dbconfig/20220827-230339-ladsgroup.json [23:07:35] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[23:10:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33413 and previous config saved to /var/cache/conftool/dbconfig/20220827-231038-ladsgroup.json [23:20:37] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:25:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P33414 and previous config saved to /var/cache/conftool/dbconfig/20220827-232544-ladsgroup.json [23:40:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P33415 and previous config saved to /var/cache/conftool/dbconfig/20220827-234050-ladsgroup.json [23:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T316186)', diff saved to https://phabricator.wikimedia.org/P33416 and previous config saved to /var/cache/conftool/dbconfig/20220827-235556-ladsgroup.json [23:58:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33417 and previous config saved to /var/cache/conftool/dbconfig/20220827-235810-ladsgroup.json