[00:01:06] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:06:44] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:09:04] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 13 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:23:50] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:28:08] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:28:26] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:18:30] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:20:41] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 14 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:26:24] (PS1) Andrew Bogott: keystone: add restrict_password_auth flag [puppet] - https://gerrit.wikimedia.org/r/824830 (https://phabricator.wikimedia.org/T294195)
[01:28:52] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:29:41] PROBLEM - MariaDB Replica Lag: s4 #page on db1143 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1356.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:31:10] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:14] 👋 looking
[01:31:20] (CR) Andrew Bogott: "pcc output: https://puppet-compiler.wmflabs.org/pcc-worker1001/36858/" [puppet] - https://gerrit.wikimedia.org/r/824830 (https://phabricator.wikimedia.org/T294195) (owner: Andrew Bogott)
[01:31:46] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:35:03] rzl: late to the party, need help?
[01:35:05] !log rzl@cumin2002 dbctl commit (dc=all): 'Depool db1143', diff saved to https://phabricator.wikimedia.org/P32638 and previous config saved to /var/cache/conftool/dbconfig/20220821-013504-rzl.json
[01:35:57] jhathaway: nah, all good -- not sure why replication stopped but the depool should be all we need and I'll open a task for DBAs to follow up
[01:36:05] go enjoy your Saturday, thanks though!
[01:36:23] rzl: thanks for doing the needful!
[01:36:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:54] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:50] (CR) Legoktm: "I obviously haven't found time to actually test this since my last comment...would either of you two be interested in pushing this forward" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: Legoktm)
[01:42:58] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:20] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:27] SRE, DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (RLazarus) p:Triage→High
[01:51:40] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:28] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:05:34] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:06:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:08] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:22:08] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:23:54] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:29:11] SRE, DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (Peachey88)
[02:29:26] SRE, DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (Peachey88)
[02:31:38] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:30] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:43:30] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32639 and previous config saved to /var/cache/conftool/dbconfig/20220821-024502-ladsgroup.json
[02:45:08] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[02:45:50] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:46:56] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:50:56] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:00:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P32640 and previous config saved to /var/cache/conftool/dbconfig/20220821-030008-ladsgroup.json
[03:01:08] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P32641 and previous config saved to /var/cache/conftool/dbconfig/20220821-031514-ladsgroup.json
[03:15:20] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:20] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32642 and previous config saved to /var/cache/conftool/dbconfig/20220821-033020-ladsgroup.json
[03:30:26] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[03:31:40] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:38:44] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:46] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:59:12] SRE, DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (Marostegui) Thanks - I will check if this might be a 10.6 thing so please leave it as it is
[04:00:00] (PS1) Andrew Bogott: OpenStack: add files and templates for release Xena [puppet] - https://gerrit.wikimedia.org/r/824831 (https://phabricator.wikimedia.org/T296561)
[04:00:02] (PS1) Andrew Bogott: Remove refs to cinder v2 api -- it was removed in X. [puppet] - https://gerrit.wikimedia.org/r/824832 (https://phabricator.wikimedia.org/T296561)
[04:00:04] (PS1) Andrew Bogott: Add openstack serverpackages manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824833 (https://phabricator.wikimedia.org/T296561)
[04:00:06] (PS1) Andrew Bogott: Add openstack client package manifests for Xena [puppet] - https://gerrit.wikimedia.org/r/824834 (https://phabricator.wikimedia.org/T296561)
[04:00:08] (PS1) Andrew Bogott: Add glance manifest for Openstack Xena [puppet] - https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561)
[04:00:10] (PS1) Andrew Bogott: Add Magnum manifest for OpenStack Xena [puppet] - https://gerrit.wikimedia.org/r/824836 (https://phabricator.wikimedia.org/T296561)
[04:00:12] (PS1) Andrew Bogott: Add manifests for openstack Designate version Xena [puppet] - https://gerrit.wikimedia.org/r/824837 (https://phabricator.wikimedia.org/T296561)
[04:00:14] (PS1) Andrew Bogott: Add manifests for Openstack Cinder Xena [puppet] - https://gerrit.wikimedia.org/r/824838 (https://phabricator.wikimedia.org/T296561)
[04:00:16] (PS1) Andrew Bogott: Neutron: add manifest for Xena services [puppet] - https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561)
[04:00:18] (PS1) Andrew Bogott: Openstack Trove: replace file overlays with patch files for Xena [puppet] - https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561)
[04:00:20] (PS1) Andrew Bogott: Keystone: add manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561)
[04:00:22] (PS1) Andrew Bogott: Keystone: replace file overlay with patch file for Xena [puppet] - https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561)
[04:00:24] (PS1) Andrew Bogott: Add manifest for Openstack Heat version Xena [puppet] - https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561)
[04:00:26] (PS1) Andrew Bogott: Add manifest for openstack barbican version Xena [puppet] - https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561)
[04:00:28] (PS1) Andrew Bogott: Add manifests for Openstack Nova version Xena [puppet] - https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561)
[04:00:30] (PS1) Andrew Bogott: Add manifest for openstack Placement service, version Xena [puppet] - https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561)
[04:01:46] (CR) CI reject: [V: -1] OpenStack: add files and templates for release Xena [puppet] - https://gerrit.wikimedia.org/r/824831 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[04:03:54] (CR) CI reject: [V: -1] Add glance manifest for Openstack Xena [puppet] - https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[04:10:24] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:11:03] (CR) CI reject: [V: -1] Neutron: add manifest for Xena services [puppet] - https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[04:13:55] (CR) CI reject: [V: -1] Openstack Trove: replace file overlays with patch files for Xena [puppet] - https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[04:14:56] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[04:17:05] (CR) CI reject: [V: -1] Keystone: add manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[04:17:10] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:19:32] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:20:24] (CR) CI reject: [V: -1] Keystone: replace file overlay with patch file for Xena [puppet] - https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[04:23:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[04:23:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[04:23:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:24:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:24:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T314041)', diff saved to https://phabricator.wikimedia.org/P32643 and previous config saved to /var/cache/conftool/dbconfig/20220821-042415-ladsgroup.json
[04:24:19] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[04:25:54] (PS2) Andrew Bogott: Add Magnum manifest for OpenStack Xena [puppet] - https://gerrit.wikimedia.org/r/824836 (https://phabricator.wikimedia.org/T296561)
[04:25:56] (PS2) Andrew Bogott: Add manifests for openstack Designate version Xena [puppet] - https://gerrit.wikimedia.org/r/824837 (https://phabricator.wikimedia.org/T296561)
[04:25:58] (PS2) Andrew Bogott: Add manifests for Openstack Cinder Xena [puppet] - https://gerrit.wikimedia.org/r/824838 (https://phabricator.wikimedia.org/T296561)
[04:26:00] (PS2) Andrew Bogott: Openstack Trove: replace file overlays with patch files for Xena [puppet] - https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561)
[04:26:02] (PS2) Andrew Bogott: Keystone: add manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561)
[04:26:04] (PS2) Andrew Bogott: Keystone: replace file overlay with patch file for Xena [puppet] - https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561)
[04:26:06] (PS2) Andrew Bogott: Add manifest for Openstack Heat version Xena [puppet] - https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561)
[04:26:08] (PS2) Andrew Bogott: Add manifest for openstack barbican version Xena [puppet] - https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561)
[04:26:10] (PS2) Andrew Bogott: Add manifests for Openstack Nova version Xena [puppet] - https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561)
[04:26:12] (PS2) Andrew Bogott: Add manifest for openstack Placement service, version Xena [puppet] - https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561)
[04:26:14] (PS2) Andrew Bogott: Add glance manifest for Openstack Xena [puppet] - https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561)
[04:26:16] (PS2) Andrew Bogott: Neutron: add manifest for Xena services [puppet] - https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561)
[04:37:28] (PS2) Andrew Bogott: OpenStack: add files and templates for release Xena [puppet] - https://gerrit.wikimedia.org/r/824831 (https://phabricator.wikimedia.org/T296561)
[04:37:30] (PS2) Andrew Bogott: Remove refs to cinder v2 api -- it was removed in X. [puppet] - https://gerrit.wikimedia.org/r/824832 (https://phabricator.wikimedia.org/T296561)
[04:37:32] (PS2) Andrew Bogott: Add openstack serverpackages manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824833 (https://phabricator.wikimedia.org/T296561)
[04:37:34] (PS2) Andrew Bogott: Add openstack client package manifests for Xena [puppet] - https://gerrit.wikimedia.org/r/824834 (https://phabricator.wikimedia.org/T296561)
[04:37:36] (PS3) Andrew Bogott: Add Magnum manifest for OpenStack Xena [puppet] - https://gerrit.wikimedia.org/r/824836 (https://phabricator.wikimedia.org/T296561)
[04:37:38] (PS3) Andrew Bogott: Add manifests for openstack Designate version Xena [puppet] - https://gerrit.wikimedia.org/r/824837 (https://phabricator.wikimedia.org/T296561)
[04:37:40] (PS3) Andrew Bogott: Add manifests for Openstack Cinder Xena [puppet] - https://gerrit.wikimedia.org/r/824838 (https://phabricator.wikimedia.org/T296561)
[04:37:42] (PS3) Andrew Bogott: Openstack Trove: replace file overlays with patch files for Xena [puppet] - https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561)
[04:37:44] (PS3) Andrew Bogott: Keystone: add manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561)
[04:37:46] (PS3) Andrew Bogott: Keystone: replace file overlay with patch file for Xena [puppet] - https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561)
[04:37:48] (PS3) Andrew Bogott: Add manifest for Openstack Heat version Xena [puppet] - https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561)
[04:37:50] (PS3) Andrew Bogott: Add manifest for openstack barbican version Xena [puppet] - https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561)
[04:37:52] (PS3) Andrew Bogott: Add manifests for Openstack Nova version Xena [puppet] - https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561)
[04:37:54] (PS3) Andrew Bogott: Add manifest for openstack Placement service, version Xena [puppet] - https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561)
[04:37:56] (PS3) Andrew Bogott: Add glance manifest for Openstack Xena [puppet] - https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561)
[04:37:58] (PS3) Andrew Bogott: Neutron: add manifest for Xena services [puppet] - https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561)
[04:45:39] That last patchset may have caused CI to quit in disgust
[04:46:55] (CR) Andrew Bogott: [C: +1] rabbit.drain_queue: Don't fail if the queue has no messages [puppet] - https://gerrit.wikimedia.org/r/814726 (owner: David Caro)
[04:48:24] (CR) Andrew Bogott: [C: +1] hieradata: set swift_clusters: {} on cloud [puppet] - https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: Majavah)
[04:55:50] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:59:32] (PS3) Andrew Bogott: OpenStack: add files and templates for release Xena [puppet] - https://gerrit.wikimedia.org/r/824831 (https://phabricator.wikimedia.org/T296561)
[04:59:34] (PS3) Andrew Bogott: Remove refs to cinder v2 api -- it was removed in X. [puppet] - https://gerrit.wikimedia.org/r/824832 (https://phabricator.wikimedia.org/T296561)
[04:59:36] (PS3) Andrew Bogott: Add openstack serverpackages manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824833 (https://phabricator.wikimedia.org/T296561)
[04:59:38] (PS3) Andrew Bogott: Add openstack client package manifests for Xena [puppet] - https://gerrit.wikimedia.org/r/824834 (https://phabricator.wikimedia.org/T296561)
[04:59:40] (PS4) Andrew Bogott: Add Magnum manifest for OpenStack Xena [puppet] - https://gerrit.wikimedia.org/r/824836 (https://phabricator.wikimedia.org/T296561)
[04:59:42] (PS4) Andrew Bogott: Add manifests for openstack Designate version Xena [puppet] - https://gerrit.wikimedia.org/r/824837 (https://phabricator.wikimedia.org/T296561)
[04:59:44] (PS4) Andrew Bogott: Add manifests for Openstack Cinder Xena [puppet] - https://gerrit.wikimedia.org/r/824838 (https://phabricator.wikimedia.org/T296561)
[04:59:46] (PS4) Andrew Bogott: Openstack Trove: replace file overlays with patch files for Xena [puppet] - https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561)
[04:59:48] (PS4) Andrew Bogott: Keystone: add manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561)
[04:59:50] (PS4) Andrew Bogott: Keystone: replace file overlay with patch file for Xena [puppet] - https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561)
[04:59:52] (PS4) Andrew Bogott: Add manifest for Openstack Heat version Xena [puppet] - https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561)
[04:59:54] (PS4) Andrew Bogott: Add manifest for openstack barbican version Xena [puppet] - https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561)
[04:59:56] (PS4) Andrew Bogott: Add manifests for Openstack Nova version Xena [puppet] - https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561)
[04:59:58] (PS4) Andrew Bogott: Add manifest for openstack Placement service, version Xena [puppet] - https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561)
[05:00:00] (PS4) Andrew Bogott: Add glance manifest for Openstack Xena [puppet] - https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561)
[05:00:02] (PS4) Andrew Bogott: Neutron: add manifest for Xena services [puppet] - https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561)
[05:02:00] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:05:10] SRE, DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (Marostegui) It doesn't look like the usual problem, as the host just caught up
[05:07:04] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:07:05] SRE, DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (Marostegui) Nevermind the above comment, the host is still lagged and it does look like the usual stall.
[05:07:31] (PS1) Marostegui: db1143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/824849 (https://phabricator.wikimedia.org/T315742)
[05:08:58] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 6 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:14:18] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/824849 (https://phabricator.wikimedia.org/T315742) (owner: Marostegui)
[05:17:39] (CR) Marostegui: [V: +2 C: +2] db1143: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/824849 (https://phabricator.wikimedia.org/T315742) (owner: Marostegui)
[05:24:58] SRE, DBA, Patch-For-Review: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (Marostegui) So there are some things here that are different from the usual stalls from {T311106} so I am not fully sure if that's the same case, but it looks similar as it only affected the 10.6...
[05:39:36] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:49:06] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[05:57:00] (CR) CI reject: [V: -1] OpenStack: add files and templates for release Xena [puppet] - https://gerrit.wikimedia.org/r/824831 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[06:02:33] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:04:46] (CR) CI reject: [V: -1] Openstack Trove: replace file overlays with patch files for Xena [puppet] - https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[06:07:06] (CR) CI reject: [V: -1] Keystone: add manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[06:09:39] (CR) CI reject: [V: -1] Keystone: replace file overlay with patch file for Xena [puppet] - https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott)
[06:09:47] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:47] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 775 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:13:03] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:15:17] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:15:21] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 775 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:44:56] (CR) Majavah: [C: -1] "some questions/comments inline" [puppet] - https://gerrit.wikimedia.org/r/824830 (https://phabricator.wikimedia.org/T294195) (owner: Andrew Bogott)
[06:48:19] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[06:52:59] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220821T0700)
[07:08:29] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:13:35] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:55] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:27] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:01] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:35] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:54:37] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:10:45] PROBLEM - SSH on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:14:59] PROBLEM - SSH on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:19:11] PROBLEM - Check systemd state on wdqs1016 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:11] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:25] PROBLEM - SSH on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:30:51] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:35:47] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[08:36:11] PROBLEM - Check systemd state on wdqs1015 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:38:53] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T314041)', diff saved to https://phabricator.wikimedia.org/P32644 and previous config saved to /var/cache/conftool/dbconfig/20220821-084209-ladsgroup.json
[08:42:14] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[08:42:33] RECOVERY - Check systemd state on wdqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:23] PROBLEM - Check systemd state on wdqs1014 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:49] RECOVERY - SSH on wdqs1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:45:55] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:31] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[08:47:45] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:47:55] RECOVERY - Check systemd state on wdqs1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:15] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:57:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P32645 and previous config saved to /var/cache/conftool/dbconfig/20220821-085716-ladsgroup.json
[09:02:29] RECOVERY - SSH on wdqs1014 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:09:11] RECOVERY - Check systemd state on wdqs1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P32646 and previous
config saved to /var/cache/conftool/dbconfig/20220821-091221-ladsgroup.json [09:18:01] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [09:22:43] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 12 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [09:25:15] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:25:27] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:27:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T314041)', diff saved to https://phabricator.wikimedia.org/P32647 and previous config saved to /var/cache/conftool/dbconfig/20220821-092727-ladsgroup.json [09:27:33] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:19:01] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [10:21:23] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [10:22:39] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:51] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [10:33:21] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, 
currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:40:17] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [10:45:54] I left https://phabricator.wikimedia.org/T315748 for the an-worker megaraid [10:46:17] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:05] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:27] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:05] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:22:37] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:28:03] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:39:19] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [11:46:23] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 
processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:02:53] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:05:15] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:10:52] PROBLEM - MariaDB Replica Lag: s1 #page on db1132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1423.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:11:24] I will depool it [12:11:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P32648 and previous config saved to /var/cache/conftool/dbconfig/20220821-121140-root.json [12:11:42] Done [12:11:43] depooling [12:11:49] wow, that was fast [12:11:52] <_joe_> ah already on it I see [12:11:55] <_joe_> lol yeah [12:12:04] I will create a task - it might be the 10.6 issue [12:12:04] * _joe_ back to lunch [12:12:14] Yes, it is [12:12:18] <_joe_> :/ [12:12:24] I depool the rest of 10.6 [12:12:39] <_joe_> Amir1: can we manage in that situation? [12:12:59] yeah, they were depooled for three weeks [12:13:46] I pretend I didn't see any pc db on 10.6 [12:14:45] nah [12:14:47] pc is fine [12:14:59] I resolved the incident so it doesn't alert tomorrow [12:15:36] Amir1: Sounds good [12:15:51] I am creating the task and doing an initial analysis [12:15:57] I will ping mariadb people tomorrow too [12:16:31] marostegui: are these 10.6.9? 
It looks like a new issue [12:18:19] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) [12:18:27] 10SRE, 10DBA: Replication stopped on db1143 - https://phabricator.wikimedia.org/T315742 (10Marostegui) [12:18:35] (03PS1) 10Marostegui: db1132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/824861 [12:19:33] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) We are depooling all 10.6 hosts again after two issues today with db1132 (s1) (T315754) and db1143 (s4) (T315742) where... [12:19:55] (03CR) 10Marostegui: [C: 03+2] db1132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/824861 (owner: 10Marostegui) [12:20:16] Amir1: No, they are 10.6.8 [12:20:36] interesting, we haven't seen this issue before [12:25:31] Amir1: Are you depooling the two other 10.6 hosts running in sX? 
[12:25:38] I'm on it [12:25:50] Ah ok, if not let me know, I can do it too [12:25:58] nah, enjoy your weekend [12:26:39] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool 10.6 hosts', diff saved to https://phabricator.wikimedia.org/P32649 and previous config saved to /var/cache/conftool/dbconfig/20220821-123038-ladsgroup.json [12:31:03] Done ^ [12:31:25] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:36] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) @Danielgblack Double check T315754 and db1143 T315742 for the latest occurrences if you've got time. The hosts haven't... 
[12:36:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db[1111,1127,1132].eqiad.wmnet with reason: 10.6 being 10.6 [12:36:16] downtimed them, just in case [12:36:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db[1111,1127,1132].eqiad.wmnet with reason: 10.6 being 10.6 [13:02:51] (03CR) 10Andrew Bogott: keystone: add restrict_password_auth flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824830 (https://phabricator.wikimedia.org/T294195) (owner: 10Andrew Bogott) [13:07:48] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:09:02] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:10:18] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:12:16] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [13:18:44] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [13:22:42] (03PS1) 10Ori: Set $wgCdnMatchParameterOrder to false by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824865 (https://phabricator.wikimedia.org/T314868) [13:28:26] (03PS4) 10Andrew Bogott: OpenStack: add files and templates for release Xena [puppet] - 10https://gerrit.wikimedia.org/r/824831 (https://phabricator.wikimedia.org/T296561) [13:28:28] (03PS4) 10Andrew Bogott: Remove refs to cinder v2 api -- it was removed in X. 
[puppet] - 10https://gerrit.wikimedia.org/r/824832 (https://phabricator.wikimedia.org/T296561) [13:28:30] (03PS4) 10Andrew Bogott: Add openstack serverpackages manifest for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824833 (https://phabricator.wikimedia.org/T296561) [13:28:32] (03PS4) 10Andrew Bogott: Add openstack client package manifests for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824834 (https://phabricator.wikimedia.org/T296561) [13:28:34] (03PS5) 10Andrew Bogott: Add Magnum manifest for OpenStack Xena [puppet] - 10https://gerrit.wikimedia.org/r/824836 (https://phabricator.wikimedia.org/T296561) [13:28:36] (03PS5) 10Andrew Bogott: Add manifests for openstack Designate version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824837 (https://phabricator.wikimedia.org/T296561) [13:28:38] (03PS5) 10Andrew Bogott: Add manifests for Openstack Cinder Xena [puppet] - 10https://gerrit.wikimedia.org/r/824838 (https://phabricator.wikimedia.org/T296561) [13:28:40] (03PS5) 10Andrew Bogott: Openstack Trove: replace file overlays with patch files for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) [13:28:42] (03PS5) 10Andrew Bogott: Keystone: add manifest for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561) [13:28:44] (03PS5) 10Andrew Bogott: Keystone: replace file overlay with patch file for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561) [13:28:46] (03PS5) 10Andrew Bogott: Add manifest for Openstack Heat version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561) [13:28:48] (03PS5) 10Andrew Bogott: Add manifest for openstack barbican version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561) [13:28:50] (03PS5) 10Andrew Bogott: Add manifests for Openstack Nova version Xena [puppet] - 
10https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561) [13:28:52] (03PS5) 10Andrew Bogott: Add manifest for openstack Placement service, version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561) [13:28:54] (03PS5) 10Andrew Bogott: Add glance manifest for Openstack Xena [puppet] - 10https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561) [13:28:56] (03PS5) 10Andrew Bogott: Neutron: add manifest for Xena services [puppet] - 10https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561) [13:29:26] (03CR) 10CI reject: [V: 04-1] OpenStack: add files and templates for release Xena [puppet] - 10https://gerrit.wikimedia.org/r/824831 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott) [13:33:16] (03PS5) 10Andrew Bogott: OpenStack: add files and templates for release Xena [puppet] - 10https://gerrit.wikimedia.org/r/824831 (https://phabricator.wikimedia.org/T296561) [13:33:18] (03PS5) 10Andrew Bogott: Remove refs to cinder v2 api -- it was removed in X. 
[puppet] - 10https://gerrit.wikimedia.org/r/824832 (https://phabricator.wikimedia.org/T296561) [13:33:20] (03PS5) 10Andrew Bogott: Add openstack serverpackages manifest for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824833 (https://phabricator.wikimedia.org/T296561) [13:33:22] (03PS5) 10Andrew Bogott: Add openstack client package manifests for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824834 (https://phabricator.wikimedia.org/T296561) [13:33:24] (03PS6) 10Andrew Bogott: Add Magnum manifest for OpenStack Xena [puppet] - 10https://gerrit.wikimedia.org/r/824836 (https://phabricator.wikimedia.org/T296561) [13:33:26] (03PS6) 10Andrew Bogott: Add manifests for openstack Designate version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824837 (https://phabricator.wikimedia.org/T296561) [13:33:28] (03PS6) 10Andrew Bogott: Add manifests for Openstack Cinder Xena [puppet] - 10https://gerrit.wikimedia.org/r/824838 (https://phabricator.wikimedia.org/T296561) [13:33:30] (03PS6) 10Andrew Bogott: Openstack Trove: replace file overlays with patch files for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) [13:33:32] (03PS6) 10Andrew Bogott: Keystone: add manifest for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561) [13:33:32] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:34] (03PS6) 10Andrew Bogott: Keystone: replace file overlay with patch file for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561) [13:33:36] (03PS6) 10Andrew Bogott: Add manifest for Openstack Heat version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561) [13:33:38] (03PS6) 10Andrew Bogott: Add manifest for openstack barbican version Xena 
[puppet] - 10https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561) [13:33:40] (03PS6) 10Andrew Bogott: Add manifests for Openstack Nova version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561) [13:33:42] (03PS6) 10Andrew Bogott: Add manifest for openstack Placement service, version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561) [13:33:44] (03PS6) 10Andrew Bogott: Add glance manifest for Openstack Xena [puppet] - 10https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561) [13:33:46] (03PS6) 10Andrew Bogott: Neutron: add manifest for Xena services [puppet] - 10https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561) [13:39:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:43:42] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:43:48] (03CR) 10Samtar: [C: 03+1] "lgtm 👀" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824865 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [13:44:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:45:16] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:38] (03CR) 10CI reject: [V: 04-1] Openstack Trove: replace file overlays with patch files for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott) [14:22:42] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:28] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [14:32:02] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:00] !log krinkle@mwmaint1002 foreachwikiindblist 'small - closed' deleteEqualMessages.php [14:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:28] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [14:36:48] !log krinkle@mwmaint1002 foreachwikiindblist 'all - small' deleteEqualMessages.php [14:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:24] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:24] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) timed out before a response was received: / (spec from root) is CRITICAL: Test spec from root returned the unexpected 
status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [14:43:42] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:02:10] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:36] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:04:36] (03PS7) 10Andrew Bogott: Openstack Trove: replace file overlays with patch files for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) [16:04:38] (03PS7) 10Andrew Bogott: Keystone: add manifest for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561) [16:04:40] (03PS7) 10Andrew Bogott: Keystone: replace file overlay with patch file for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561) [16:04:42] (03PS7) 10Andrew Bogott: Add manifest for Openstack Heat version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561) [16:04:44] (03PS7) 10Andrew Bogott: Add manifest for openstack barbican version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561) [16:04:46] (03PS7) 10Andrew Bogott: Add manifests for Openstack Nova version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561) [16:04:48] (03PS7) 10Andrew Bogott: Add manifest for openstack Placement service, version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561) [16:04:50] (03PS7) 10Andrew Bogott: Add glance manifest for Openstack Xena [puppet] - 10https://gerrit.wikimedia.org/r/824835 
(https://phabricator.wikimedia.org/T296561) [16:04:52] (03PS7) 10Andrew Bogott: Neutron: add manifest for Xena services [puppet] - 10https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561) [16:07:34] (03CR) 10CI reject: [V: 04-1] Openstack Trove: replace file overlays with patch files for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) (owner: 10Andrew Bogott) [16:10:59] (03PS8) 10Andrew Bogott: Openstack Trove: replace file overlays with patch files for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) [16:11:01] (03PS8) 10Andrew Bogott: Keystone: add manifest for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561) [16:11:03] (03PS8) 10Andrew Bogott: Keystone: replace file overlay with patch file for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561) [16:11:05] (03PS8) 10Andrew Bogott: Add manifest for Openstack Heat version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561) [16:11:07] (03PS8) 10Andrew Bogott: Add manifest for openstack barbican version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561) [16:11:09] (03PS8) 10Andrew Bogott: Add manifests for Openstack Nova version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561) [16:11:11] (03PS8) 10Andrew Bogott: Add manifest for openstack Placement service, version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561) [16:11:13] (03PS8) 10Andrew Bogott: Add glance manifest for Openstack Xena [puppet] - 10https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561) [16:11:15] (03PS8) 10Andrew Bogott: Neutron: add manifest for Xena services [puppet] - 
10https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561) [16:14:44] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:21:46] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:55:46] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:06:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:50] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [17:49:12] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 8 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [18:03:51] (03PS1) 10Andrew Bogott: Openstack Designate codfw1dev to Xena [puppet] - 10https://gerrit.wikimedia.org/r/824885 (https://phabricator.wikimedia.org/T296561) [18:11:43] (03PS1) 10Andrew Bogott: Openstack codfw1dev to version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824886 (https://phabricator.wikimedia.org/T296561) [18:24:36] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 
497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [18:26:56] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [18:49:08] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:23:10] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:10:06] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [20:12:28] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [20:23:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:28:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:52:34] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [20:54:54] RECOVERY - PHD 
should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [20:56:43] (03PS6) 10Andrew Bogott: OpenStack: add files and templates for release Xena [puppet] - 10https://gerrit.wikimedia.org/r/824831 (https://phabricator.wikimedia.org/T296561) [20:56:45] (03PS6) 10Andrew Bogott: Remove refs to cinder v2 api -- it was removed in X. [puppet] - 10https://gerrit.wikimedia.org/r/824832 (https://phabricator.wikimedia.org/T296561) [20:56:47] (03PS6) 10Andrew Bogott: Add openstack serverpackages manifest for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824833 (https://phabricator.wikimedia.org/T296561) [20:56:49] (03PS6) 10Andrew Bogott: Add openstack client package manifests for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824834 (https://phabricator.wikimedia.org/T296561) [20:56:51] (03PS7) 10Andrew Bogott: Add Magnum manifest for OpenStack Xena [puppet] - 10https://gerrit.wikimedia.org/r/824836 (https://phabricator.wikimedia.org/T296561) [20:56:53] (03PS7) 10Andrew Bogott: Add manifests for openstack Designate version Xena [puppet] - 10https://gerrit.wikimedia.org/r/824837 (https://phabricator.wikimedia.org/T296561) [20:56:55] (03PS7) 10Andrew Bogott: Add manifests for Openstack Cinder Xena [puppet] - 10https://gerrit.wikimedia.org/r/824838 (https://phabricator.wikimedia.org/T296561) [20:56:57] (03PS9) 10Andrew Bogott: Openstack Trove: replace file overlays with patch files for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) [20:56:59] (03PS9) 10Andrew Bogott: Keystone: add manifest for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561) [20:57:01] (03PS9) 10Andrew Bogott: Keystone: replace file overlay with patch file for Xena [puppet] - 10https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561) [20:57:03] (03PS9) 10Andrew Bogott: Add manifest for Openstack 
Heat version Xena [puppet] - https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561) [20:57:05] (PS9) Andrew Bogott: Add manifest for openstack barbican version Xena [puppet] - https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561) [20:57:07] (PS9) Andrew Bogott: Add manifests for Openstack Nova version Xena [puppet] - https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561) [20:57:09] (PS9) Andrew Bogott: Add manifest for openstack Placement service, version Xena [puppet] - https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561) [20:57:11] (PS9) Andrew Bogott: Add glance manifest for Openstack Xena [puppet] - https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561) [20:57:13] (PS9) Andrew Bogott: Neutron: add manifest for Xena services [puppet] - https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561) [20:57:15] (PS2) Andrew Bogott: Openstack Designate codfw1dev to Xena [puppet] - https://gerrit.wikimedia.org/r/824885 (https://phabricator.wikimedia.org/T296561) [20:57:17] (PS2) Andrew Bogott: Openstack codfw1dev to version Xena [puppet] - https://gerrit.wikimedia.org/r/824886 (https://phabricator.wikimedia.org/T296561) [21:01:56] (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1002/36865/cloudcontrol2004-dev.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/824886 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:04:39] (CR) Andrew Bogott: [C: +2] OpenStack: add files and templates for release Xena [puppet] - https://gerrit.wikimedia.org/r/824831 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:05:05] (PS7) Andrew Bogott: Trove: remove refs to cinder v2 api -- it was removed in X. 
[puppet] - https://gerrit.wikimedia.org/r/824832 (https://phabricator.wikimedia.org/T296561) [21:06:20] (PS8) Andrew Bogott: Trove: remove refs to cinder v2 api -- it was removed in X. [puppet] - https://gerrit.wikimedia.org/r/824832 (https://phabricator.wikimedia.org/T296561) [21:06:22] (PS7) Andrew Bogott: Add openstack serverpackages manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824833 (https://phabricator.wikimedia.org/T296561) [21:06:24] (PS7) Andrew Bogott: Add openstack client package manifests for Xena [puppet] - https://gerrit.wikimedia.org/r/824834 (https://phabricator.wikimedia.org/T296561) [21:06:26] (PS8) Andrew Bogott: Add Magnum manifest for OpenStack Xena [puppet] - https://gerrit.wikimedia.org/r/824836 (https://phabricator.wikimedia.org/T296561) [21:06:28] (PS8) Andrew Bogott: Add manifests for openstack Designate version Xena [puppet] - https://gerrit.wikimedia.org/r/824837 (https://phabricator.wikimedia.org/T296561) [21:06:30] (PS8) Andrew Bogott: Add manifests for Openstack Cinder Xena [puppet] - https://gerrit.wikimedia.org/r/824838 (https://phabricator.wikimedia.org/T296561) [21:06:32] (PS10) Andrew Bogott: Openstack Trove: replace file overlays with patch files for Xena [puppet] - https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) [21:06:34] (PS10) Andrew Bogott: Keystone: add manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561) [21:06:36] (PS10) Andrew Bogott: Keystone: replace file overlay with patch file for Xena [puppet] - https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561) [21:06:39] (PS10) Andrew Bogott: Add manifest for Openstack Heat version Xena [puppet] - https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561) [21:06:40] (PS10) Andrew Bogott: Add manifest for openstack barbican version Xena [puppet] - 
https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561) [21:06:43] (PS10) Andrew Bogott: Add manifests for Openstack Nova version Xena [puppet] - https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561) [21:06:45] (PS10) Andrew Bogott: Add manifest for openstack Placement service, version Xena [puppet] - https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561) [21:06:47] (PS10) Andrew Bogott: Add glance manifest for Openstack Xena [puppet] - https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561) [21:06:49] (PS10) Andrew Bogott: Neutron: add manifest for Xena services [puppet] - https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561) [21:06:51] (PS3) Andrew Bogott: Openstack Designate codfw1dev to Xena [puppet] - https://gerrit.wikimedia.org/r/824885 (https://phabricator.wikimedia.org/T296561) [21:06:53] (PS3) Andrew Bogott: Openstack codfw1dev to version Xena [puppet] - https://gerrit.wikimedia.org/r/824886 (https://phabricator.wikimedia.org/T296561) [21:06:55] (CR) CI reject: [V: -1] Openstack codfw1dev to version Xena [puppet] - https://gerrit.wikimedia.org/r/824886 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:08:01] (CR) Andrew Bogott: [C: +2] Add openstack client package manifests for Xena [puppet] - https://gerrit.wikimedia.org/r/824834 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:08:03] (CR) Andrew Bogott: [C: +2] Add openstack serverpackages manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824833 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:08:38] (CR) Andrew Bogott: [C: +2] Add manifests for openstack Designate version Xena [puppet] - https://gerrit.wikimedia.org/r/824837 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:08:51] (CR) Andrew 
Bogott: [C: +2] Add manifests for Openstack Cinder Xena [puppet] - https://gerrit.wikimedia.org/r/824838 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:09:20] (CR) Andrew Bogott: [C: +2] Add Magnum manifest for OpenStack Xena [puppet] - https://gerrit.wikimedia.org/r/824836 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:09:36] (CR) Andrew Bogott: [C: +2] Keystone: add manifest for Xena [puppet] - https://gerrit.wikimedia.org/r/824841 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:10:09] (CR) Andrew Bogott: [C: +2] Openstack Trove: replace file overlays with patch files for Xena [puppet] - https://gerrit.wikimedia.org/r/824840 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:49:44] (CR) Andrew Bogott: [C: +2] Trove: remove refs to cinder v2 api -- it was removed in X. [puppet] - https://gerrit.wikimedia.org/r/824832 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [21:50:28] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:50:44] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:53:33] (CR) Andrew Bogott: [C: +2] Keystone: replace file overlay with patch file for Xena [puppet] - https://gerrit.wikimedia.org/r/824842 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [22:07:54] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:10:14] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 
497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:18:20] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:19:56] (CR) Andrew Bogott: [C: +2] Neutron: add manifest for Xena services [puppet] - https://gerrit.wikimedia.org/r/824839 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [22:20:32] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48534 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:20:35] (CR) Andrew Bogott: [C: +2] Add manifest for openstack Placement service, version Xena [puppet] - https://gerrit.wikimedia.org/r/824846 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [22:20:37] (CR) Andrew Bogott: [C: +2] Add manifests for Openstack Nova version Xena [puppet] - https://gerrit.wikimedia.org/r/824845 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [22:20:39] (CR) Andrew Bogott: [C: +2] Add manifest for openstack barbican version Xena [puppet] - https://gerrit.wikimedia.org/r/824844 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [22:20:41] (CR) Andrew Bogott: [C: +2] Add manifest for Openstack Heat version Xena [puppet] - https://gerrit.wikimedia.org/r/824843 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [22:21:06] (CR) Andrew Bogott: [C: +2] Add glance manifest for Openstack Xena [puppet] - https://gerrit.wikimedia.org/r/824835 (https://phabricator.wikimedia.org/T296561) (owner: Andrew Bogott) [22:38:28] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [22:40:48] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) 
https://wikitech.wikimedia.org/wiki/Phabricator [23:54:18] (PS1) Tim Starling: SqlBagOStuff: Fix modtoken comparison [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824445 (https://phabricator.wikimedia.org/T315271) [23:56:01] (CR) Tim Starling: [C: +2] SqlBagOStuff: Fix modtoken comparison [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824445 (https://phabricator.wikimedia.org/T315271) (owner: Tim Starling)