[00:05:00] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:31] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:06:07] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [00:08:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P33170 and previous config saved to /var/cache/conftool/dbconfig/20220826-000807-ladsgroup.json [00:09:15] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:19] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P33171 and previous config saved to /var/cache/conftool/dbconfig/20220826-002313-ladsgroup.json [00:28:23] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:45] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:19] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2131 (T312160)', diff saved to https://phabricator.wikimedia.org/P33172 and previous config saved to /var/cache/conftool/dbconfig/20220826-003819-ladsgroup.json [00:38:25] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [00:51:27] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:06:04] (CR) Ryan Kemper: [C: +1] deployment-prep: remove defunct elastic hosts [puppet] - https://gerrit.wikimedia.org/r/826630 (https://phabricator.wikimedia.org/T316240) (owner: Bking) [01:15:27] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:16:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:25:05] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:29:13] (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:32:17] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:36:45] (JobUnavailable) firing: (2) 
Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:57] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [01:44:46] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (wiki_willy) a:Jclark-ctr [01:45:25] SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T316194 (wiki_willy) a:Papaul [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:01] SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T316194 (Papaul) Open→Declined This is duplicate of T314509 [01:51:22] SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) I asked @Jclark-ctr to run the 40G fiber for row C and row D and he said he will get it done sometimes next week. Once the fiber in place I will update... 
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:56:25] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:06:03] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:06:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:03] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:15:39] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:16:27] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[02:23:41] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:27:43] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:30:55] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:42:07] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [02:51:41] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:02:09] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:06:11] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:10:59] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:14:29] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:16:35] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:16:53] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:25:25] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:39:53] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:51:55] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:56:45] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:59:57] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:02:21] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:06:23] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:09:59] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:11:57] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:16:47] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:17:13] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:18:25] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:24:01] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:25:53] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:29:19] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:31] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:31:17] RECOVERY - Check systemd 
state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:43] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:43] RECOVERY - Check systemd state on dse-k8s-worker1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:42:33] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [04:43:17] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:43:45] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:44:57] PROBLEM - Check systemd state on dse-k8s-worker1007 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:41] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:54:35] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:00:39] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:04:54] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [05:05:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [05:06:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling for maintenance', diff saved to https://phabricator.wikimedia.org/P33173 and previous config saved to /var/cache/conftool/dbconfig/20220826-050652-ladsgroup.json [05:07:05] PROBLEM - Check systemd state on ms-be1054 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:55] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:08:42] (PS1) Marostegui: db1185: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826689 [05:09:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P33174 and previous config saved to /var/cache/conftool/dbconfig/20220826-050906-ladsgroup.json [05:10:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P33175 and previous config saved to /var/cache/conftool/dbconfig/20220826-051039-ladsgroup.json [05:11:29] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:13:50] (CR) Marostegui: [C: +2] db1185: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826689 (owner: Marostegui) [05:15:11] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:14] (PS1) Marostegui: instances.yaml: Add db1185 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826690 (https://phabricator.wikimedia.org/T313569) [05:16:09] (CR) Marostegui: [C: +2] instances.yaml: Add db1185 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826690 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:17:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1185 for the first time in s5 T313569', diff saved to https://phabricator.wikimedia.org/P33176 and previous config saved to /var/cache/conftool/dbconfig/20220826-051721-marostegui.json [05:17:27] T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569 [05:18:28] (PS1) Marostegui: db1192: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826691 (https://phabricator.wikimedia.org/T313569) [05:18:45] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:19:23] (CR) Marostegui: [C: +2] db1192: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826691 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:20:40] (PS1) Marostegui: instances.yaml: Add db1192 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826693 (https://phabricator.wikimedia.org/T313569) [05:21:36] (CR) Marostegui: [C: +2] instances.yaml: Add db1192 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826693 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:22:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P33177 and previous config saved to 
/var/cache/conftool/dbconfig/20220826-052219-ladsgroup.json [05:22:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1192 to dbctl T313569', diff saved to https://phabricator.wikimedia.org/P33178 and previous config saved to /var/cache/conftool/dbconfig/20220826-052233-marostegui.json [05:22:37] T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569 [05:23:38] (PS1) Marostegui: db1193: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826694 (https://phabricator.wikimedia.org/T313569) [05:24:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P33179 and previous config saved to /var/cache/conftool/dbconfig/20220826-052410-ladsgroup.json [05:24:17] (CR) Marostegui: [C: +2] db1193: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826694 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:25:33] (PS1) Marostegui: instances.yaml: Add db1193 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826695 (https://phabricator.wikimedia.org/T313569) [05:25:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P33180 and previous config saved to /var/cache/conftool/dbconfig/20220826-052544-ladsgroup.json [05:26:21] (CR) Marostegui: [C: +2] instances.yaml: Add db1193 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826695 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:27:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1193 to dbctl T313569', diff saved to https://phabricator.wikimedia.org/P33181 and previous config saved to /var/cache/conftool/dbconfig/20220826-052715-marostegui.json [05:29:13] (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:30:55] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:33:00] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:35:07] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:36:05] (PS1) Marostegui: db1194: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826696 (https://phabricator.wikimedia.org/T313569) [05:36:51] (CR) Marostegui: [C: +2] db1194: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826696 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:37:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P33182 and previous config saved to /var/cache/conftool/dbconfig/20220826-053724-ladsgroup.json [05:38:05] (PS1) Marostegui: instances.yaml: Add db1194 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826697 (https://phabricator.wikimedia.org/T313569) [05:38:45] (CR) Marostegui: [C: +2] instances.yaml: Add db1194 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826697 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:38:53] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:39:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P33183 and previous config saved to /var/cache/conftool/dbconfig/20220826-053915-ladsgroup.json [05:39:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1194 for the first time in s7 T313569', diff saved to https://phabricator.wikimedia.org/P33184 and previous config saved to /var/cache/conftool/dbconfig/20220826-053954-marostegui.json [05:39:58] T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569 [05:40:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119', diff saved to https://phabricator.wikimedia.org/P33185 and previous config saved to /var/cache/conftool/dbconfig/20220826-054023-root.json [05:40:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P33186 and previous config saved to /var/cache/conftool/dbconfig/20220826-054048-ladsgroup.json [05:41:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33188 and previous config saved to /var/cache/conftool/dbconfig/20220826-054102-root.json [05:43:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33189 and previous config saved to /var/cache/conftool/dbconfig/20220826-054334-root.json [05:43:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33190 and previous config saved to /var/cache/conftool/dbconfig/20220826-054356-root.json [05:43:59] RECOVERY - Check systemd state on ms-be1054 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:45:15] (PS1) Marostegui: wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/826698 (https://phabricator.wikimedia.org/T316186) [05:45:58] (CR) Ladsgroup: [C: +1] wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/826698 (https://phabricator.wikimedia.org/T316186) (owner: Marostegui) [05:46:17] (CR) Marostegui: [C: +2] wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/826698 (https://phabricator.wikimedia.org/T316186) (owner: Marostegui) [05:46:23] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:47:03] !log Failover m2-master [05:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33191 and previous config saved to /var/cache/conftool/dbconfig/20220826-054722-root.json [05:51:55] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [05:52:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P33192 and previous config saved to /var/cache/conftool/dbconfig/20220826-055229-ladsgroup.json [05:54:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P33193 and previous config saved to /var/cache/conftool/dbconfig/20220826-055420-ladsgroup.json [05:55:54] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P33194 and previous config saved to /var/cache/conftool/dbconfig/20220826-055553-ladsgroup.json [05:56:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33195 and previous config saved to /var/cache/conftool/dbconfig/20220826-055607-root.json [05:58:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33196 and previous config saved to /var/cache/conftool/dbconfig/20220826-055839-root.json [05:58:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:59:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33197 and previous config saved to /var/cache/conftool/dbconfig/20220826-055900-root.json [05:59:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [05:59:58] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:01:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [06:01:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [06:01:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33198 and 
previous config saved to /var/cache/conftool/dbconfig/20220826-060146-ladsgroup.json [06:02:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33199 and previous config saved to /var/cache/conftool/dbconfig/20220826-060203-ladsgroup.json [06:02:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 2%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33200 and previous config saved to /var/cache/conftool/dbconfig/20220826-060227-root.json [06:05:36] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:06:58] (CR) Ayounsi: Add BGP neighbor data for the new dse-k8s cluster (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) (owner: Btullis) [06:06:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33201 and previous config saved to /var/cache/conftool/dbconfig/20220826-060658-ladsgroup.json [06:07:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P33202 and previous config saved to /var/cache/conftool/dbconfig/20220826-060734-ladsgroup.json [06:11:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33203 and previous config saved to /var/cache/conftool/dbconfig/20220826-061112-root.json [06:13:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33204 and previous config saved to 
/var/cache/conftool/dbconfig/20220826-061344-root.json [06:14:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33205 and previous config saved to /var/cache/conftool/dbconfig/20220826-061405-root.json [06:17:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 3%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33206 and previous config saved to /var/cache/conftool/dbconfig/20220826-061732-root.json [06:19:48] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:22:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P33207 and previous config saved to /var/cache/conftool/dbconfig/20220826-062205-ladsgroup.json [06:26:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33208 and previous config saved to /var/cache/conftool/dbconfig/20220826-062616-root.json [06:27:56] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:28:10] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:28:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33209 and previous config saved to /var/cache/conftool/dbconfig/20220826-062849-root.json [06:29:10] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33210 and previous config saved to /var/cache/conftool/dbconfig/20220826-062910-root.json [06:32:19] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:32:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33211 and previous config saved to /var/cache/conftool/dbconfig/20220826-063237-root.json [06:34:04] (03PS1) 10Andrea Denisse: netmon: Configure logrotate to rotate logs as the 'librenms' user. [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) [06:37:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P33212 and previous config saved to /var/cache/conftool/dbconfig/20220826-063711-ladsgroup.json [06:38:00] (03CR) 10Andrea Denisse: "This issue happens because the directory belongs to the 'librenms' group. 
The directory is not world writable (drwxrwxr-x www-data librenm" [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [06:39:57] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [06:41:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33213 and previous config saved to /var/cache/conftool/dbconfig/20220826-064121-root.json [06:43:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33214 and previous config saved to /var/cache/conftool/dbconfig/20220826-064353-root.json [06:44:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33215 and previous config saved to /var/cache/conftool/dbconfig/20220826-064414-root.json [06:47:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33216 and previous config saved to /var/cache/conftool/dbconfig/20220826-064742-root.json [06:49:08] (03CR) 10Muehlenhoff: Stop reporting releng images to debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826211 (owner: 10Muehlenhoff) [06:51:07] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33217 
and previous config saved to /var/cache/conftool/dbconfig/20220826-065217-ladsgroup.json [06:54:19] (03CR) 10ArielGlenn: [C: 03+2] P:dumps: remove ipv4/ipv6 separation from internal_rsync_clients [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah) [06:55:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33218 and previous config saved to /var/cache/conftool/dbconfig/20220826-065533-ladsgroup.json [06:56:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33219 and previous config saved to /var/cache/conftool/dbconfig/20220826-065626-root.json [06:58:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33220 and previous config saved to /var/cache/conftool/dbconfig/20220826-065858-root.json [06:59:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33221 and previous config saved to /var/cache/conftool/dbconfig/20220826-065919-root.json [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220826T0700) [07:00:11] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10MoritzMuehlenhoff) Looking at CPU and disk usage (currently 150ish since some data is now on Swift) and the desired RAM, servers with "config A" would do just fine. 
[07:00:26] (03PS2) 10ArielGlenn: add php7.4 install to the snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/824690 (https://phabricator.wikimedia.org/T271736) [07:01:17] (03CR) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [07:01:29] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:29] (03CR) 10ArielGlenn: [C: 03+2] add php7.4 install to the snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/824690 (https://phabricator.wikimedia.org/T271736) (owner: 10ArielGlenn) [07:02:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33222 and previous config saved to /var/cache/conftool/dbconfig/20220826-070247-root.json [07:03:47] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:04:48] (03CR) 10Muehlenhoff: [C: 03+1] netmon: Reserve UID/GID for the LibreNMS system user. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [07:05:17] (03PS5) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [07:05:41] (03PS6) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. 
[puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [07:05:44] (03PS1) 10Ladsgroup: Stop writing to old templatelinks fields in commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826773 (https://phabricator.wikimedia.org/T312865) [07:06:51] (03CR) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [07:08:11] (03PS7) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [07:08:32] (03PS8) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [07:10:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P33223 and previous config saved to /var/cache/conftool/dbconfig/20220826-071039-ladsgroup.json [07:11:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33224 and previous config saved to /var/cache/conftool/dbconfig/20220826-071131-root.json [07:14:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33225 and previous config saved to /var/cache/conftool/dbconfig/20220826-071403-root.json [07:14:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33226 and previous config saved to /var/cache/conftool/dbconfig/20220826-071424-root.json [07:16:12] (03PS9) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. 
[puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [07:16:14] (03PS3) 10Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) [07:16:25] 10SRE, 10serviceops, 10serviceops-collab, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Ladsgroup) I add serviceops, I know it's a bit of stretch but that's the one that makes the most sense. Please change to another team if you think there is a better... [07:16:40] 10SRE, 10serviceops, 10serviceops-collab, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Ladsgroup) p:05Triage→03Medium [07:16:46] (03CR) 10Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. (0315 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [07:17:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33227 and previous config saved to /var/cache/conftool/dbconfig/20220826-071751-root.json [07:18:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [07:22:15] (03CR) 10Vgutierrez: [C: 03+2] Increase roll-out of query-sorting to 15% [puppet] - 10https://gerrit.wikimedia.org/r/826601 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [07:23:26] denisse|m: the alerts "CRITICAL - degraded: The following units failed: logrotate.service" will be solved when T315393 is resolved, did I understand the ticket right?
[07:23:26] T315393: Logrotate is unable to rotate LibreNMS logs in the netmon instances due to insuficient permissions to read and write log files in /var/log/ - https://phabricator.wikimedia.org/T315393 [07:23:36] !log Increase roll-out of query-sorting to 15% - T314868 [07:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:41] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [07:24:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2025.codfw.wmnet with OS bullseye [07:24:17] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bullseye [07:24:17] jynus: Yes, the patch is sent here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/826771 [07:24:45] cool, thanks for the fix! I was looking at ongoing alerts [07:25:15] jynus: Sure thing, can I add you as a reviewer? :) [07:25:26] It may be a good idea so we can merge it now. ^^ [07:25:29] sure [07:25:34] let me see [07:25:42] Possibly I could add you and moritzm as reviewers. 
[07:25:45] RECOVERY - Check systemd state on dse-k8s-worker1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:25:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P33228 and previous config saved to /var/cache/conftool/dbconfig/20220826-072545-ladsgroup.json [07:26:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33229 and previous config saved to /var/cache/conftool/dbconfig/20220826-072635-root.json [07:27:11] jynus, moritzm I added a comment describing the issue and the rationale for the fix in here: https://phabricator.wikimedia.org/T315393#8187595 [07:27:52] thanks, that helps, it's been a long time since I edited logrotate config [07:29:03] (03CR) 10Jcrespo: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [07:29:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33230 and previous config saved to /var/cache/conftool/dbconfig/20220826-072908-root.json [07:29:25] any worry, moritzm, regarding permissions ^ [07:29:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33231 and previous config saved to /var/cache/conftool/dbconfig/20220826-072929-root.json [07:30:29] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:31:44] the pcc may be referring to the wrong patch, though [07:32:02] Oh, 
let me take a double look at PCC. [07:32:48] I ran https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36986/ [07:32:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33232 and previous config saved to /var/cache/conftool/dbconfig/20220826-073256-root.json [07:33:28] jynus: Thanks, I ran it too. :P Let me stop this job: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36987/ [07:33:38] 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10ori) >>! In T315398#8167048, @ori wrote: > So 'powersave' with EPP=0 gives a broader range of operating frequencies than 'performance'. We should see if in th... [07:34:08] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36986/" [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [07:34:48] QQ jynus, do you know why the PCC says "no change"? [07:35:09] More specifically, I'm curious as to why it won't change the config file with that patch. [07:35:32] I'm not sure if puppet compiler tracks imported files, or only puppet resources [07:35:47] so it won't be shown in the diff, but let me double check the code referring to it [07:36:03] I have some confusion about how to build event-driven systems on Wikimedia data. Let's say I want to build a production-cluster service with a local replica of wikidata summaries in every language, to prevent causing a high load on the wikidata query service. Do I watch the resource_change kafka topic, make an API request to wikidata to fetch every changed item, and then cache the summary [07:36:09] text? This would shift the load onto the API server so I don't feel good about the idea.
[07:37:50] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [07:38:05] PROBLEM - Check systemd state on dse-k8s-worker1006 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:38:11] denisse|m: yeah, I think it looks good to me [07:38:30] Thanks Jaime. :) [07:38:49] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Configure logrotate to rotate logs as the 'librenms' user. [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [07:38:54] merge and hopefully we can get rid of those 3 alarms! :-D [07:39:38] Merged, running puppet in the netmon instances. [07:39:55] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [07:40:36] (03PS4) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 [07:40:40] (03PS11) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [07:40:51] the actual puppet run should show the diff, however [07:40:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33233 and previous config saved to /var/cache/conftool/dbconfig/20220826-074052-ladsgroup.json [07:40:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [07:41:15] jynus: I confirm that the files are updated after puppet run.
😉 [07:41:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [07:41:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2025.codfw.wmnet with reason: host reimage [07:41:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33234 and previous config saved to /var/cache/conftool/dbconfig/20220826-074126-ladsgroup.json [07:41:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33235 and previous config saved to /var/cache/conftool/dbconfig/20220826-074140-root.json [07:42:01] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:42:02] yeah, so I remembered there is some limitation with the compiler, but I didn't remember the details- it won't show diffs of imported files [07:42:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33236 and previous config saved to /var/cache/conftool/dbconfig/20220826-074252-ladsgroup.json [07:43:09] I confirm that the alert is gone with the new config. [07:43:17] nice! 
[07:44:00] (03PS1) 10Majavah: P:toolforge:k8s:haproxy: increase 404 handler timeout [puppet] - 10https://gerrit.wikimedia.org/r/826779 [07:44:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33237 and previous config saved to /var/cache/conftool/dbconfig/20220826-074412-root.json [07:44:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33238 and previous config saved to /var/cache/conftool/dbconfig/20220826-074434-root.json [07:44:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2025.codfw.wmnet with reason: host reimage [07:45:56] (03CR) 10Ladsgroup: Schedule image suggestions notifications (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [07:46:41] as we were reminded at the last meeting, hopefully we can have a cleaner alerts (by fixing errors or with acked/downtimed criticals) to improve the signal/noise ratio [07:47:07] *alerts dashboard [07:47:48] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:48:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33239 and previous config saved to /var/cache/conftool/dbconfig/20220826-074801-root.json [07:48:55] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [07:49:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T316186)', diff
saved to https://phabricator.wikimedia.org/P33240 and previous config saved to /var/cache/conftool/dbconfig/20220826-074905-ladsgroup.json [07:49:12] (03CR) 10David Caro: [C: 03+2] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/826779 (owner: 10Majavah) [07:58:03] (03PS12) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [07:58:05] (03PS1) 10David Caro: tox: use the default python3 for the system [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/826782 [08:01:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2025.codfw.wmnet with OS bullseye [08:01:21] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bullseye completed: - ganeti2025 (**PASS**) - Downtimed on... 
[08:02:20] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P33241 and previous config saved to /var/cache/conftool/dbconfig/20220826-080411-ladsgroup.json [08:07:14] (03PS1) 10Ori: Increase roll-out of query-sorting to 30% [puppet] - 10https://gerrit.wikimedia.org/r/826783 (https://phabricator.wikimedia.org/T314868) [08:09:10] (03PS5) 10Vgutierrez: trafficserver: Get rid of disable_coalescing() in Lua [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911) [08:09:12] (03PS1) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 [08:10:08] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:19] (03PS6) 10Matthias Mullie: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) [08:10:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [08:12:52] (03CR) 10Matthias Mullie: Schedule image suggestions notifications (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [08:13:32] (03CR) 10CI reject: [V: 04-1] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 (owner: 10Vgutierrez) [08:15:12] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:16:42] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P33242 and previous config saved to /var/cache/conftool/dbconfig/20220826-081918-ladsgroup.json [08:19:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [08:20:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [08:20:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2025.codfw.wmnet to cluster codfw and group D [08:21:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2025.codfw.wmnet to cluster codfw and group D [08:22:58] (03PS2) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 [08:24:11] (03PS1) 10Vgutierrez: Revert "trafficserver: Enable origin coalescing in cp600[78]" [puppet] - 10https://gerrit.wikimedia.org/r/826616 [08:24:45] (03PS7) 10Matthias Mullie: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) [08:25:09] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10ayounsi) One off script for that, tested on netbox-next: https://netbox-next.wikimedia.org/ipam/fhrp-groups/ `lang=python,name=Move VRRP IPs to FHRP... 
[08:26:59] (03CR) 10CI reject: [V: 04-1] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 (owner: 10Vgutierrez) [08:27:09] (03CR) 10Ayounsi: [C: 03+1] Add btullis to users to allow for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [08:31:09] (03PS3) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 [08:34:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33243 and previous config saved to /var/cache/conftool/dbconfig/20220826-083424-ladsgroup.json [08:35:50] (03PS5) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 [08:39:13] (03CR) 10CI reject: [V: 04-1] Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff) [08:43:26] (03PS6) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 [08:43:37] (03PS1) 10Majavah: openstack: keystone: enable app credentials on codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/826792 (https://phabricator.wikimedia.org/T294195) [08:44:29] (03CR) 10Ori: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36988/console" [puppet] - 10https://gerrit.wikimedia.org/r/826783 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [08:44:39] (03PS2) 10Majavah: openstack: keystone: enable app credentials on codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/826792 (https://phabricator.wikimedia.org/T294195) [08:44:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1098:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33244 and previous config saved to /var/cache/conftool/dbconfig/20220826-084441-ladsgroup.json [08:45:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36990/console" [puppet] - 10https://gerrit.wikimedia.org/r/826792 (https://phabricator.wikimedia.org/T294195) (owner: 10Majavah) [08:46:04] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:46:43] (03CR) 10Vgutierrez: [C: 03+2] Increase roll-out of query-sorting to 30% [puppet] - 10https://gerrit.wikimedia.org/r/826783 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [08:47:08] !log Increase roll-out of query-sorting to 30% - T314868 [08:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:12] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [08:49:14] (03CR) 10FNegri: ceph.bootstrap_and_add: add support to change the osd class type (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [08:55:05] (03CR) 10FNegri: [C: 03+1] "LGTM, but I'm still pretty new to our OpenStack setup and to OpenStack in general, so I'd love another pair of eyes." 
[puppet] - 10https://gerrit.wikimedia.org/r/826792 (https://phabricator.wikimedia.org/T294195) (owner: 10Majavah) [08:56:14] 10SRE, 10Search-Console-access-request: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10SCherukuwada) [08:59:08] (03PS4) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 [08:59:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P33245 and previous config saved to /var/cache/conftool/dbconfig/20220826-085947-ladsgroup.json [09:00:47] (03PS1) 10Muehlenhoff: Allow cookbooks to handle restarts based on running one of more commands [cookbooks] - 10https://gerrit.wikimedia.org/r/826798 [09:00:49] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Get rid of disable_coalescing() in Lua (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [09:02:45] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:05:25] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:08:51] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] kubestage: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826229 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [09:11:47] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:14:54] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P33246 and previous config saved to /var/cache/conftool/dbconfig/20220826-091454-ladsgroup.json [09:18:10] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] ml-staging: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [09:20:06] (03PS3) 10Clément Goubert: ml-staging: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977) [09:20:13] (03CR) 10Clément Goubert: [V: 03+2] ml-staging: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [09:21:49] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:23:01] (03PS2) 10Vgutierrez: Revert "trafficserver: Enable origin coalescing in cp600[78]" [puppet] - 10https://gerrit.wikimedia.org/r/826616 (https://phabricator.wikimedia.org/T315911) [09:24:28] (03PS1) 10Jcrespo: mariadb: Fix host db2130 removed from puppet by mistake [puppet] - 10https://gerrit.wikimedia.org/r/826802 [09:26:03] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] kubernetes: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826236 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [09:29:13] (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:30:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T316186)', diff saved to 
https://phabricator.wikimedia.org/P33247 and previous config saved to /var/cache/conftool/dbconfig/20220826-093000-ladsgroup.json [09:30:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [09:30:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [09:30:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33248 and previous config saved to /var/cache/conftool/dbconfig/20220826-093034-ladsgroup.json [09:30:35] (PS1) Cathal Mooney: Sub-delegation of reverse DNS entries for 185.15.57.16/29 to WMCS [dns] - https://gerrit.wikimedia.org/r/826803 (https://phabricator.wikimedia.org/T315955) [09:30:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33249 and previous config saved to /var/cache/conftool/dbconfig/20220826-093051-ladsgroup.json [09:32:31] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:32:37] (PS2) Cathal Mooney: Sub-delegation of reverse DNS entries for 185.15.57.16/29 to WMCS [dns] - https://gerrit.wikimedia.org/r/826803 (https://phabricator.wikimedia.org/T315955) [09:32:51] (CR) Vgutierrez: [C: +2] Revert "trafficserver: Enable origin coalescing in cp600[78]" [puppet] - https://gerrit.wikimedia.org/r/826616 (https://phabricator.wikimedia.org/T315911) (owner: Vgutierrez) [09:33:31] !log disable origin coalescing in cp6007 and cp6008 - T315911 [09:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:36] T315911: ATS Read While Writer feature is wrongly configured - 
https://phabricator.wikimedia.org/T315911 [09:35:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33250 and previous config saved to /var/cache/conftool/dbconfig/20220826-093558-ladsgroup.json [09:38:45] (CR) Hashar: Stop reporting releng images to debmonitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826211 (owner: Muehlenhoff) [09:39:43] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:40:03] (CR) Hashar: [C: +1] Stop reporting releng images to debmonitor [puppet] - https://gerrit.wikimedia.org/r/826211 (owner: Muehlenhoff) [09:41:17] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:43:38] SRE, DynamicPageList (Wikimedia), serviceops, Patch-For-Review, and 7 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (Ladsgroup) p:High→Medium We added max execution time of ten seconds to all DPL queries, that'd mitigate part of the risk, so I'm redu... 
[09:43:54] (CR) Volans: "FYI comment inline" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [09:44:36] (CR) Clément Goubert: [V: +1 C: +2] ml-serve: cleanup profile::docker::storage [puppet] - https://gerrit.wikimedia.org/r/826238 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert) [09:44:38] (PS5) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/826785 [09:46:21] (CR) Marostegui: [C: +1] mariadb: Fix host db2130 removed from puppet by mistake [puppet] - https://gerrit.wikimedia.org/r/826802 (owner: Jcrespo) [09:51:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P33251 and previous config saved to /var/cache/conftool/dbconfig/20220826-095104-ladsgroup.json [09:51:21] (PS2) Btullis: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) [09:51:39] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [09:51:56] (CR) Btullis: Add BGP neighbor data for the new dse-k8s cluster (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) (owner: Btullis) [09:53:59] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:17] (CR) David 
Caro: tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [09:55:39] (PS4) Clément Goubert: C:profile::docker::storage removal and cleanup [puppet] - https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) [09:56:13] !log testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785 in cp6016 [09:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:17] (PS2) Jcrespo: mariadb: Fix host db2130 removed from puppet by mistake [puppet] - https://gerrit.wikimedia.org/r/826802 [09:56:23] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [09:57:03] there was an increase on codfw requests, this alert will likely be noisy as codfw traffic starts ramping up [09:57:22] (CR) Clément Goubert: [C: +2] "Just a rebase" [puppet] - https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert) [09:57:34] (CR) Jcrespo: [C: +2] mariadb: Fix host db2130 removed from puppet by mistake [puppet] - https://gerrit.wikimedia.org/r/826802 (owner: Jcrespo) [09:59:48] (PS1) Muehlenhoff: Starting with Bullseye the systemd unit for systemd-logind uses ProtectSystem=strict, which doesn't work with HDFS and results in a failing systemd-logind service. 
[puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) [10:01:07] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:22] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) (owner: Muehlenhoff) [10:03:36] (CR) CI reject: [V: -1] Starting with Bullseye the systemd unit for systemd-logind uses ProtectSystem=strict, which doesn't work with HDFS and results in a failing systemd-logind service. [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) (owner: Muehlenhoff) [10:04:47] SRE, Cloud-Services, Infrastructure-Foundations, netops, Patch-For-Review: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (cmooney) Added above patch to delegate this range to the WMCS name servers. I hadn't checked the naming convention previously, I do actually... [10:06:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P33252 and previous config saved to /var/cache/conftool/dbconfig/20220826-100611-ladsgroup.json [10:06:34] (CR) Volans: "reply inline" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [10:09:40] (CR) Hashar: [C: +1] gerrit: allow nist kex algorithms on OpenSsh server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: Hashar) [10:10:42] (CR) Ayounsi: [C: +1] "All good! 
I also checked the generated config locally with Junoser" [homer/public] - https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) (owner: Btullis) [10:10:46] (CR) David Caro: tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [10:11:41] (PS2) Muehlenhoff: Exclude /mnt from systemd-logind restrictions on Bullseye and later [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) [10:12:55] (CR) Muehlenhoff: [C: +1] gerrit: allow nist kex algorithms on OpenSsh server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: Hashar) [10:13:12] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) (owner: Muehlenhoff) [10:13:30] !log stop testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785 in cp6016 [10:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:14] (PS1) JMeybohm: Update calico to v3.23.3 [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) [10:15:36] (CR) CI reject: [V: -1] Update calico to v3.23.3 [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [10:21:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33253 and previous config saved to /var/cache/conftool/dbconfig/20220826-102117-ladsgroup.json [10:21:31] SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Auth extremely slow on clouddumps100[12] - https://phabricator.wikimedia.org/T316123 (MoritzMuehlenhoff) Can you give 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/826806/ a shot on clouddumps? It should addre... [10:23:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33254 and previous config saved to /var/cache/conftool/dbconfig/20220826-102334-ladsgroup.json [10:25:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33255 and previous config saved to /var/cache/conftool/dbconfig/20220826-102510-ladsgroup.json [10:25:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:25:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:29:45] RECOVERY - Check systemd state on dse-k8s-worker1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:54] (CR) Btullis: [C: +1] "LGTM, thanks." [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) (owner: Muehlenhoff) [10:33:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:33:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:33:50] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. 
[10:36:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [10:36:55] PROBLEM - Check systemd state on dse-k8s-worker1005 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [10:37:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T316186)', diff saved to https://phabricator.wikimedia.org/P33256 and previous config saved to /var/cache/conftool/dbconfig/20220826-103707-ladsgroup.json [10:43:00] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff) [10:44:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. 
[10:44:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T316186)', diff saved to https://phabricator.wikimedia.org/P33257 and previous config saved to /var/cache/conftool/dbconfig/20220826-104427-ladsgroup.json [10:44:31] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff) [10:45:45] (PS6) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338) [10:46:22] SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeteadly - https://phabricator.wikimedia.org/T316337 (jcrespo) [10:47:16] SRE, Traffic, Patch-For-Review: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 (Vgutierrez) [10:47:21] SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeteadly - https://phabricator.wikimedia.org/T316337 (Vgutierrez) [10:47:39] (PS5) Hashar: gerrit: allow nist kex algorithms on OpenSsh server [puppet] - https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) [10:47:56] SRE, Traffic, Patch-For-Review: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 (Vgutierrez) An initial test of https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785/6/modules/profile/files/trafficserver/default.lua (PS6) in cp6016 triggered T316337 [10:48:23] (CR) Hashar: gerrit: allow nist kex algorithms on OpenSsh server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: Hashar) [10:48:41] I am so overengineering things some time :) [10:51:12] SRE, Traffic, Patch-For-Review: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 
(Vgutierrez) Open→In progress p:Triage→Medium [10:55:04] (CR) FNegri: "If I understand correctly, the advantage of this patch is that running 'tox' locally becomes faster because only one Python version is use" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [10:56:30] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: Hashar) [10:59:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P33258 and previous config saved to /var/cache/conftool/dbconfig/20220826-105934-ladsgroup.json [10:59:42] SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeteadly - https://phabricator.wikimedia.org/T316337 (jcrespo) Preliminary working doc: https://docs.google.com/document/d/1Ka9MQB8OwdzAzJVfZuaIGo5VfnyRNRr_WxLPZ6YFMkE [11:00:51] (CR) David Caro: tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [11:12:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33259 and previous config saved to /var/cache/conftool/dbconfig/20220826-111234-root.json [11:12:46] SRE, Patch-For-Review, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (ori) Actually, let me not step on your toes. But if you can tolerate a short extension of this task, I would very much like to see this setting tested. I thin... 
[11:13:43] SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Marostegui) [11:14:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P33260 and previous config saved to /var/cache/conftool/dbconfig/20220826-111440-ladsgroup.json [11:15:09] (CR) FNegri: tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [11:16:32] SRE, Cloud-Services, Infrastructure-Foundations, netops, Patch-For-Review: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (cmooney) Also just a note on the setup of the WMCS DNS in general. It seems BIND won't resolve any of these names because the CNAMEs on the... [11:18:01] (CR) FNegri: [C: +1] tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [11:18:37] (CR) David Caro: tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [11:18:40] (CR) FNegri: [C: +1] tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [11:19:21] !log uploaded intel-microcode 3.20220510.1~wmf9u1 to apt.wikimedia.org [11:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:05] (CR) FNegri: [C: +1] tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro) [11:27:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33261 and previous config saved to 
/var/cache/conftool/dbconfig/20220826-112739-root.json [11:29:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T316186)', diff saved to https://phabricator.wikimedia.org/P33262 and previous config saved to /var/cache/conftool/dbconfig/20220826-112946-ladsgroup.json [11:29:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:30:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:33:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:33:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:33:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33263 and previous config saved to /var/cache/conftool/dbconfig/20220826-113347-ladsgroup.json [11:35:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33264 and previous config saved to /var/cache/conftool/dbconfig/20220826-113511-ladsgroup.json [11:37:18] !log installing intel-microcode updates on stretch hosts [11:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33265 and previous config saved to /var/cache/conftool/dbconfig/20220826-114008-ladsgroup.json [11:42:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33266 and previous config saved to 
/var/cache/conftool/dbconfig/20220826-114243-root.json [11:51:58] (PS1) Clément Goubert: kubernetes: finish profile::docker::storage cleanup [puppet] - https://gerrit.wikimedia.org/r/826827 (https://phabricator.wikimedia.org/T315977) [11:53:00] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36991/console" [puppet] - https://gerrit.wikimedia.org/r/826827 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert) [11:53:39] SRE, Patch-For-Review, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (tstarling) >>! In T315398#8187684, @ori wrote: >>>! In T315398#8167048, @ori wrote: >> So 'powersave' with EPP=0 gives a broader range of operating frequencie... [11:54:16] (CR) Clément Goubert: [V: +1] "The masters slipped through the cleanup, this fixes it." [puppet] - https://gerrit.wikimedia.org/r/826827 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert) [11:55:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P33267 and previous config saved to /var/cache/conftool/dbconfig/20220826-115514-ladsgroup.json [11:57:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33268 and previous config saved to /var/cache/conftool/dbconfig/20220826-115748-root.json [11:58:45] (PS1) Clément Goubert: kubestage: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826828 (https://phabricator.wikimedia.org/T316341) [12:00:13] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:13] PROBLEM - Check systemd state on maps2009 is CRITICAL: 
CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:19] (PS1) Clément Goubert: ml-staging: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826833 (https://phabricator.wikimedia.org/T316341) [12:10:08] (PS1) Muehlenhoff: prometheus-elasticsearch-exporter: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/826835 [12:10:18] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36993/console" [puppet] - https://gerrit.wikimedia.org/r/826833 (https://phabricator.wikimedia.org/T316341) (owner: Clément Goubert) [12:10:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P33269 and previous config saved to /var/cache/conftool/dbconfig/20220826-121021-ladsgroup.json [12:12:11] (PS1) Btullis: Add a helmfile configuration for the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) [12:12:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33270 and previous config saved to /var/cache/conftool/dbconfig/20220826-121253-root.json [12:14:22] (PS2) Btullis: Add a helmfile configuration for the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) [12:19:16] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Join ARIN waiting list to request additional IPv4 resources. 
- https://phabricator.wikimedia.org/T288342 (cmooney) a:cmooney [12:20:40] SRE, Infrastructure-Foundations, netops: Return AS43821 to RIPE - https://phabricator.wikimedia.org/T314471 (cmooney) In progress→Resolved This has been completed and records cleared up. [12:21:02] (PS1) Clément Goubert: kubernetes: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826840 (https://phabricator.wikimedia.org/T316341) [12:21:58] (PS1) Muehlenhoff: elasticsearch::tlsproxy: Unconditionally disable ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/826841 [12:22:01] SRE, Infrastructure-Foundations, netops: Complete testing of SONiC NOS / Dell network gear and write up - https://phabricator.wikimedia.org/T310901 (cmooney) Open→Resolved I'm going to close this task for now. If, as seems likely, we wish to deploy Dell as an alternate vendor in production w... [12:25:10] (CR) Btullis: "I've taken a copy of the ml-serve.yaml values to begin with, but removed some namespaces and edited the IP addresses etc for dse-k8s-eqiad" [deployment-charts] - https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: Btullis) [12:25:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33271 and previous config saved to /var/cache/conftool/dbconfig/20220826-122527-ladsgroup.json [12:26:07] (PS1) Clément Goubert: ml-serve: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826842 (https://phabricator.wikimedia.org/T316341) [12:26:35] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/826841 (owner: Muehlenhoff) [12:27:15] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33272 and previous config saved to /var/cache/conftool/dbconfig/20220826-122758-root.json [12:30:10] (PS1) FNegri: Add cloudcephosd1030 to the Ceph pool [puppet] - https://gerrit.wikimedia.org/r/826843 (https://phabricator.wikimedia.org/T314870) [12:31:22] SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (Aklapper) [12:31:44] (PS1) Btullis: We wish to upgrade datahub to version 0.8.43 [deployment-charts] - https://gerrit.wikimedia.org/r/826844 (https://phabricator.wikimedia.org/T316336) [12:31:51] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:31] (PS1) Clément Goubert: dse-k8s: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826845 (https://phabricator.wikimedia.org/T316341) [12:35:26] (PS1) Muehlenhoff: profile::maps::tlsproxy: Unconditionally disable ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/826847 [12:37:18] (PS1) Clément Goubert: deployment-server: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826849 (https://phabricator.wikimedia.org/T316341) [12:37:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33273 and previous config saved to /var/cache/conftool/dbconfig/20220826-123743-ladsgroup.json [12:41:44] (PS1) Clément Goubert: releases: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826852 (https://phabricator.wikimedia.org/T316341) [12:43:03] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33274 and previous config saved to /var/cache/conftool/dbconfig/20220826-124303-root.json [12:47:33] (PS1) Clément Goubert: builder: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826853 (https://phabricator.wikimedia.org/T316341) [12:48:14] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/826847 (owner: Muehlenhoff) [12:52:45] (PS1) Muehlenhoff: varnish::common: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/826855 [12:52:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P33275 and previous config saved to /var/cache/conftool/dbconfig/20220826-125250-ladsgroup.json [12:58:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33276 and previous config saved to /var/cache/conftool/dbconfig/20220826-125808-root.json [12:59:28] (PS1) Clément Goubert: R:profile::docker::engine::version removal and cleanup [puppet] - https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341) [13:00:04] (CR) Muehlenhoff: [C: +1] "Looks good, thanks." 
[puppet] - https://gerrit.wikimedia.org/r/826853 (https://phabricator.wikimedia.org/T316341) (owner: Clément Goubert) [13:02:34] (PS1) Muehlenhoff: mariadb::config: Remove old tmpfile hack [puppet] - https://gerrit.wikimedia.org/r/826858 [13:02:52] (PS1) JMeybohm: Run helm dependency build before packaging [docker-images/docker-report] - https://gerrit.wikimedia.org/r/826859 (https://phabricator.wikimedia.org/T316347) [13:03:03] (CR) Bking: [C: +2] deployment-prep: remove defunct elastic hosts [puppet] - https://gerrit.wikimedia.org/r/826630 (https://phabricator.wikimedia.org/T316240) (owner: Bking) [13:05:21] (PS2) Clément Goubert: builder: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826853 (https://phabricator.wikimedia.org/T316341) [13:07:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P33277 and previous config saved to /var/cache/conftool/dbconfig/20220826-130756-ladsgroup.json [13:08:34] (CR) Clément Goubert: kubestage: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826828 (https://phabricator.wikimedia.org/T316341) (owner: Clément Goubert) [13:09:29] SRE, Infrastructure-Foundations, Traffic-Icebox: externally-hosted NEL report forwarders for more timely report reception - https://phabricator.wikimedia.org/T292870 (ayounsi) [13:11:41] (CR) Andrew Bogott: [C: +2] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/826792 (https://phabricator.wikimedia.org/T294195) (owner: 10Majavah) [13:13:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33278 and previous config saved to /var/cache/conftool/dbconfig/20220826-131312-root.json [13:14:46] 10SRE, 10Infrastructure-Foundations, 10netops: Standardize VRRP group IDs - https://phabricator.wikimedia.org/T260363 (10ayounsi) [13:14:50] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10ayounsi) [13:16:01] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10Andrew) >>! In T315955#8188444, @cmooney wrote: > Also just a note on the setup of the WMCS DNS in general. > > It seems BIND won't resolve... 
[13:16:36] (03PS1) 10Clément Goubert: ml-staging: finish profile::docker::storage cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826860 (https://phabricator.wikimedia.org/T315977) [13:17:08] (03CR) 10Marostegui: [C: 03+1] mariadb::config: Remove old tmpfile hack [puppet] - 10https://gerrit.wikimedia.org/r/826858 (owner: 10Muehlenhoff) [13:18:20] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37002/console" [puppet] - 10https://gerrit.wikimedia.org/r/826860 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [13:20:28] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 20+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) [13:21:28] (03PS3) 10JMeybohm: Update calico-crds to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943) [13:21:30] (03PS2) 10JMeybohm: Update calico to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) [13:22:37] (03CR) 10CI reject: [V: 04-1] Update calico to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:23:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33279 and previous config saved to /var/cache/conftool/dbconfig/20220826-132304-ladsgroup.json [13:23:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [13:23:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [13:23:28] (03CR) 10Btullis: [C: 03+1] "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/826845 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [13:23:43] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [13:27:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [13:27:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33280 and previous config saved to /var/cache/conftool/dbconfig/20220826-132751-ladsgroup.json [13:28:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33281 and previous config saved to /var/cache/conftool/dbconfig/20220826-132817-root.json [13:29:13] (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:30:12] (03PS1) 10Marostegui: pc1014: Move it to pc1 [puppet] - 10https://gerrit.wikimedia.org/r/826862 [13:30:45] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:20] (03CR) 10Marostegui: [C: 03+2] pc1014: Move it to pc1 [puppet] - 10https://gerrit.wikimedia.org/r/826862 (owner: 10Marostegui) [13:32:10] (03CR) 10Clément Goubert: ml-staging: finish profile::docker::storage cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826860 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément 
Goubert) [13:33:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33283 and previous config saved to /var/cache/conftool/dbconfig/20220826-133318-ladsgroup.json [13:34:12] (03CR) 10JMeybohm: "Pipeline fails with:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:36:02] (03PS3) 10JMeybohm: Update calico to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) [13:39:38] (03PS1) 10Muehlenhoff: codesearch: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826864 [13:41:54] (03PS1) 10Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - 10https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T315911) [13:42:37] (03CR) 10CI reject: [V: 04-1] varnish: Emit X-Varnish-Cluster for misc sites [puppet] - 10https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [13:43:07] (03PS1) 10Muehlenhoff: rancid: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826867 [13:43:42] (03CR) 10JMeybohm: R:profile::docker::engine::version removal and cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [13:44:01] (03CR) 10CI reject: [V: 04-1] rancid: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [13:44:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33284 and previous config saved to /var/cache/conftool/dbconfig/20220826-134426-ladsgroup.json [13:44:42] (03PS2) 10Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - 10https://gerrit.wikimedia.org/r/826866 
(https://phabricator.wikimedia.org/T315911) [13:44:59] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:26] (03PS2) 10Muehlenhoff: rancid: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826867 [13:47:25] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:32] (03PS3) 10Clément Goubert: R:profile::docker::engine::version removal and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341) [13:48:50] (03CR) 10Clément Goubert: R:profile::docker::engine::version removal and cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341) (owner: 10Clément Goubert) [13:49:35] (03PS1) 10Muehlenhoff: routinator: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/826869 [13:53:23] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:56:55] (03PS1) 10BBlack: WIP - send caching attribute to BE layer [puppet] - 10https://gerrit.wikimedia.org/r/826871 [13:58:53] (03PS3) 10Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - 10https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T315911) [13:59:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P33285 and previous config saved to /var/cache/conftool/dbconfig/20220826-135932-ladsgroup.json [14:00:41] (03PS2) 10BBlack: WIP - send caching attribute to BE layer [puppet] - 10https://gerrit.wikimedia.org/r/826871 [14:06:25] (03CR) 10JMeybohm: [C: 03+1] kubernetes: finish 
profile::docker::storage cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826827 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [14:06:31] (03CR) 10JMeybohm: [C: 03+1] ml-staging: finish profile::docker::storage cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826860 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [14:10:33] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 20+ - https://phabricator.wikimedia.org/T295690 (10Volans) [14:12:08] 10SRE, 10Image-Suggestions, 10serviceops: Setup Initial Image Suggestion Service CI and k8s params/stubs - https://phabricator.wikimedia.org/T305154 (10lbowmaker) [14:13:03] (03PS4) 10Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - 10https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T315911) [14:13:21] 10SRE, 10Image-Suggestions, 10serviceops, 10Patch-For-Review: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10lbowmaker) [14:14:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P33286 and previous config saved to /var/cache/conftool/dbconfig/20220826-141438-ladsgroup.json [14:18:20] (03CR) 10Alexandros Kosiaris: [C: 03+1] docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - 10https://gerrit.wikimedia.org/r/826674 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [14:25:30] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10lbowmaker) [14:29:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33288 and previous config saved to /var/cache/conftool/dbconfig/20220826-142945-ladsgroup.json [14:34:03] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33289 and previous config saved to /var/cache/conftool/dbconfig/20220826-143402-ladsgroup.json [14:38:43] !log rolling restart of backup1004-9, backup2004-9 [14:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:15] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/dse-k8s-ctrl on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:39] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/dse-k8s-ctrl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:47:45] RECOVERY - HP RAID on ms-be1054 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:48:48] (03PS5) 10Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - 10https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T316338) [14:49:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P33290 and previous config saved to /var/cache/conftool/dbconfig/20220826-144908-ladsgroup.json [14:51:19] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Failed disk in ms-be1066 - https://phabricator.wikimedia.org/T314143 (10Jclark-ctr) 05Open→03Resolved a:05Cmjohnson→03Jclark-ctr Replaced failed Drive [14:52:53] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Degraded RAID on ms-be1054 - https://phabricator.wikimedia.org/T315480 (10Jclark-ctr) 05Open→03Resolved Replaced failed Drive [14:54:03] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in 
ms-be1071 - https://phabricator.wikimedia.org/T315437 (10Jclark-ctr) @MatthewVernon Can these be swapped at anytime? [14:54:13] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:00:18] (03CR) 10Volans: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff) [15:03:15] 10SRE, 10Infrastructure-Foundations, 10netops: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney) p:05Triage→03Medium [15:04:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P33291 and previous config saved to /var/cache/conftool/dbconfig/20220826-150415-ladsgroup.json [15:04:52] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10jcrespo) @Jclark-ctr , Matthew is away on vacations- but I may be able to help you, do you need to shutdown the server for the disk change? [15:08:52] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): decom cookbook often fails to wipe drives in HP systems - https://phabricator.wikimedia.org/T316292 (10Volans) 05Open→03Invalid The error reported in T316285#8186856 clearly states: > **Unable to connect to the host, wipe of swra... [15:10:50] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Volans) >>! In T316285#8187029, @Andrew wrote: > @cmjohnson, this is another host that will need its drives wiped, as the cookbook seems... 
[15:19:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33292 and previous config saved to /var/cache/conftool/dbconfig/20220826-151921-ladsgroup.json [15:19:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:19:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:19:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:19:53] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Jclark-ctr) contint1002 B1 U38 port38 cableid 23000029 [15:19:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Jclark-ctr) [15:19:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:20:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T316186)', diff saved to https://phabricator.wikimedia.org/P33293 and previous config saved to /var/cache/conftool/dbconfig/20220826-152003-ladsgroup.json [15:23:52] (03CR) 10FNegri: "I did read the Phab task but I still have a couple questions ;)" [puppet] - 10https://gerrit.wikimedia.org/r/790710 (https://phabricator.wikimedia.org/T290494) (owner: 10Majavah) [15:29:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[01] - https://phabricator.wikimedia.org/T313873 (10Jclark-ctr) @akosiaris Can you verify host names? 
kubernetes102[01] Already in use Racking task T290202 [15:30:56] (03PS6) 10Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - 10https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T316338) [15:32:05] (03CR) 10BBlack: [C: 03+1] varnish: Emit X-Varnish-Cluster for misc sites [puppet] - 10https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [15:34:54] (03CR) 10Vgutierrez: "varnishtest is happy:" [puppet] - 10https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [15:38:02] (03PS1) 10Sbisson: Explicit config for Wikistories discovery module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) [15:38:35] (03CR) 10CI reject: [V: 04-1] Explicit config for Wikistories discovery module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson) [15:41:16] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10JAnstee_WMF) I checked today following the switch overnight - It seems we are still able to send invites, and it is still sending to spam via qualtri... 
[15:41:55] (03PS2) 10Sbisson: Explicit config for Wikistories discovery module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) [15:42:42] (03CR) 10CI reject: [V: 04-1] Explicit config for Wikistories discovery module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson) [15:46:05] (03PS7) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338) [15:46:35] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10Jclark-ctr) @jcrespo just wanted to make sure drives are able just be replaced they are hotswapable just want to verify prior to replacing [15:47:04] (03PS8) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338) [15:50:47] !log rolling restart of ms-backup1001,2, ms-backup2001,2 [15:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:03] (03PS9) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338) [15:52:41] (03CR) 10BBlack: [C: 03+1] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [15:56:33] (03PS1) 10Dzahn: trafficserver: remove search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) [15:57:48] (03PS1) 10Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) [15:59:18] (03PS2) 10Dzahn: httpbb: drop tests for 
search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) [16:01:24] (03CR) 10Dzahn: "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/826884 it can't get traffic anymore" [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [16:01:57] (03PS2) 10Dzahn: trafficserver: remove search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) [16:20:10] (03CR) 10Hnowlan: [C: 03+1] "lgtm, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/826847 (owner: 10Muehlenhoff) [16:20:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T316186)', diff saved to https://phabricator.wikimedia.org/P33295 and previous config saved to /var/cache/conftool/dbconfig/20220826-162019-ladsgroup.json [16:21:36] 10SRE, 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Dzahn) The first change above would remove it from ATS (trafficserver) config. That would be a one-line change that would result in this not g... [16:35:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P33296 and previous config saved to /var/cache/conftool/dbconfig/20220826-163525-ladsgroup.json [16:36:03] (03PS3) 10Sbisson: Explicit config for Wikistories discovery module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) [16:40:15] 10SRE, 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10mpopov) > The first change above would remove it from ATS (trafficserver) config. That would be a one-line change that would result in this no... 
[16:50:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P33297 and previous config saved to /var/cache/conftool/dbconfig/20220826-165032-ladsgroup.json [16:56:12] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@5d95fe5]: Add job for MediaWiki history dumps. [16:56:25] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@5d95fe5]: Add job for MediaWiki history dumps. (duration: 00m 13s) [16:59:39] (03CR) 10Dzahn: [C: 03+2] gerrit: allow nist kex algorithms on OpenSsh server [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:03:32] (03CR) 10Dzahn: [C: 03+2] "sshd has been refreshed by puppet on both gerrit servers, I can still ssh to them and watching replication.log everything looks normal" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [17:04:31] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:05:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T316186)', diff saved to https://phabricator.wikimedia.org/P33298 and previous config saved to /var/cache/conftool/dbconfig/20220826-170538-ladsgroup.json [17:05:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [17:05:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [17:06:45] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:08:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: 
Maintenance [17:09:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [17:09:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T316186)', diff saved to https://phabricator.wikimedia.org/P33299 and previous config saved to /var/cache/conftool/dbconfig/20220826-170911-ladsgroup.json [17:16:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T316186)', diff saved to https://phabricator.wikimedia.org/P33300 and previous config saved to /var/cache/conftool/dbconfig/20220826-171638-ladsgroup.json [17:28:54] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder) [17:30:28] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10KFrancis) @Ladsgroup I am confirming the signed NDA. Please proceed with the access request! Thanks! 
[17:31:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P33301 and previous config saved to /var/cache/conftool/dbconfig/20220826-173144-ladsgroup.json [17:44:12] (03PS1) 10Dzahn: admin: add Jonathan Fraine to ldap_only_admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/826895 (https://phabricator.wikimedia.org/T316044) [17:46:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P33302 and previous config saved to /var/cache/conftool/dbconfig/20220826-174651-ladsgroup.json [18:01:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T316186)', diff saved to https://phabricator.wikimedia.org/P33303 and previous config saved to /var/cache/conftool/dbconfig/20220826-180157-ladsgroup.json [18:02:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [18:02:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [18:02:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T316186)', diff saved to https://phabricator.wikimedia.org/P33304 and previous config saved to /var/cache/conftool/dbconfig/20220826-180223-ladsgroup.json [18:04:20] (03CR) 10Ssingh: "Looking at e5b62c8e9d0, it seems like we added docker tests on purpose. 
That further links to https://phabricator.wikimedia.org/T286639 an" [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [18:05:17] (03CR) 10Ssingh: [C: 03+1] varnish::common: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/826855 (owner: 10Muehlenhoff) [18:09:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T316186)', diff saved to https://phabricator.wikimedia.org/P33305 and previous config saved to /var/cache/conftool/dbconfig/20220826-180943-ladsgroup.json [18:24:18] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: add Jonathan Fraine to ldap_only_admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/826895 (https://phabricator.wikimedia.org/T316044) (owner: 10Dzahn) [18:24:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P33306 and previous config saved to /var/cache/conftool/dbconfig/20220826-182450-ladsgroup.json [18:26:36] (03CR) 10Dzahn: "thanks, added to LDAP groups on mwmaint1002" [puppet] - 10https://gerrit.wikimedia.org/r/826895 (https://phabricator.wikimedia.org/T316044) (owner: 10Dzahn) [18:28:19] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (10Dzahn) [18:30:49] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10Ladsgroup) 05Open→03Resolved a:03Dzahn Daniel did most of the work :) [18:31:57] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10Dzahn) @jdfraine You have been added to the same groups as other WMDE employees. The logins (and Gerrit privileges) should work now. 
[18:33:57] (03CR) 10Dzahn: "link courtesy of @Urbanecm (thanks): https://w.wiki/5d3s (843 hits in 30 days but sampled)" [puppet] - 10https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [18:35:35] 10SRE, 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Dzahn) link courtesy of @Urbanecm (thanks): https://w.wiki/5d3s (843 hits in 30 days but sampled) [18:38:42] 10SRE, 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Dzahn) An alternative way to shut it down would be to remove it first from DNS and later do everything else. Then potential users would just... [18:39:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P33307 and previous config saved to /var/cache/conftool/dbconfig/20220826-183956-ladsgroup.json [18:40:34] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@c5f46a4]: (no justification provided) [18:40:44] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@c5f46a4]: (no justification provided) (duration: 00m 10s) [18:55:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T316186)', diff saved to https://phabricator.wikimedia.org/P33308 and previous config saved to /var/cache/conftool/dbconfig/20220826-185502-ladsgroup.json [18:55:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [18:55:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [18:55:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T316186)', diff saved to https://phabricator.wikimedia.org/P33309 
and previous config saved to /var/cache/conftool/dbconfig/20220826-185527-ladsgroup.json [19:01:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T316186)', diff saved to https://phabricator.wikimedia.org/P33310 and previous config saved to /var/cache/conftool/dbconfig/20220826-190151-ladsgroup.json [19:16:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P33311 and previous config saved to /var/cache/conftool/dbconfig/20220826-191657-ladsgroup.json [19:20:29] (03CR) 10Dzahn: "opinions what's better - remove from ATS first or remove from DNS first? ("unknown domain" error page vs. NXDOMAIN)?" [puppet] - 10https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [19:32:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P33312 and previous config saved to /var/cache/conftool/dbconfig/20220826-193203-ladsgroup.json [19:47:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T316186)', diff saved to https://phabricator.wikimedia.org/P33313 and previous config saved to /var/cache/conftool/dbconfig/20220826-194709-ladsgroup.json [19:47:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [19:47:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [19:47:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T316186)', diff saved to https://phabricator.wikimedia.org/P33314 and previous config saved to /var/cache/conftool/dbconfig/20220826-194734-ladsgroup.json [19:47:35] (03CR) 10Dzahn: "this is already not a virtual host on cluster apache anymore ([mwdebug1001:/] $ sudo apache2ctl -S | grep 
vhost)" [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [19:48:17] (03PS3) 10Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) [19:53:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T316186)', diff saved to https://phabricator.wikimedia.org/P33315 and previous config saved to /var/cache/conftool/dbconfig/20220826-195351-ladsgroup.json [19:55:25] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P33316 and previous config saved to /var/cache/conftool/dbconfig/20220826-200858-ladsgroup.json [20:16:46] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse) This is the process I followed to update the UID/GID: 1. Backup... 
[20:19:43] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:24:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P33318 and previous config saved to /var/cache/conftool/dbconfig/20220826-202404-ladsgroup.json [20:32:15] (03CR) 10Dzahn: [C: 03+2] c:spamassassin remove cronjob, and use systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede) [20:33:41] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-poller-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:16] ^ Working on that one. [20:34:18] (03CR) 10Dzahn: [C: 03+2] "otrs1001 - Process: 29134 ExecStart=/usr/local/sbin/spamassassin_updates (code=exited, status=0/SUCCESS)" [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede) [20:34:20] I've already found the issue. [20:34:34] denisse|m: great! just started to wonder if that is the one [20:34:53] I think you can also merge your change. it has +1 now [20:36:05] (03CR) 10Dzahn: [C: 03+2] "thank you Slyngshede - looks good now on otrs1001 - will let you know if I ever see it again" [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede) [20:37:19] denisse|m: oh, it seems like you are pointing out that I should have also done it for both UID and GID [20:37:31] for 920 [20:38:33] Oh, I matched the UID and GID to be the same mostly for simplicity.
:) [20:38:57] I mean how in https://gerrit.wikimedia.org/r/c/operations/puppet/+/826427/9/modules/admin/data/data.yaml you are adding it in 2 locations [20:39:05] one of them has 920 above it and the other one does not [20:39:10] 920 was added by me [20:39:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T316186)', diff saved to https://phabricator.wikimedia.org/P33319 and previous config saved to /var/cache/conftool/dbconfig/20220826-203910-ladsgroup.json [20:39:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [20:39:18] looks like I forgot one of 2 places [20:39:24] but why is it duplicated [20:39:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [20:39:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T316186)', diff saved to https://phabricator.wikimedia.org/P33320 and previous config saved to /var/cache/conftool/dbconfig/20220826-203935-ladsgroup.json [20:39:51] The 'librenms-poller-all' is working again in netmon1003. [20:39:55] cool [20:40:19] Ah, I get what you mean. From what I understood in that file one part reserves the GID globally while the other one is specific for UID. :) [20:40:26] That's why I added it in both places. [20:40:47] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:58] seems like I need to send a fix and you helped me see it [20:41:17] Awesome! :D [20:41:25] but it is kind of the same thing in 2 places it feels [20:41:44] gid and then uid:gid [20:43:02] (03PS4) 10Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. 
[puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) [20:43:22] (03CR) 10Andrea Denisse: [V: 03+2] netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [20:43:25] (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [20:44:11] Yeah, I agree that it feels the same. :/ [20:44:11] We may be able to shrink it into a single section... [20:45:06] (03PS10) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [20:45:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T316186)', diff saved to https://phabricator.wikimedia.org/P33321 and previous config saved to /var/cache/conftool/dbconfig/20220826-204555-ladsgroup.json [20:46:42] (03PS1) 10Dzahn: admin: add reserved gid 920 for phd, phabricator user [puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360) [20:47:09] yea, but until then I will make it consistent [20:48:17] (03PS2) 10Dzahn: admin: add reserved gid 920 for phd, phabricator user [puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360) [20:48:19] (03PS11) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [20:48:45] (03CR) 10Dzahn: "It does feel a bit duplicate though, doesn't it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [21:00:11] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P33322 and previous config saved to /var/cache/conftool/dbconfig/20220826-210102-ladsgroup.json [21:03:17] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:03:47] (03CR) 10RLazarus: "Great, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [21:03:55] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:49] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse) [21:05:39] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:10:49] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse) [21:11:31] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse) [21:12:10] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse) [21:16:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P33323 and previous config saved to /var/cache/conftool/dbconfig/20220826-211608-ladsgroup.json [21:17:57] (03PS1) 10RLazarus: httpbb: Add $ensure to httpbb::test_suite [puppet] - 10https://gerrit.wikimedia.org/r/826919 [21:19:48] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37008/console" [puppet] - 10https://gerrit.wikimedia.org/r/826919 (owner: 10RLazarus) [21:20:36] (03CR) 10RLazarus: httpbb: Add $ensure to httpbb::test_suite [puppet] - 10https://gerrit.wikimedia.org/r/826919 (owner: 10RLazarus) [21:30:46] (03CR) 10RLazarus: httpbb: drop tests for search.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [21:31:15] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db2120 (T316186)', diff saved to https://phabricator.wikimedia.org/P33324 and previous config saved to /var/cache/conftool/dbconfig/20220826-213115-ladsgroup.json [21:31:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [21:31:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [21:31:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2118 (T316186)', diff saved to https://phabricator.wikimedia.org/P33325 and previous config saved to /var/cache/conftool/dbconfig/20220826-213140-ladsgroup.json [21:34:59] (03CR) 10Dzahn: [C: 03+1] "lgtm! https://puppet-compiler.wmflabs.org/pcc-worker1001/37008/cumin2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/826919 (owner: 10RLazarus) [21:35:36] (03CR) 10Dzahn: httpbb: drop tests for search.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [21:35:48] (03PS4) 10Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) [21:37:08] (03PS5) 10Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) [21:38:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T316186)', diff saved to https://phabricator.wikimedia.org/P33326 and previous config saved to /var/cache/conftool/dbconfig/20220826-213801-ladsgroup.json [21:39:11] (03PS6) 10Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) [21:44:37] (03CR) 10RLazarus: [C: 03+2] httpbb: Add $ensure to httpbb::test_suite [puppet] - 
10https://gerrit.wikimedia.org/r/826919 (owner: 10RLazarus) [21:45:41] (03CR) 10RLazarus: [C: 03+1] httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [21:49:30] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37009/cumin2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [21:50:46] (03CR) 10Dzahn: [C: 03+2] httpbb: drop tests for search.wikimedia.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [21:51:32] (03CR) 10Dzahn: [C: 03+2] "Notice: /Stage[main]/Profile::Httpbb/Httpbb::Test_suite[apple-search/test_search.yaml]/File[/srv/deployment/httpbb-tests/apple-search/test" [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn) [21:53:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118', diff saved to https://phabricator.wikimedia.org/P33327 and previous config saved to /var/cache/conftool/dbconfig/20220826-215307-ladsgroup.json [21:53:42] (03PS1) 10Dzahn: httpbb: remove absented file test_search [puppet] - 10https://gerrit.wikimedia.org/r/826923 [21:54:54] (03CR) 10Dzahn: "theoretically this could exist in cloud VPS though, I can wait and just merge it later" [puppet] - 10https://gerrit.wikimedia.org/r/826923 (owner: 10Dzahn) [22:07:47] (03CR) 10RLazarus: [C: 03+1] httpbb: remove absented file test_search (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826923 (owner: 10Dzahn) [22:08:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118', diff saved to https://phabricator.wikimedia.org/P33328 and previous config saved to /var/cache/conftool/dbconfig/20220826-220814-ladsgroup.json [22:22:15] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK 
- OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:23:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T316186)', diff saved to https://phabricator.wikimedia.org/P33329 and previous config saved to /var/cache/conftool/dbconfig/20220826-222320-ladsgroup.json [22:23:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:23:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:23:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33330 and previous config saved to /var/cache/conftool/dbconfig/20220826-222345-ladsgroup.json [22:24:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33331 and previous config saved to /var/cache/conftool/dbconfig/20220826-222409-ladsgroup.json [22:30:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33332 and previous config saved to /var/cache/conftool/dbconfig/20220826-223021-ladsgroup.json [22:41:23] (03CR) 10Cwhite: [C: 03+2] logstash: set ecs routing only when the output is logstash [puppet] - 10https://gerrit.wikimedia.org/r/826384 (https://phabricator.wikimedia.org/T305013) (owner: 10Cwhite) [22:45:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P33333 and previous config saved to /var/cache/conftool/dbconfig/20220826-224527-ladsgroup.json [23:00:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to 
https://phabricator.wikimedia.org/P33334 and previous config saved to /var/cache/conftool/dbconfig/20220826-230033-ladsgroup.json [23:08:28] (03CR) 10Dzahn: [C: 03+2] "every puppetmaster should have run by now" [puppet] - 10https://gerrit.wikimedia.org/r/826923 (owner: 10Dzahn) [23:09:33] /away (https://en.wikipedia.org/wiki/Thank_God_It%27s_Friday) [23:10:43] (03CR) 10Dzahn: [C: 03+2] "noop on cumin2002. laters" [puppet] - 10https://gerrit.wikimedia.org/r/826923 (owner: 10Dzahn) [23:14:45] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:15:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33335 and previous config saved to /var/cache/conftool/dbconfig/20220826-231540-ladsgroup.json [23:18:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33336 and previous config saved to /var/cache/conftool/dbconfig/20220826-231856-ladsgroup.json [23:34:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P33337 and previous config saved to /var/cache/conftool/dbconfig/20220826-233402-ladsgroup.json [23:49:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P33338 and previous config saved to /var/cache/conftool/dbconfig/20220826-234908-ladsgroup.json