[00:05:00] <icinga-wm>	 PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:31] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:06:07] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:08:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P33170 and previous config saved to /var/cache/conftool/dbconfig/20220826-000807-ladsgroup.json
[00:09:15] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:19] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P33171 and previous config saved to /var/cache/conftool/dbconfig/20220826-002313-ladsgroup.json
[00:28:23] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:45] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T312160)', diff saved to https://phabricator.wikimedia.org/P33172 and previous config saved to /var/cache/conftool/dbconfig/20220826-003819-ladsgroup.json
[00:38:25] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[00:51:27] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:06:04] <wikibugs>	 (CR) Ryan Kemper: [C: +1] deployment-prep: remove defunct elastic hosts [puppet] - https://gerrit.wikimedia.org/r/826630 (https://phabricator.wikimedia.org/T316240) (owner: Bking)
[01:15:27] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:16:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:25:05] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:29:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:32:17] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:57] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:44:46] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (wiki_willy) a:Jclark-ctr
[01:45:25] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T316194 (wiki_willy) a:Papaul
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:01] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T316194 (Papaul) Open→Declined This is duplicate of T314509
[01:51:22] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) I asked @Jclark-ctr to run the 40G fiber for row C and row D and he said he will get it done sometimes next week. Once the fiber in place I will update...
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:25] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:06:03] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:03] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:15:39] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:16:27] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:23:41] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:43] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:30:55] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:42:07] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:51:41] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:02:09] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:06:11] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:10:59] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:14:29] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:35] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:53] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:25] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:39:53] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:51:55] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:56:45] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:59:57] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:02:21] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:23] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:09:59] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:57] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:47] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:17:13] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:18:25] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:24:01] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:53] <icinga-wm>	 PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:29:19] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:30:31] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:31:17] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:43] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:43] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-worker1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:33] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:43:17] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:43:45] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:57] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-worker1007 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:41] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:35] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:00:39] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[05:05:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[05:06:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling for maintenance', diff saved to https://phabricator.wikimedia.org/P33173 and previous config saved to /var/cache/conftool/dbconfig/20220826-050652-ladsgroup.json
[05:07:05] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1054 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:07:55] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:08:42] <wikibugs>	 (PS1) Marostegui: db1185: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826689
[05:09:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P33174 and previous config saved to /var/cache/conftool/dbconfig/20220826-050906-ladsgroup.json
[05:10:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P33175 and previous config saved to /var/cache/conftool/dbconfig/20220826-051039-ladsgroup.json
[05:11:29] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:13:50] <wikibugs>	 (CR) Marostegui: [C: +2] db1185: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826689 (owner: Marostegui)
[05:15:11] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:14] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1185 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826690 (https://phabricator.wikimedia.org/T313569)
[05:16:09] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1185 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826690 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:17:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1185 for the first time in s5 T313569', diff saved to https://phabricator.wikimedia.org/P33176 and previous config saved to /var/cache/conftool/dbconfig/20220826-051721-marostegui.json
[05:17:27] <stashbot>	 T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569
[05:18:28] <wikibugs>	 (PS1) Marostegui: db1192: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826691 (https://phabricator.wikimedia.org/T313569)
[05:18:45] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:19:23] <wikibugs>	 (CR) Marostegui: [C: +2] db1192: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826691 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:20:40] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1192 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826693 (https://phabricator.wikimedia.org/T313569)
[05:21:36] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1192 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826693 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:22:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P33177 and previous config saved to /var/cache/conftool/dbconfig/20220826-052219-ladsgroup.json
[05:22:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1192 to dbctl T313569', diff saved to https://phabricator.wikimedia.org/P33178 and previous config saved to /var/cache/conftool/dbconfig/20220826-052233-marostegui.json
[05:22:37] <stashbot>	 T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569
[05:23:38] <wikibugs>	 (PS1) Marostegui: db1193: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826694 (https://phabricator.wikimedia.org/T313569)
[05:24:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P33179 and previous config saved to /var/cache/conftool/dbconfig/20220826-052410-ladsgroup.json
[05:24:17] <wikibugs>	 (CR) Marostegui: [C: +2] db1193: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826694 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:25:33] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1193 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826695 (https://phabricator.wikimedia.org/T313569)
[05:25:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P33180 and previous config saved to /var/cache/conftool/dbconfig/20220826-052544-ladsgroup.json
[05:26:21] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1193 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826695 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:27:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1193 to dbctl T313569', diff saved to https://phabricator.wikimedia.org/P33181 and previous config saved to /var/cache/conftool/dbconfig/20220826-052715-marostegui.json
[05:29:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:30:55] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:33:00] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:35:07] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:36:05] <wikibugs>	 (PS1) Marostegui: db1194: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826696 (https://phabricator.wikimedia.org/T313569)
[05:36:51] <wikibugs>	 (CR) Marostegui: [C: +2] db1194: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826696 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:37:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P33182 and previous config saved to /var/cache/conftool/dbconfig/20220826-053724-ladsgroup.json
[05:38:05] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1194 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826697 (https://phabricator.wikimedia.org/T313569)
[05:38:45] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1194 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826697 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:38:53] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:39:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P33183 and previous config saved to /var/cache/conftool/dbconfig/20220826-053915-ladsgroup.json
[05:39:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1194 for the first time in s7 T313569', diff saved to https://phabricator.wikimedia.org/P33184 and previous config saved to /var/cache/conftool/dbconfig/20220826-053954-marostegui.json
[05:39:58] <stashbot>	 T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569
[05:40:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119', diff saved to https://phabricator.wikimedia.org/P33185 and previous config saved to /var/cache/conftool/dbconfig/20220826-054023-root.json
[05:40:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P33186 and previous config saved to /var/cache/conftool/dbconfig/20220826-054048-ladsgroup.json
[05:41:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33188 and previous config saved to /var/cache/conftool/dbconfig/20220826-054102-root.json
[05:43:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33189 and previous config saved to /var/cache/conftool/dbconfig/20220826-054334-root.json
[05:43:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33190 and previous config saved to /var/cache/conftool/dbconfig/20220826-054356-root.json
[05:43:59] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:15] <wikibugs>	 (PS1) Marostegui: wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/826698 (https://phabricator.wikimedia.org/T316186)
[05:45:58] <wikibugs>	 (CR) Ladsgroup: [C: +1] wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/826698 (https://phabricator.wikimedia.org/T316186) (owner: Marostegui)
[05:46:17] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/826698 (https://phabricator.wikimedia.org/T316186) (owner: Marostegui)
[05:46:23] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:47:03] <marostegui>	 !log Failover m2-master
[05:47:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 1%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33191 and previous config saved to /var/cache/conftool/dbconfig/20220826-054722-root.json
[05:51:55] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:52:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P33192 and previous config saved to /var/cache/conftool/dbconfig/20220826-055229-ladsgroup.json
[05:54:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P33193 and previous config saved to /var/cache/conftool/dbconfig/20220826-055420-ladsgroup.json
[05:55:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P33194 and previous config saved to /var/cache/conftool/dbconfig/20220826-055553-ladsgroup.json
[05:56:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33195 and previous config saved to /var/cache/conftool/dbconfig/20220826-055607-root.json
[05:58:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33196 and previous config saved to /var/cache/conftool/dbconfig/20220826-055839-root.json
[05:58:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[05:59:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33197 and previous config saved to /var/cache/conftool/dbconfig/20220826-055900-root.json
[05:59:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[05:59:58] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:01:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[06:01:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[06:01:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33198 and previous config saved to /var/cache/conftool/dbconfig/20220826-060146-ladsgroup.json
[06:02:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33199 and previous config saved to /var/cache/conftool/dbconfig/20220826-060203-ladsgroup.json
[06:02:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 2%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33200 and previous config saved to /var/cache/conftool/dbconfig/20220826-060227-root.json
[06:05:36] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:06:58] <wikibugs>	 (03CR) 10Ayounsi: Add BGP neighbor data for the new dse-k8s cluster (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis)
[06:06:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33201 and previous config saved to /var/cache/conftool/dbconfig/20220826-060658-ladsgroup.json
[06:07:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P33202 and previous config saved to /var/cache/conftool/dbconfig/20220826-060734-ladsgroup.json
[06:11:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33203 and previous config saved to /var/cache/conftool/dbconfig/20220826-061112-root.json
[06:13:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33204 and previous config saved to /var/cache/conftool/dbconfig/20220826-061344-root.json
[06:14:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33205 and previous config saved to /var/cache/conftool/dbconfig/20220826-061405-root.json
[06:17:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 3%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33206 and previous config saved to /var/cache/conftool/dbconfig/20220826-061732-root.json
[06:19:48] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:22:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P33207 and previous config saved to /var/cache/conftool/dbconfig/20220826-062205-ladsgroup.json
[06:26:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33208 and previous config saved to /var/cache/conftool/dbconfig/20220826-062616-root.json
[06:27:56] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:28:10] <icinga-wm>	 RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:28:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33209 and previous config saved to /var/cache/conftool/dbconfig/20220826-062849-root.json
[06:29:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33210 and previous config saved to /var/cache/conftool/dbconfig/20220826-062910-root.json
[06:32:19] <icinga-wm>	 PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:32:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 5%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33211 and previous config saved to /var/cache/conftool/dbconfig/20220826-063237-root.json
[06:34:04] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Configure logrotate to rotate logs as the 'librenms' user. [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393)
[06:37:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P33212 and previous config saved to /var/cache/conftool/dbconfig/20220826-063711-ladsgroup.json
[06:38:00] <wikibugs>	 (03CR) 10Andrea Denisse: "This issue happens because the directory belongs to the 'librenms' group. The directory is not world writable (drwxrwxr-x www-data librenm" [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse)
[06:39:57] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:41:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33213 and previous config saved to /var/cache/conftool/dbconfig/20220826-064121-root.json
[06:43:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33214 and previous config saved to /var/cache/conftool/dbconfig/20220826-064353-root.json
[06:44:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33215 and previous config saved to /var/cache/conftool/dbconfig/20220826-064414-root.json
[06:47:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33216 and previous config saved to /var/cache/conftool/dbconfig/20220826-064742-root.json
[06:49:08] <wikibugs>	 (03CR) 10Muehlenhoff: Stop reporting releng images to debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826211 (owner: 10Muehlenhoff)
[06:51:07] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33217 and previous config saved to /var/cache/conftool/dbconfig/20220826-065217-ladsgroup.json
[06:54:19] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] P:dumps: remove ipv4/ipv6 separation from internal_rsync_clients [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah)
[06:55:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33218 and previous config saved to /var/cache/conftool/dbconfig/20220826-065533-ladsgroup.json
[06:56:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33219 and previous config saved to /var/cache/conftool/dbconfig/20220826-065626-root.json
[06:58:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33220 and previous config saved to /var/cache/conftool/dbconfig/20220826-065858-root.json
[06:59:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33221 and previous config saved to /var/cache/conftool/dbconfig/20220826-065919-root.json
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220826T0700)
[07:00:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10MoritzMuehlenhoff) Looking at CPU and disk usage (currently 150ish since some data is now on Swift) and the desired RAM, servers with "config A" would do just fine.
[07:00:26] <wikibugs>	 (03PS2) 10ArielGlenn: add php7.4 install to the snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/824690 (https://phabricator.wikimedia.org/T271736)
[07:01:17] <wikibugs>	 (03CR) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse)
[07:01:29] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:29] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] add php7.4 install to the snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/824690 (https://phabricator.wikimedia.org/T271736) (owner: 10ArielGlenn)
[07:02:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33222 and previous config saved to /var/cache/conftool/dbconfig/20220826-070247-root.json
[07:03:47] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:04:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] netmon: Reserve UID/GID for the LibreNMS system user. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse)
[07:05:17] <wikibugs>	 (03PS5) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[07:05:41] <wikibugs>	 (03PS6) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[07:05:44] <wikibugs>	 (03PS1) 10Ladsgroup: Stop writing to old templatelinks fields in commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826773 (https://phabricator.wikimedia.org/T312865)
[07:06:51] <wikibugs>	 (03CR) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse)
[07:08:11] <wikibugs>	 (03PS7) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[07:08:32] <wikibugs>	 (03PS8) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[07:10:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P33223 and previous config saved to /var/cache/conftool/dbconfig/20220826-071039-ladsgroup.json
[07:11:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33224 and previous config saved to /var/cache/conftool/dbconfig/20220826-071131-root.json
[07:14:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33225 and previous config saved to /var/cache/conftool/dbconfig/20220826-071403-root.json
[07:14:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33226 and previous config saved to /var/cache/conftool/dbconfig/20220826-071424-root.json
[07:16:12] <wikibugs>	 (03PS9) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[07:16:14] <wikibugs>	 (03PS3) 10Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388)
[07:16:25] <wikibugs>	 10SRE, 10serviceops, 10serviceops-collab, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Ladsgroup) I add serviceops, I know it's a bit of stretch but that's the one that makes the most sense. Please change to another team if you think there is a better...
[07:16:40] <wikibugs>	 10SRE, 10serviceops, 10serviceops-collab, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Ladsgroup) p:05Triage→03Medium
[07:16:46] <wikibugs>	 (03CR) 10Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. (0315 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse)
[07:17:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33227 and previous config saved to /var/cache/conftool/dbconfig/20220826-071751-root.json
[07:18:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse)
[07:22:15] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Increase roll-out of query-sorting to 15% [puppet] - 10https://gerrit.wikimedia.org/r/826601 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori)
[07:23:26] <jynus>	 denisse|m: the alerts "CRITICAL - degraded: The following units failed: logrotate.service" will be solved when T315393 is resolved, did I understand the ticket right?
[07:23:26] <stashbot>	 T315393: Logrotate is unable to rotate LibreNMS logs in the netmon instances due to insuficient permissions to read and write log files in /var/log/ - https://phabricator.wikimedia.org/T315393
[07:23:36] <vgutierrez>	 !log Increase roll-out of query-sorting to 15% - T314868
[07:23:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:41] <stashbot>	 T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868
[07:24:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2025.codfw.wmnet with OS bullseye
[07:24:17] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bullseye
[07:24:17] <denisse|m>	 jynus: Yes, the patch is sent here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/826771
[07:24:45] <jynus>	 cool, thanks for the fix! I was looking at ongoing alerts
[07:25:15] <denisse|m>	 jynus: Sure thing, can I add you as a reviewer? :)
[07:25:26] <denisse|m>	 It may be a good idea so we can merge it now. ^^
[07:25:29] <jynus>	 sure
[07:25:34] <jynus>	 let me see
[07:25:42] <denisse|m>	 Possibly I could add you and moritzm as reviewers.
[07:25:45] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-worker1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:25:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P33228 and previous config saved to /var/cache/conftool/dbconfig/20220826-072545-ladsgroup.json
[07:26:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33229 and previous config saved to /var/cache/conftool/dbconfig/20220826-072635-root.json
[07:27:11] <denisse|m>	 jynus, moritzm I added a comment describing the issue and the rationale for the fix in here: https://phabricator.wikimedia.org/T315393#8187595
[07:27:52] <jynus>	 thanks, that helps, it's been a long time since I edited logrotate config
[07:29:03] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse)
[07:29:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33230 and previous config saved to /var/cache/conftool/dbconfig/20220826-072908-root.json
[07:29:25] <jynus>	 any worry, moritzm, regarding permissions ^
[07:29:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33231 and previous config saved to /var/cache/conftool/dbconfig/20220826-072929-root.json
[07:30:29] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:31:44] <jynus>	 the pcc may be referring to the wrong patch, though
[07:32:02] <denisse|m>	 Oh, let me take a double look at PCC.
[07:32:48] <jynus>	 I ran https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36986/
[07:32:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33232 and previous config saved to /var/cache/conftool/dbconfig/20220826-073256-root.json
[07:33:28] <denisse|m>	 jynus: Thanks, I ran it too. :P Let me stop this job: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36987/
[07:33:38] <wikibugs>	 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10ori) >>! In T315398#8167048, @ori wrote: > So 'powersave' with EPP=0 gives a broader range of operating frequencies than 'performance'. We should see if in th...
[07:34:08] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/36986/" [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse)
[07:34:48] <denisse|m>	 QQ jynus , do you know why the PCC says "no change"?
[07:35:09] <denisse|m>	 More specifically, I'm curious as to why it won't change the config file with that patch.
[07:35:32] <jynus>	 I'm not sure if the puppet compiler tracks imported files, or only puppet resources
[07:35:47] <jynus>	 so it won't be shown in the diff, but let me double check the code referring to it
[07:36:03] <awight>	 I have some confusion about how to build event-driven systems on Wikimedia data.  Let's say I want to build a production-cluster service with a local replica of wikidata summaries in every language, to prevent causing a high load on the wikidata query service.  Do I watch the resource_change kafka topic, make an API request to wikidata to fetch every changed item, and then cache the summary 
[07:36:09] <awight>	 text?  This would shift the load onto the API server so I don't feel good about the idea.
[07:37:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse)
[07:38:05] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-worker1006 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:38:11] <jynus>	 denisse|m: yeah, I think it looks good to me
[07:38:30] <denisse|m>	 Thanks Jaime. :)
[07:38:49] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] netmon: Configure logrotate to rotate logs as the 'librenms' user. [puppet] - 10https://gerrit.wikimedia.org/r/826771 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse)
[07:38:54] <jynus>	 merge and hopefully we can get rid of those 3 alarms! :-D
[07:39:38] <denisse|m>	 Merged, running puppet in the netmon instances.
[07:39:55] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:40:36] <wikibugs>	 (03PS4) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578
[07:40:40] <wikibugs>	 (03PS11) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870)
[07:40:51] <jynus>	 the actual puppet run should show the diff, however
[07:40:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33233 and previous config saved to /var/cache/conftool/dbconfig/20220826-074052-ladsgroup.json
[07:40:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[07:41:15] <denisse|m>	 jynus: I confirm that the files are updated after puppet run. 😉
[07:41:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[07:41:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2025.codfw.wmnet with reason: host reimage
[07:41:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33234 and previous config saved to /var/cache/conftool/dbconfig/20220826-074126-ladsgroup.json
[07:41:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1185 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33235 and previous config saved to /var/cache/conftool/dbconfig/20220826-074140-root.json
[07:42:01] <icinga-wm>	 RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:02] <jynus>	 yeah, so I remembered there is some limitation with the compiler, but I didn't remember the details: it won't show diffs of imported files
[07:42:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33236 and previous config saved to /var/cache/conftool/dbconfig/20220826-074252-ladsgroup.json
[07:43:09] <denisse|m>	 I confirm that the alert is gone with the new config.
[07:43:17] <jynus>	 nice!
[07:44:00] <wikibugs>	 (03PS1) 10Majavah: P:toolforge:k8s:haproxy: increase 404 handler timeout [puppet] - 10https://gerrit.wikimedia.org/r/826779
[07:44:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33237 and previous config saved to /var/cache/conftool/dbconfig/20220826-074412-root.json
[07:44:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33238 and previous config saved to /var/cache/conftool/dbconfig/20220826-074434-root.json
[07:44:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2025.codfw.wmnet with reason: host reimage
[07:45:56] <wikibugs>	 (03CR) 10Ladsgroup: Schedule image suggestions notifications (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie)
[07:46:41] <jynus>	 as we were reminded at the last meeting, hopefully we can have a cleaner alerts (by fixing errors or with acked/downtimed criticals) to improve the signal/noise ratio
[07:47:07] <jynus>	 *alerts dashboard
[07:47:48] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:48:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: Repooling after reboot', diff saved to https://phabricator.wikimedia.org/P33239 and previous config saved to /var/cache/conftool/dbconfig/20220826-074801-root.json
[07:48:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[07:49:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33240 and previous config saved to /var/cache/conftool/dbconfig/20220826-074905-ladsgroup.json
[07:49:12] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/826779 (owner: 10Majavah)
[07:58:03] <wikibugs>	 (03PS12) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870)
[07:58:05] <wikibugs>	 (03PS1) 10David Caro: tox: use the default python3 for the system [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/826782
[08:01:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2025.codfw.wmnet with OS bullseye
[08:01:21] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bullseye completed: - ganeti2025 (**PASS**)   - Downtimed on...
[08:02:20] <icinga-wm>	 RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P33241 and previous config saved to /var/cache/conftool/dbconfig/20220826-080411-ladsgroup.json
[08:07:14] <wikibugs>	 (03PS1) 10Ori: Increase roll-out of query-sorting to 30% [puppet] - 10https://gerrit.wikimedia.org/r/826783 (https://phabricator.wikimedia.org/T314868)
[08:09:10] <wikibugs>	 (03PS5) 10Vgutierrez: trafficserver: Get rid of disable_coalescing() in Lua [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911)
[08:09:12] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785
[08:10:08] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:19] <wikibugs>	 (03PS6) 10Matthias Mullie: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024)
[08:10:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
[08:12:52] <wikibugs>	 (03CR) 10Matthias Mullie: Schedule image suggestions notifications (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie)
[08:13:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 (owner: 10Vgutierrez)
[08:15:12] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:16:42] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P33242 and previous config saved to /var/cache/conftool/dbconfig/20220826-081918-ladsgroup.json
[08:19:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
[08:20:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar)
[08:20:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2025.codfw.wmnet to cluster codfw and group D
[08:21:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2025.codfw.wmnet to cluster codfw and group D
[08:22:58] <wikibugs>	 (03PS2) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785
[08:24:11] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "trafficserver: Enable origin coalescing in cp600[78]" [puppet] - 10https://gerrit.wikimedia.org/r/826616
[08:24:45] <wikibugs>	 (03PS7) 10Matthias Mullie: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024)
[08:25:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10ayounsi) One off script for that, tested on netbox-next: https://netbox-next.wikimedia.org/ipam/fhrp-groups/ `lang=python,name=Move VRRP IPs to FHRP...
[08:26:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785 (owner: 10Vgutierrez)
[08:27:09] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add btullis to users to allow for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis)
[08:31:09] <wikibugs>	 (03PS3) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/826785
[08:34:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33243 and previous config saved to /var/cache/conftool/dbconfig/20220826-083424-ladsgroup.json
[08:35:50] <wikibugs>	 (PS5) Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - https://gerrit.wikimedia.org/r/826578
[08:39:13] <wikibugs>	 (CR) CI reject: [V: -1] Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - https://gerrit.wikimedia.org/r/826578 (owner: Muehlenhoff)
[08:43:26] <wikibugs>	 (PS6) Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - https://gerrit.wikimedia.org/r/826578
[08:43:37] <wikibugs>	 (PS1) Majavah: openstack: keystone: enable app credentials on codfw1dev [puppet] - https://gerrit.wikimedia.org/r/826792 (https://phabricator.wikimedia.org/T294195)
[08:44:29] <wikibugs>	 (CR) Ori: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36988/console" [puppet] - https://gerrit.wikimedia.org/r/826783 (https://phabricator.wikimedia.org/T314868) (owner: Ori)
[08:44:39] <wikibugs>	 (PS2) Majavah: openstack: keystone: enable app credentials on codfw1dev [puppet] - https://gerrit.wikimedia.org/r/826792 (https://phabricator.wikimedia.org/T294195)
[08:44:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33244 and previous config saved to /var/cache/conftool/dbconfig/20220826-084441-ladsgroup.json
[08:45:30] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36990/console" [puppet] - https://gerrit.wikimedia.org/r/826792 (https://phabricator.wikimedia.org/T294195) (owner: Majavah)
[08:46:04] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:46:43] <wikibugs>	 (CR) Vgutierrez: [C: +2] Increase roll-out of query-sorting to 30% [puppet] - https://gerrit.wikimedia.org/r/826783 (https://phabricator.wikimedia.org/T314868) (owner: Ori)
[08:47:08] <vgutierrez>	 !log Increase roll-out of query-sorting to 30% - T314868
[08:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:12] <stashbot>	 T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868
[08:49:14] <wikibugs>	 (CR) FNegri: ceph.bootstrap_and_add: add support to change the osd class type (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[08:55:05] <wikibugs>	 (CR) FNegri: [C: +1] "LGTM, but I'm still pretty new to our OpenStack setup and to OpenStack in general, so I'd love another pair of eyes." [puppet] - https://gerrit.wikimedia.org/r/826792 (https://phabricator.wikimedia.org/T294195) (owner: Majavah)
[08:56:14] <wikibugs>	 SRE, Search-Console-access-request: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (SCherukuwada)
[08:59:08] <wikibugs>	 (PS4) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/826785
[08:59:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P33245 and previous config saved to /var/cache/conftool/dbconfig/20220826-085947-ladsgroup.json
[09:00:47] <wikibugs>	 (PS1) Muehlenhoff: Allow cookbooks to handle restarts based on running one of more commands [cookbooks] - https://gerrit.wikimedia.org/r/826798
[09:00:49] <wikibugs>	 (CR) Vgutierrez: [C: +2] trafficserver: Get rid of disable_coalescing() in Lua (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911) (owner: Vgutierrez)
[09:02:45] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:05:25] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:08:51] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] kubestage: cleanup profile::docker::storage [puppet] - https://gerrit.wikimedia.org/r/826229 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[09:11:47] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:14:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P33246 and previous config saved to /var/cache/conftool/dbconfig/20220826-091454-ladsgroup.json
[09:18:10] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] ml-staging: cleanup profile::docker::storage [puppet] - https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[09:20:06] <wikibugs>	 (PS3) Clément Goubert: ml-staging: cleanup profile::docker::storage [puppet] - https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977)
[09:20:13] <wikibugs>	 (CR) Clément Goubert: [V: +2] ml-staging: cleanup profile::docker::storage [puppet] - https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[09:21:49] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:23:01] <wikibugs>	 (PS2) Vgutierrez: Revert "trafficserver: Enable origin coalescing in cp600[78]" [puppet] - https://gerrit.wikimedia.org/r/826616 (https://phabricator.wikimedia.org/T315911)
[09:24:28] <wikibugs>	 (PS1) Jcrespo: mariadb: Fix host db2130 removed from puppet by mistake [puppet] - https://gerrit.wikimedia.org/r/826802
[09:26:03] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] kubernetes: cleanup profile::docker::storage [puppet] - https://gerrit.wikimedia.org/r/826236 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[09:29:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:30:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33247 and previous config saved to /var/cache/conftool/dbconfig/20220826-093000-ladsgroup.json
[09:30:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[09:30:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[09:30:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33248 and previous config saved to /var/cache/conftool/dbconfig/20220826-093034-ladsgroup.json
[09:30:35] <wikibugs>	 (PS1) Cathal Mooney: Sub-delegation of reverse DNS entries for 185.15.57.16/29 to WMCS [dns] - https://gerrit.wikimedia.org/r/826803 (https://phabricator.wikimedia.org/T315955)
[09:30:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33249 and previous config saved to /var/cache/conftool/dbconfig/20220826-093051-ladsgroup.json
[09:32:31] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:32:37] <wikibugs>	 (PS2) Cathal Mooney: Sub-delegation of reverse DNS entries for 185.15.57.16/29 to WMCS [dns] - https://gerrit.wikimedia.org/r/826803 (https://phabricator.wikimedia.org/T315955)
[09:32:51] <wikibugs>	 (CR) Vgutierrez: [C: +2] Revert "trafficserver: Enable origin coalescing in cp600[78]" [puppet] - https://gerrit.wikimedia.org/r/826616 (https://phabricator.wikimedia.org/T315911) (owner: Vgutierrez)
[09:33:31] <vgutierrez>	 !log disable origin coalescing in cp6007 and cp6008 - T315911
[09:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:36] <stashbot>	 T315911: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911
[09:35:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33250 and previous config saved to /var/cache/conftool/dbconfig/20220826-093558-ladsgroup.json
[09:38:45] <wikibugs>	 (CR) Hashar: Stop reporting releng images to debmonitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826211 (owner: Muehlenhoff)
[09:39:43] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:40:03] <wikibugs>	 (CR) Hashar: [C: +1] Stop reporting releng images to debmonitor [puppet] - https://gerrit.wikimedia.org/r/826211 (owner: Muehlenhoff)
[09:41:17] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:43:38] <wikibugs>	 SRE, DynamicPageList (Wikimedia), serviceops, Patch-For-Review, and 7 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (Ladsgroup) p:High→Medium We added max execution time of ten seconds to all DPL queries, that'd mitigate part of the risk, so I'm redu...
[09:43:54] <wikibugs>	 (CR) Volans: "FYI comment inline" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[09:44:36] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] ml-serve: cleanup profile::docker::storage [puppet] - https://gerrit.wikimedia.org/r/826238 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[09:44:38] <wikibugs>	 (PS5) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/826785
[09:46:21] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Fix host db2130 removed from puppet by mistake [puppet] - https://gerrit.wikimedia.org/r/826802 (owner: Jcrespo)
[09:51:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P33251 and previous config saved to /var/cache/conftool/dbconfig/20220826-095104-ladsgroup.json
[09:51:21] <wikibugs>	 (PS2) Btullis: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174)
[09:51:39] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[09:51:56] <wikibugs>	 (CR) Btullis: Add BGP neighbor data for the new dse-k8s cluster (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) (owner: Btullis)
[09:53:59] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:17] <wikibugs>	 (CR) David Caro: tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[09:55:39] <wikibugs>	 (PS4) Clément Goubert: C:profile::docker::storage removal and cleanup [puppet] - https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977)
[09:56:13] <vgutierrez>	 !log testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785 in cp6016
[09:56:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:17] <wikibugs>	 (PS2) Jcrespo: mariadb: Fix host db2130 removed from puppet by mistake [puppet] - https://gerrit.wikimedia.org/r/826802
[09:56:23] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[09:57:03] <jynus>	 there was an increase on codfw requests, this alert will likely be noisy as codfw traffic starts ramping up
[09:57:22] <wikibugs>	 (CR) Clément Goubert: [C: +2] "Just a rebase" [puppet] - https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[09:57:34] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb: Fix host db2130 removed from puppet by mistake [puppet] - https://gerrit.wikimedia.org/r/826802 (owner: Jcrespo)
[09:59:48] <wikibugs>	 (PS1) Muehlenhoff: Starting with Bullseye the systemd unit for systemd-logind uses ProtectSystem=strict, which doesn't work with HDFS and results in a failing systemd-logind service. [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123)
[10:01:07] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:22] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) (owner: Muehlenhoff)
[10:03:36] <wikibugs>	 (CR) CI reject: [V: -1] Starting with Bullseye the systemd unit for systemd-logind uses ProtectSystem=strict, which doesn't work with HDFS and results in a failing systemd-logind service. [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) (owner: Muehlenhoff)
[10:04:47] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops, Patch-For-Review: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (cmooney) Added above patch to delegate this range to the WMCS name servers.  I hadn't checked the naming convention previously, I do actually...
[10:06:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P33252 and previous config saved to /var/cache/conftool/dbconfig/20220826-100611-ladsgroup.json
[10:06:34] <wikibugs>	 (CR) Volans: "reply inline" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[10:09:40] <wikibugs>	 (CR) Hashar: [C: +1] gerrit: allow nist kex algorithms on OpenSsh server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: Hashar)
[10:10:42] <wikibugs>	 (CR) Ayounsi: [C: +1] "All good! I also checked the generated config locally with Junoser" [homer/public] - https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) (owner: Btullis)
[10:10:46] <wikibugs>	 (CR) David Caro: tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[10:11:41] <wikibugs>	 (PS2) Muehlenhoff: Exclude /mnt from systemd-logind restrictions on Bullseye and later [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123)
[10:12:55] <wikibugs>	 (CR) Muehlenhoff: [C: +1] gerrit: allow nist kex algorithms on OpenSsh server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: Hashar)
[10:13:12] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) (owner: Muehlenhoff)
[10:13:30] <vgutierrez>	 !log stop testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785 in cp6016
[10:13:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:14] <wikibugs>	 (PS1) JMeybohm: Update calico to v3.23.3 [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943)
[10:15:36] <wikibugs>	 (CR) CI reject: [V: -1] Update calico to v3.23.3 [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[10:21:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T316186)', diff saved to https://phabricator.wikimedia.org/P33253 and previous config saved to /var/cache/conftool/dbconfig/20220826-102117-ladsgroup.json
[10:21:31] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Auth extremely slow on clouddumps100[12] - https://phabricator.wikimedia.org/T316123 (MoritzMuehlenhoff) Can you give https://gerrit.wikimedia.org/r/c/operations/puppet/+/826806/ a shot on clouddumps? It should addre...
[10:23:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33254 and previous config saved to /var/cache/conftool/dbconfig/20220826-102334-ladsgroup.json
[10:25:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33255 and previous config saved to /var/cache/conftool/dbconfig/20220826-102510-ladsgroup.json
[10:25:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[10:25:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[10:29:45] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-worker1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:29:54] <wikibugs>	 (CR) Btullis: [C: +1] "LGTM, thanks." [puppet] - https://gerrit.wikimedia.org/r/826806 (https://phabricator.wikimedia.org/T316123) (owner: Muehlenhoff)
[10:33:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[10:33:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[10:33:50] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons.
[10:36:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[10:36:55] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-worker1005 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[10:37:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T316186)', diff saved to https://phabricator.wikimedia.org/P33256 and previous config saved to /var/cache/conftool/dbconfig/20220826-103707-ladsgroup.json
[10:43:00] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[10:44:11] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons.
[10:44:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T316186)', diff saved to https://phabricator.wikimedia.org/P33257 and previous config saved to /var/cache/conftool/dbconfig/20220826-104427-ladsgroup.json
[10:44:31] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[10:45:45] <wikibugs>	 (PS6) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338)
[10:46:22] <wikibugs>	 SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeteadly - https://phabricator.wikimedia.org/T316337 (jcrespo)
[10:47:16] <wikibugs>	 SRE, Traffic, Patch-For-Review: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 (Vgutierrez)
[10:47:21] <wikibugs>	 SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeteadly - https://phabricator.wikimedia.org/T316337 (Vgutierrez)
[10:47:39] <wikibugs>	 (PS5) Hashar: gerrit: allow nist kex algorithms on OpenSsh server [puppet] - https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942)
[10:47:56] <wikibugs>	 SRE, Traffic, Patch-For-Review: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 (Vgutierrez) An initial test of https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785/6/modules/profile/files/trafficserver/default.lua (PS6) in cp6016 triggered T316337
[10:48:23] <wikibugs>	 (CR) Hashar: gerrit: allow nist kex algorithms on OpenSsh server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: Hashar)
[10:48:41] <hashar>	 I am so overengineering things some time :)
[10:51:12] <wikibugs>	 SRE, Traffic, Patch-For-Review: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 (Vgutierrez) Open→In progress p:Triage→Medium
[10:55:04] <wikibugs>	 (CR) FNegri: "If I understand correctly, the advantage of this patch is that running 'tox' locally becomes faster because only one Python version is use" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[10:56:30] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: Hashar)
[10:59:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P33258 and previous config saved to /var/cache/conftool/dbconfig/20220826-105934-ladsgroup.json
[10:59:42] <wikibugs>	 SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeteadly - https://phabricator.wikimedia.org/T316337 (jcrespo) Preliminary working doc: https://docs.google.com/document/d/1Ka9MQB8OwdzAzJVfZuaIGo5VfnyRNRr_WxLPZ6YFMkE
[11:00:51] <wikibugs>	 (CR) David Caro: tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[11:12:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33259 and previous config saved to /var/cache/conftool/dbconfig/20220826-111234-root.json
[11:12:46] <wikibugs>	 SRE, Patch-For-Review, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (ori) Actually, let me not step on your toes. But if you can tolerate a short extension of this task, I would very much like to see this setting tested. I thin...
[11:13:43] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Marostegui)
[11:14:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P33260 and previous config saved to /var/cache/conftool/dbconfig/20220826-111440-ladsgroup.json
[11:15:09] <wikibugs>	 (CR) FNegri: tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[11:16:32] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops, Patch-For-Review: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (cmooney) Also just a  note on the setup of the WMCS DNS in general.  It seems BIND won't resolve any of these names because the CNAMEs on the...
[11:18:01] <wikibugs>	 (CR) FNegri: [C: +1] tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[11:18:37] <wikibugs>	 (CR) David Caro: tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[11:18:40] <wikibugs>	 (CR) FNegri: [C: +1] tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[11:19:21] <moritzm>	 !log uploaded intel-microcode 3.20220510.1~wmf9u1 to apt.wikimedia.org
[11:19:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:05] <wikibugs>	 (CR) FNegri: [C: +1] tox: use the default python3 for the system (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/826782 (owner: David Caro)
[11:27:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33261 and previous config saved to /var/cache/conftool/dbconfig/20220826-112739-root.json
[11:29:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T316186)', diff saved to https://phabricator.wikimedia.org/P33262 and previous config saved to /var/cache/conftool/dbconfig/20220826-112946-ladsgroup.json
[11:29:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[11:30:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[11:33:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[11:33:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[11:33:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33263 and previous config saved to /var/cache/conftool/dbconfig/20220826-113347-ladsgroup.json
[11:35:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33264 and previous config saved to /var/cache/conftool/dbconfig/20220826-113511-ladsgroup.json
[11:37:18] <moritzm>	 !log installing intel-microcode updates on stretch hosts
[11:37:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33265 and previous config saved to /var/cache/conftool/dbconfig/20220826-114008-ladsgroup.json
[11:42:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33266 and previous config saved to /var/cache/conftool/dbconfig/20220826-114243-root.json
[11:51:58] <wikibugs>	 (PS1) Clément Goubert: kubernetes: finish profile::docker::storage cleanup [puppet] - https://gerrit.wikimedia.org/r/826827 (https://phabricator.wikimedia.org/T315977)
[11:53:00] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36991/console" [puppet] - https://gerrit.wikimedia.org/r/826827 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[11:53:39] <wikibugs>	 SRE, Patch-For-Review, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (tstarling) >>! In T315398#8187684, @ori wrote: >>>! In T315398#8167048, @ori wrote: >> So 'powersave' with EPP=0 gives a broader range of operating frequencie...
[11:54:16] <wikibugs>	 (CR) Clément Goubert: [V: +1] "The masters slipped through the cleanup, this fixes it." [puppet] - https://gerrit.wikimedia.org/r/826827 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[11:55:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P33267 and previous config saved to /var/cache/conftool/dbconfig/20220826-115514-ladsgroup.json
[11:57:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33268 and previous config saved to /var/cache/conftool/dbconfig/20220826-115748-root.json
[11:58:45] <wikibugs>	 (PS1) Clément Goubert: kubestage: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826828 (https://phabricator.wikimedia.org/T316341)
[12:00:13] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:13] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:09:19] <wikibugs>	 (PS1) Clément Goubert: ml-staging: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826833 (https://phabricator.wikimedia.org/T316341)
[12:10:08] <wikibugs>	 (PS1) Muehlenhoff: prometheus-elasticsearch-exporter: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/826835
[12:10:18] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36993/console" [puppet] - https://gerrit.wikimedia.org/r/826833 (https://phabricator.wikimedia.org/T316341) (owner: Clément Goubert)
[12:10:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P33269 and previous config saved to /var/cache/conftool/dbconfig/20220826-121021-ladsgroup.json
[12:12:11] <wikibugs>	 (PS1) Btullis: Add a helmfile configuration for the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196)
[12:12:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33270 and previous config saved to /var/cache/conftool/dbconfig/20220826-121253-root.json
[12:14:22] <wikibugs>	 (PS2) Btullis: Add a helmfile configuration for the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196)
[12:19:16] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Join ARIN waiting list to request additional IPv4 resources. - https://phabricator.wikimedia.org/T288342 (cmooney) a:cmooney
[12:20:40] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Return AS43821 to RIPE - https://phabricator.wikimedia.org/T314471 (cmooney) In progress→Resolved This has been completed and records cleared up.
[12:21:02] <wikibugs>	 (PS1) Clément Goubert: kubernetes: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826840 (https://phabricator.wikimedia.org/T316341)
[12:21:58] <wikibugs>	 (PS1) Muehlenhoff: elasticsearch::tlsproxy: Unconditionally disable ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/826841
[12:22:01] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Complete testing of SONiC NOS / Dell network gear and write up - https://phabricator.wikimedia.org/T310901 (cmooney) Open→Resolved I'm going to close this task for now.  If, as seems likely, we wish to deploy Dell as an alternate vendor in production w...
[12:25:10] <wikibugs>	 (CR) Btullis: "I've taken a copy of the ml-serve.yaml values to begin with, but removed some namespaces and edited the IP addresses etc for dse-k8s-eqiad" [deployment-charts] - https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[12:25:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T316186)', diff saved to https://phabricator.wikimedia.org/P33271 and previous config saved to /var/cache/conftool/dbconfig/20220826-122527-ladsgroup.json
[12:26:07] <wikibugs>	 (PS1) Clément Goubert: ml-serve: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826842 (https://phabricator.wikimedia.org/T316341)
[12:26:35] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/826841 (owner: Muehlenhoff)
[12:27:15] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33272 and previous config saved to /var/cache/conftool/dbconfig/20220826-122758-root.json
[12:30:10] <wikibugs>	 (PS1) FNegri: Add cloudcephosd1030 to the Ceph pool [puppet] - https://gerrit.wikimedia.org/r/826843 (https://phabricator.wikimedia.org/T314870)
[12:31:22] <wikibugs>	 SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (Aklapper)
[12:31:44] <wikibugs>	 (PS1) Btullis: We wish to upgrade datahub to version 0.8.43 [deployment-charts] - https://gerrit.wikimedia.org/r/826844 (https://phabricator.wikimedia.org/T316336)
[12:31:51] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:32:31] <wikibugs>	 (PS1) Clément Goubert: dse-k8s: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826845 (https://phabricator.wikimedia.org/T316341)
[12:35:26] <wikibugs>	 (PS1) Muehlenhoff: profile::maps::tlsproxy: Unconditionally disable ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/826847
[12:37:18] <wikibugs>	 (PS1) Clément Goubert: deployment-server: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826849 (https://phabricator.wikimedia.org/T316341)
[12:37:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33273 and previous config saved to /var/cache/conftool/dbconfig/20220826-123743-ladsgroup.json
[12:41:44] <wikibugs>	 (PS1) Clément Goubert: releases: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826852 (https://phabricator.wikimedia.org/T316341)
[12:43:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33274 and previous config saved to /var/cache/conftool/dbconfig/20220826-124303-root.json
[12:47:33] <wikibugs>	 (PS1) Clément Goubert: builder: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826853 (https://phabricator.wikimedia.org/T316341)
[12:48:14] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/826847 (owner: Muehlenhoff)
[12:52:45] <wikibugs>	 (PS1) Muehlenhoff: varnish::common: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/826855
[12:52:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P33275 and previous config saved to /var/cache/conftool/dbconfig/20220826-125250-ladsgroup.json
[12:58:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33276 and previous config saved to /var/cache/conftool/dbconfig/20220826-125808-root.json
[12:59:28] <wikibugs>	 (PS1) Clément Goubert: R:profile::docker::engine::version removal and cleanup [puppet] - https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341)
[13:00:04] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, thanks." [puppet] - https://gerrit.wikimedia.org/r/826853 (https://phabricator.wikimedia.org/T316341) (owner: Clément Goubert)
[13:02:34] <wikibugs>	 (PS1) Muehlenhoff: mariadb::config: Remove old tmpfile hack [puppet] - https://gerrit.wikimedia.org/r/826858
[13:02:52] <wikibugs>	 (PS1) JMeybohm: Run helm dependency build before packaging [docker-images/docker-report] - https://gerrit.wikimedia.org/r/826859 (https://phabricator.wikimedia.org/T316347)
[13:03:03] <wikibugs>	 (CR) Bking: [C: +2] deployment-prep: remove defunct elastic hosts [puppet] - https://gerrit.wikimedia.org/r/826630 (https://phabricator.wikimedia.org/T316240) (owner: Bking)
[13:05:21] <wikibugs>	 (PS2) Clément Goubert: builder: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826853 (https://phabricator.wikimedia.org/T316341)
[13:07:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P33277 and previous config saved to /var/cache/conftool/dbconfig/20220826-130756-ladsgroup.json
[13:08:34] <wikibugs>	 (CR) Clément Goubert: kubestage: Remove profile::docker::engine::version [puppet] - https://gerrit.wikimedia.org/r/826828 (https://phabricator.wikimedia.org/T316341) (owner: Clément Goubert)
[13:09:29] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox: externally-hosted NEL report forwarders for more timely report reception - https://phabricator.wikimedia.org/T292870 (ayounsi)
[13:11:41] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/826792 (https://phabricator.wikimedia.org/T294195) (owner: Majavah)
[13:13:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33278 and previous config saved to /var/cache/conftool/dbconfig/20220826-131312-root.json
[13:14:46] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Standardize VRRP group IDs - https://phabricator.wikimedia.org/T260363 (ayounsi)
[13:14:50] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (ayounsi)
[13:16:01] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops, Patch-For-Review: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (Andrew) >>! In T315955#8188444, @cmooney wrote: > Also just a  note on the setup of the WMCS DNS in general. >  > It seems BIND won't resolve...
[13:16:36] <wikibugs>	 (PS1) Clément Goubert: ml-staging: finish profile::docker::storage cleanup [puppet] - https://gerrit.wikimedia.org/r/826860 (https://phabricator.wikimedia.org/T315977)
[13:17:08] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb::config: Remove old tmpfile hack [puppet] - https://gerrit.wikimedia.org/r/826858 (owner: Muehlenhoff)
[13:18:20] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37002/console" [puppet] - https://gerrit.wikimedia.org/r/826860 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[13:20:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade core routers to Junos 20+ - https://phabricator.wikimedia.org/T295690 (ayounsi)
[13:21:28] <wikibugs>	 (PS3) JMeybohm: Update calico-crds to v3.23.3 [deployment-charts] - https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943)
[13:21:30] <wikibugs>	 (PS2) JMeybohm: Update calico to v3.23.3 [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943)
[13:22:37] <wikibugs>	 (CR) CI reject: [V: -1] Update calico to v3.23.3 [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[13:23:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33279 and previous config saved to /var/cache/conftool/dbconfig/20220826-132304-ladsgroup.json
[13:23:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[13:23:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[13:23:28] <wikibugs>	 (CR) Btullis: [C: +1] "Thanks." [puppet] - https://gerrit.wikimedia.org/r/826845 (https://phabricator.wikimedia.org/T316341) (owner: Clément Goubert)
[13:23:43] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:27:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[13:27:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[13:27:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33280 and previous config saved to /var/cache/conftool/dbconfig/20220826-132751-ladsgroup.json
[13:28:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1194 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33281 and previous config saved to /var/cache/conftool/dbconfig/20220826-132817-root.json
[13:29:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:30:12] <wikibugs>	 (PS1) Marostegui: pc1014: Move it to pc1 [puppet] - https://gerrit.wikimedia.org/r/826862
[13:30:45] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:20] <wikibugs>	 (CR) Marostegui: [C: +2] pc1014: Move it to pc1 [puppet] - https://gerrit.wikimedia.org/r/826862 (owner: Marostegui)
[13:32:10] <wikibugs>	 (CR) Clément Goubert: ml-staging: finish profile::docker::storage cleanup [puppet] - https://gerrit.wikimedia.org/r/826860 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[13:33:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33283 and previous config saved to /var/cache/conftool/dbconfig/20220826-133318-ladsgroup.json
[13:34:12] <wikibugs>	 (CR) JMeybohm: "Pipeline fails with:" [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[13:36:02] <wikibugs>	 (PS3) JMeybohm: Update calico to v3.23.3 [deployment-charts] - https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943)
[13:39:38] <wikibugs>	 (PS1) Muehlenhoff: codesearch: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/826864
[13:41:54] <wikibugs>	 (PS1) Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T315911)
[13:42:37] <wikibugs>	 (CR) CI reject: [V: -1] varnish: Emit X-Varnish-Cluster for misc sites [puppet] - https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T315911) (owner: Vgutierrez)
[13:43:07] <wikibugs>	 (PS1) Muehlenhoff: rancid: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/826867
[13:43:42] <wikibugs>	 (CR) JMeybohm: R:profile::docker::engine::version removal and cleanup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341) (owner: Clément Goubert)
[13:44:01] <wikibugs>	 (CR) CI reject: [V: -1] rancid: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/826867 (owner: Muehlenhoff)
[13:44:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33284 and previous config saved to /var/cache/conftool/dbconfig/20220826-134426-ladsgroup.json
[13:44:42] <wikibugs>	 (PS2) Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T315911)
[13:44:59] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:26] <wikibugs>	 (PS2) Muehlenhoff: rancid: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/826867
[13:47:25] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:32] <wikibugs>	 (PS3) Clément Goubert: R:profile::docker::engine::version removal and cleanup [puppet] - https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341)
[13:48:50] <wikibugs>	 (CR) Clément Goubert: R:profile::docker::engine::version removal and cleanup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826856 (https://phabricator.wikimedia.org/T316341) (owner: Clément Goubert)
[13:49:35] <wikibugs>	 (PS1) Muehlenhoff: routinator: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/826869
[13:53:23] <icinga-wm>	 PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:56:55] <wikibugs>	 (PS1) BBlack: WIP - send caching attribute to BE layer [puppet] - https://gerrit.wikimedia.org/r/826871
[13:58:53] <wikibugs>	 (PS3) Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T315911)
[13:59:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P33285 and previous config saved to /var/cache/conftool/dbconfig/20220826-135932-ladsgroup.json
[14:00:41] <wikibugs>	 (PS2) BBlack: WIP - send caching attribute to BE layer [puppet] - https://gerrit.wikimedia.org/r/826871
[14:06:25] <wikibugs>	 (CR) JMeybohm: [C: +1] kubernetes: finish profile::docker::storage cleanup [puppet] - https://gerrit.wikimedia.org/r/826827 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[14:06:31] <wikibugs>	 (CR) JMeybohm: [C: +1] ml-staging: finish profile::docker::storage cleanup [puppet] - https://gerrit.wikimedia.org/r/826860 (https://phabricator.wikimedia.org/T315977) (owner: Clément Goubert)
[14:10:33] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade core routers to Junos 20+ - https://phabricator.wikimedia.org/T295690 (Volans)
[14:12:08] <wikibugs>	 SRE, Image-Suggestions, serviceops: Setup Initial Image Suggestion Service CI and k8s params/stubs - https://phabricator.wikimedia.org/T305154 (lbowmaker)
[14:13:03] <wikibugs>	 (PS4) Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T315911)
[14:13:21] <wikibugs>	 SRE, Image-Suggestions, serviceops, Patch-For-Review: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (lbowmaker)
[14:14:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P33286 and previous config saved to /var/cache/conftool/dbconfig/20220826-141438-ladsgroup.json
[14:18:20] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - https://gerrit.wikimedia.org/r/826674 (https://phabricator.wikimedia.org/T308501) (owner: Dduvall)
[14:25:30] <wikibugs>	 SRE, Generated Data Platform, Image-Suggestions, serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (lbowmaker)
[14:29:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33288 and previous config saved to /var/cache/conftool/dbconfig/20220826-142945-ladsgroup.json
[14:34:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33289 and previous config saved to /var/cache/conftool/dbconfig/20220826-143402-ladsgroup.json
[14:38:43] <jynus>	 !log rolling restart of backup1004-9, backup2004-9
[14:38:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:15] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/eqiad/dse-k8s-ctrl on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:43:39] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/eqiad/dse-k8s-ctrl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:47:45] <icinga-wm>	 RECOVERY - HP RAID on ms-be1054 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:48:48] <wikibugs>	 (PS5) Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T316338)
[14:49:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P33290 and previous config saved to /var/cache/conftool/dbconfig/20220826-144908-ladsgroup.json
[14:51:19] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad: Failed disk in ms-be1066 - https://phabricator.wikimedia.org/T314143 (Jclark-ctr) Open→Resolved a:Cmjohnson→Jclark-ctr Replaced failed Drive
[14:52:53] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Degraded RAID on ms-be1054 - https://phabricator.wikimedia.org/T315480 (Jclark-ctr) Open→Resolved Replaced failed Drive
[14:54:03] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (Jclark-ctr) @MatthewVernon  Can these be swapped at anytime?
[14:54:13] <icinga-wm>	 RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:00:18] <wikibugs>	 (CR) Volans: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/826578 (owner: Muehlenhoff)
[15:03:15] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (cmooney) p:Triage→Medium
[15:04:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P33291 and previous config saved to /var/cache/conftool/dbconfig/20220826-150415-ladsgroup.json
[15:04:52] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (jcrespo) @Jclark-ctr , Matthew is away on vacations- but I may be able to help you, do you need to shutdown the server for the disk change?
[15:08:52] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): decom cookbook often fails to wipe drives in HP systems - https://phabricator.wikimedia.org/T316292 (Volans) Open→Invalid The error reported in T316285#8186856 clearly states:  > **Unable to connect to the host, wipe of swra...
[15:10:50] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (Volans) >>! In T316285#8187029, @Andrew wrote: > @cmjohnson, this is another host that will need its drives wiped, as the cookbook seems...
[15:19:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33292 and previous config saved to /var/cache/conftool/dbconfig/20220826-151921-ladsgroup.json
[15:19:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[15:19:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[15:19:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[15:19:53] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Jclark-ctr) contint1002 B1 U38  port38   cableid 23000029
[15:19:56] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Jclark-ctr)
[15:19:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[15:20:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T316186)', diff saved to https://phabricator.wikimedia.org/P33293 and previous config saved to /var/cache/conftool/dbconfig/20220826-152003-ladsgroup.json
[15:23:52] <wikibugs>	 (CR) FNegri: "I did read the Phab task but I still have a couple questions ;)" [puppet] - https://gerrit.wikimedia.org/r/790710 (https://phabricator.wikimedia.org/T290494) (owner: Majavah)
[15:29:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[01] - https://phabricator.wikimedia.org/T313873 (Jclark-ctr) @akosiaris  Can you verify host names?   kubernetes102[01] Already in use  Racking task T290202
[15:30:56] <wikibugs>	 (PS6) Vgutierrez: varnish: Emit X-Varnish-Cluster for misc sites [puppet] - https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T316338)
[15:32:05] <wikibugs>	 (CR) BBlack: [C: +1] varnish: Emit X-Varnish-Cluster for misc sites [puppet] - https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T316338) (owner: Vgutierrez)
[15:34:54] <wikibugs>	 (CR) Vgutierrez: "varnishtest is happy:" [puppet] - https://gerrit.wikimedia.org/r/826866 (https://phabricator.wikimedia.org/T316338) (owner: Vgutierrez)
[15:38:02] <wikibugs>	 (PS1) Sbisson: Explicit config for Wikistories discovery module [mediawiki-config] - https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582)
[15:38:35] <wikibugs>	 (CR) CI reject: [V: -1] Explicit config for Wikistories discovery module [mediawiki-config] - https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) (owner: Sbisson)
[15:41:16] <wikibugs>	 SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (JAnstee_WMF) I checked today following the switch overnight - It seems we are still able to send invites, and it is still sending to spam via qualtri...
[15:41:55] <wikibugs>	 (PS2) Sbisson: Explicit config for Wikistories discovery module [mediawiki-config] - https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582)
[15:42:42] <wikibugs>	 (CR) CI reject: [V: -1] Explicit config for Wikistories discovery module [mediawiki-config] - https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582) (owner: Sbisson)
[15:46:05] <wikibugs>	 (PS7) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338)
[15:46:35] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (Jclark-ctr) @jcrespo  just wanted to make sure drives are able just be replaced they are hotswapable just want to verify prior to replacing
[15:47:04] <wikibugs>	 (PS8) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338)
[15:50:47] <jynus>	 !log rolling restart of ms-backup1001,2, ms-backup2001,2
[15:50:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:03] <wikibugs>	 (PS9) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338)
[15:52:41] <wikibugs>	 (CR) BBlack: [C: +1] trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/826785 (https://phabricator.wikimedia.org/T316338) (owner: Vgutierrez)
[15:56:33] <wikibugs>	 (PS1) Dzahn: trafficserver: remove search.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296)
[15:57:48] <wikibugs>	 (PS1) Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296)
[15:59:18] <wikibugs>	 (PS2) Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296)
[16:01:24] <wikibugs>	 (CR) Dzahn: "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/826884 it can't get traffic anymore" [puppet] - https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: Dzahn)
[16:01:57] <wikibugs>	 (PS2) Dzahn: trafficserver: remove search.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296)
[16:20:10] <wikibugs>	 (CR) Hnowlan: [C: +1] "lgtm, thanks!" [puppet] - https://gerrit.wikimedia.org/r/826847 (owner: Muehlenhoff)
[16:20:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T316186)', diff saved to https://phabricator.wikimedia.org/P33295 and previous config saved to /var/cache/conftool/dbconfig/20220826-162019-ladsgroup.json
[16:21:36] <wikibugs>	 10SRE, 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Dzahn) The first change above would remove it from ATS (trafficserver) config. That would be a one-line change that would result in this not g...
[16:35:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P33296 and previous config saved to /var/cache/conftool/dbconfig/20220826-163525-ladsgroup.json
[16:36:03] <wikibugs>	 (03PS3) 10Sbisson: Explicit config for Wikistories discovery module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826882 (https://phabricator.wikimedia.org/T314582)
[16:40:15] <wikibugs>	 10SRE, 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10mpopov) > The first change above would remove it from ATS (trafficserver) config. That would be a one-line change that would result in this no...
[16:50:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P33297 and previous config saved to /var/cache/conftool/dbconfig/20220826-165032-ladsgroup.json
[16:56:12] <logmsgbot>	 !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@5d95fe5]: Add job for MediaWiki history dumps.
[16:56:25] <logmsgbot>	 !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@5d95fe5]: Add job for MediaWiki history dumps. (duration: 00m 13s)
[16:59:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: allow nist kex algorithms on OpenSsh server [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar)
[17:03:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "sshd has been refreshed by puppet on both gerrit servers, I can still ssh to them and watching replication.log everything looks normal" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar)
[17:04:31] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[17:05:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T316186)', diff saved to https://phabricator.wikimedia.org/P33298 and previous config saved to /var/cache/conftool/dbconfig/20220826-170538-ladsgroup.json
[17:05:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[17:05:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[17:06:45] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[17:08:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[17:09:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[17:09:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T316186)', diff saved to https://phabricator.wikimedia.org/P33299 and previous config saved to /var/cache/conftool/dbconfig/20220826-170911-ladsgroup.json
[17:16:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T316186)', diff saved to https://phabricator.wikimedia.org/P33300 and previous config saved to /var/cache/conftool/dbconfig/20220826-171638-ladsgroup.json
[17:28:54] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (10phaultfinder)
[17:30:28] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10KFrancis) @Ladsgroup I am confirming the signed NDA.  Please proceed with the access request!  Thanks!
[17:31:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P33301 and previous config saved to /var/cache/conftool/dbconfig/20220826-173144-ladsgroup.json
[17:44:12] <wikibugs>	 (03PS1) 10Dzahn: admin: add Jonathan Fraine to ldap_only_admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/826895 (https://phabricator.wikimedia.org/T316044)
[17:46:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P33302 and previous config saved to /var/cache/conftool/dbconfig/20220826-174651-ladsgroup.json
[18:01:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T316186)', diff saved to https://phabricator.wikimedia.org/P33303 and previous config saved to /var/cache/conftool/dbconfig/20220826-180157-ladsgroup.json
[18:02:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[18:02:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[18:02:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T316186)', diff saved to https://phabricator.wikimedia.org/P33304 and previous config saved to /var/cache/conftool/dbconfig/20220826-180223-ladsgroup.json
[18:04:20] <wikibugs>	 (03CR) 10Ssingh: "Looking at e5b62c8e9d0, it seems like we added docker tests on purpose. That further links to https://phabricator.wikimedia.org/T286639 an" [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall)
[18:05:17] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] varnish::common: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/826855 (owner: 10Muehlenhoff)
[18:09:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T316186)', diff saved to https://phabricator.wikimedia.org/P33305 and previous config saved to /var/cache/conftool/dbconfig/20220826-180943-ladsgroup.json
[18:24:18] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: add Jonathan Fraine to ldap_only_admins (wmde, nda) [puppet] - 10https://gerrit.wikimedia.org/r/826895 (https://phabricator.wikimedia.org/T316044) (owner: 10Dzahn)
[18:24:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P33306 and previous config saved to /var/cache/conftool/dbconfig/20220826-182450-ladsgroup.json
[18:26:36] <wikibugs>	 (03CR) 10Dzahn: "thanks, added to LDAP groups on mwmaint1002" [puppet] - 10https://gerrit.wikimedia.org/r/826895 (https://phabricator.wikimedia.org/T316044) (owner: 10Dzahn)
[18:28:19] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Data-Engineering-Operations: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (10Dzahn)
[18:30:49] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10Ladsgroup) 05Open→03Resolved a:03Dzahn Daniel did most of the work :)
[18:31:57] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for jdfraine - https://phabricator.wikimedia.org/T316044 (10Dzahn) @jdfraine You have been added to the same groups as other WMDE employees. The logins (and Gerrit privileges) should work now.
[18:33:57] <wikibugs>	 (03CR) 10Dzahn: "link courtesy of @Urbanecm (thanks): https://w.wiki/5d3s (843 hits in 30 days but sampled)" [puppet] - 10https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn)
[18:35:35] <wikibugs>	 10SRE, 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Dzahn) link courtesy of @Urbanecm (thanks): https://w.wiki/5d3s (843 hits in 30 days but sampled)
[18:38:42] <wikibugs>	 10SRE, 10serviceops, 10serviceops-collab, 10Patch-For-Review, 10Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Dzahn) An alternative way to shut it down would be to remove it first from DNS and later do everything else.  Then potential users would just...
[18:39:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P33307 and previous config saved to /var/cache/conftool/dbconfig/20220826-183956-ladsgroup.json
[18:40:34] <logmsgbot>	 !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@c5f46a4]: (no justification provided)
[18:40:44] <logmsgbot>	 !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@c5f46a4]: (no justification provided) (duration: 00m 10s)
[18:55:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T316186)', diff saved to https://phabricator.wikimedia.org/P33308 and previous config saved to /var/cache/conftool/dbconfig/20220826-185502-ladsgroup.json
[18:55:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[18:55:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[18:55:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T316186)', diff saved to https://phabricator.wikimedia.org/P33309 and previous config saved to /var/cache/conftool/dbconfig/20220826-185527-ladsgroup.json
[19:01:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T316186)', diff saved to https://phabricator.wikimedia.org/P33310 and previous config saved to /var/cache/conftool/dbconfig/20220826-190151-ladsgroup.json
[19:16:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P33311 and previous config saved to /var/cache/conftool/dbconfig/20220826-191657-ladsgroup.json
[19:20:29] <wikibugs>	 (03CR) 10Dzahn: "opinions what's better - remove from ATS first or remove from DNS first? ("unknown domain" error page  vs. NXDOMAIN)?" [puppet] - 10https://gerrit.wikimedia.org/r/826884 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn)
[19:32:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P33312 and previous config saved to /var/cache/conftool/dbconfig/20220826-193203-ladsgroup.json
[19:47:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T316186)', diff saved to https://phabricator.wikimedia.org/P33313 and previous config saved to /var/cache/conftool/dbconfig/20220826-194709-ladsgroup.json
[19:47:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[19:47:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[19:47:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T316186)', diff saved to https://phabricator.wikimedia.org/P33314 and previous config saved to /var/cache/conftool/dbconfig/20220826-194734-ladsgroup.json
[19:47:35] <wikibugs>	 (03CR) 10Dzahn: "this is already not a virtual host on cluster apache anymore ([mwdebug1001:/] $ sudo apache2ctl -S | grep vhost)" [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn)
[19:48:17] <wikibugs>	 (03PS3) 10Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296)
[19:53:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T316186)', diff saved to https://phabricator.wikimedia.org/P33315 and previous config saved to /var/cache/conftool/dbconfig/20220826-195351-ladsgroup.json
[19:55:25] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:08:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P33316 and previous config saved to /var/cache/conftool/dbconfig/20220826-200858-ladsgroup.json
[20:16:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse) This is the process I followed to update the UID/GID:  1. Backup...
[20:19:43] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:24:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P33318 and previous config saved to /var/cache/conftool/dbconfig/20220826-202404-ladsgroup.json
[20:32:15] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] c:spamassassin remove cronjob, and use systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede)
[20:33:41] <icinga-wm>	 PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-poller-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:34:16] <denisse|m>	 ^ Working on that one.
[20:34:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "otrs1001 -   Process: 29134 ExecStart=/usr/local/sbin/spamassassin_updates (code=exited, status=0/SUCCESS)" [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede)
[20:34:20] <denisse|m>	 I've already found the issue.
[20:34:34] <mutante>	 denisse|m: great! just started to wonder if that is the one
[20:34:53] <mutante>	 I think you can also merge your change. it has +1 now
[20:36:05] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thank you Slyngshede - looks good now on otrs1001 - will let you know if I ever see it again" [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede)
[20:37:19] <mutante>	 denisse|m: oh, it seems like you are pointing out that I should have also done it for both UID and GID
[20:37:31] <mutante>	 for 920
[20:38:33] <denisse|m>	 Oh, I matched the UID and GID to be the same mostly for simplicity. :)
[20:38:57] <mutante>	 I mean how in https://gerrit.wikimedia.org/r/c/operations/puppet/+/826427/9/modules/admin/data/data.yaml  you are adding it in 2 locations
[20:39:05] <mutante>	 one of them has 920 above it and the other one does not
[20:39:10] <mutante>	 920 was added by me
[20:39:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T316186)', diff saved to https://phabricator.wikimedia.org/P33319 and previous config saved to /var/cache/conftool/dbconfig/20220826-203910-ladsgroup.json
[20:39:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[20:39:18] <mutante>	 looks like I forgot one of 2 places
[20:39:24] <mutante>	 but why is it duplicated
[20:39:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[20:39:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T316186)', diff saved to https://phabricator.wikimedia.org/P33320 and previous config saved to /var/cache/conftool/dbconfig/20220826-203935-ladsgroup.json
[20:39:51] <denisse|m>	 The 'librenms-poller-all' is working again in netmon1003.
[20:39:55] <mutante>	 cool
[20:40:19] <denisse|m>	 Ah, I get what you mean. From what I understood in that file one part reserves the GID globally while the other one is specific for UID. :)
[20:40:26] <denisse|m>	 That's why I added it in both places.
[20:40:47] <icinga-wm>	 RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:40:58] <mutante>	 seems like I need to send a fix and you helped me see it
[20:41:17] <denisse|m>	 Awesome! :D
[20:41:25] <mutante>	 but it is kind of the same thing in 2 places it feels
[20:41:44] <mutante>	 gid and then uid:gid
[20:43:02] <wikibugs>	 (03PS4) 10Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388)
[20:43:22] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+2] netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse)
[20:43:25] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse)
[20:44:11] <denisse|m>	 Yeah, I agree that it feels the same. :/
[20:44:11] <denisse|m>	 We may be able to shrink it into a single section...
[20:45:06] <wikibugs>	 (03PS10) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[20:45:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T316186)', diff saved to https://phabricator.wikimedia.org/P33321 and previous config saved to /var/cache/conftool/dbconfig/20220826-204555-ladsgroup.json
[20:46:42] <wikibugs>	 (03PS1) 10Dzahn: admin: add reserved gid 920 for phd, phabricator user [puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360)
[20:47:09] <mutante>	 yea, but until then I will make it consistent
[20:48:17] <wikibugs>	 (03PS2) 10Dzahn: admin: add reserved gid 920 for phd, phabricator user [puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360)
[20:48:19] <wikibugs>	 (03PS11) 10Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[20:48:45] <wikibugs>	 (03CR) 10Dzahn: "It does feel a bit duplicate though, doesn't it?" [puppet] - 10https://gerrit.wikimedia.org/r/826915 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn)
[21:00:11] <icinga-wm>	 RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P33322 and previous config saved to /var/cache/conftool/dbconfig/20220826-210102-ladsgroup.json
[21:03:17] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[21:03:47] <wikibugs>	 (03CR) 10RLazarus: "Great, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn)
[21:03:55] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:04:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse)
[21:05:39] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[21:10:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse)
[21:11:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse)
[21:12:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse)
[21:16:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P33323 and previous config saved to /var/cache/conftool/dbconfig/20220826-211608-ladsgroup.json
[21:17:57] <wikibugs>	 (03PS1) 10RLazarus: httpbb: Add $ensure to httpbb::test_suite [puppet] - 10https://gerrit.wikimedia.org/r/826919
[21:19:48] <wikibugs>	 (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37008/console" [puppet] - 10https://gerrit.wikimedia.org/r/826919 (owner: 10RLazarus)
[21:20:36] <wikibugs>	 (03CR) 10RLazarus: httpbb: Add $ensure to httpbb::test_suite [puppet] - 10https://gerrit.wikimedia.org/r/826919 (owner: 10RLazarus)
[21:30:46] <wikibugs>	 (03CR) 10RLazarus: httpbb: drop tests for search.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn)
[21:31:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T316186)', diff saved to https://phabricator.wikimedia.org/P33324 and previous config saved to /var/cache/conftool/dbconfig/20220826-213115-ladsgroup.json
[21:31:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[21:31:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[21:31:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2118 (T316186)', diff saved to https://phabricator.wikimedia.org/P33325 and previous config saved to /var/cache/conftool/dbconfig/20220826-213140-ladsgroup.json
[21:34:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm! https://puppet-compiler.wmflabs.org/pcc-worker1001/37008/cumin2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/826919 (owner: 10RLazarus)
[21:35:36] <wikibugs>	 (03CR) 10Dzahn: httpbb: drop tests for search.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn)
[21:35:48] <wikibugs>	 (03PS4) 10Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296)
[21:37:08] <wikibugs>	 (03PS5) 10Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296)
[21:38:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T316186)', diff saved to https://phabricator.wikimedia.org/P33326 and previous config saved to /var/cache/conftool/dbconfig/20220826-213801-ladsgroup.json
[21:39:11] <wikibugs>	 (03PS6) 10Dzahn: httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296)
[21:44:37] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] httpbb: Add $ensure to httpbb::test_suite [puppet] - 10https://gerrit.wikimedia.org/r/826919 (owner: 10RLazarus)
[21:45:41] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] httpbb: drop tests for search.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn)
[21:49:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37009/cumin2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn)
[21:50:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb: drop tests for search.wikimedia.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn)
[21:51:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "Notice: /Stage[main]/Profile::Httpbb/Httpbb::Test_suite[apple-search/test_search.yaml]/File[/srv/deployment/httpbb-tests/apple-search/test" [puppet] - 10https://gerrit.wikimedia.org/r/826885 (https://phabricator.wikimedia.org/T316296) (owner: 10Dzahn)
[21:53:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118', diff saved to https://phabricator.wikimedia.org/P33327 and previous config saved to /var/cache/conftool/dbconfig/20220826-215307-ladsgroup.json
[21:53:42] <wikibugs>	 (03PS1) 10Dzahn: httpbb: remove absented file test_search [puppet] - 10https://gerrit.wikimedia.org/r/826923
[21:54:54] <wikibugs>	 (03CR) 10Dzahn: "theoretically this could exist in cloud VPS though, I can wait and just merge it later" [puppet] - 10https://gerrit.wikimedia.org/r/826923 (owner: 10Dzahn)
[22:07:47] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] httpbb: remove absented file test_search (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826923 (owner: 10Dzahn)
[22:08:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118', diff saved to https://phabricator.wikimedia.org/P33328 and previous config saved to /var/cache/conftool/dbconfig/20220826-220814-ladsgroup.json
[22:22:15] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:23:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2118 (T316186)', diff saved to https://phabricator.wikimedia.org/P33329 and previous config saved to /var/cache/conftool/dbconfig/20220826-222320-ladsgroup.json
[22:23:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:23:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:23:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33330 and previous config saved to /var/cache/conftool/dbconfig/20220826-222345-ladsgroup.json
[22:24:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33331 and previous config saved to /var/cache/conftool/dbconfig/20220826-222409-ladsgroup.json
[22:30:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33332 and previous config saved to /var/cache/conftool/dbconfig/20220826-223021-ladsgroup.json
[22:41:23] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: set ecs routing only when the output is logstash [puppet] - 10https://gerrit.wikimedia.org/r/826384 (https://phabricator.wikimedia.org/T305013) (owner: 10Cwhite)
[22:45:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P33333 and previous config saved to /var/cache/conftool/dbconfig/20220826-224527-ladsgroup.json
[23:00:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P33334 and previous config saved to /var/cache/conftool/dbconfig/20220826-230033-ladsgroup.json
[23:08:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "every puppetmaster should have run by now" [puppet] - 10https://gerrit.wikimedia.org/r/826923 (owner: 10Dzahn)
[23:09:33] <mutante>	  /away (https://en.wikipedia.org/wiki/Thank_God_It%27s_Friday)
[23:10:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop on cumin2002. laters" [puppet] - 10https://gerrit.wikimedia.org/r/826923 (owner: 10Dzahn)
[23:14:45] <icinga-wm>	 PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:15:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33335 and previous config saved to /var/cache/conftool/dbconfig/20220826-231540-ladsgroup.json
[23:18:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T316186)', diff saved to https://phabricator.wikimedia.org/P33336 and previous config saved to /var/cache/conftool/dbconfig/20220826-231856-ladsgroup.json
[23:34:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P33337 and previous config saved to /var/cache/conftool/dbconfig/20220826-233402-ladsgroup.json
[23:49:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P33338 and previous config saved to /var/cache/conftool/dbconfig/20220826-234908-ladsgroup.json