[00:00:10] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[00:00:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS buster
[00:09:41] <icinga-wm>	 RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:10:29] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[00:11:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P36934 and previous config saved to /var/cache/conftool/dbconfig/20221028-001124-ladsgroup.json
[00:22:55] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1067 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:26:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage
[00:26:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T318950)', diff saved to https://phabricator.wikimedia.org/P36935 and previous config saved to /var/cache/conftool/dbconfig/20221028-002631-ladsgroup.json
[00:26:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[00:26:37] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[00:26:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[00:26:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[00:27:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[00:27:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T318950)', diff saved to https://phabricator.wikimedia.org/P36936 and previous config saved to /var/cache/conftool/dbconfig/20221028-002708-ladsgroup.json
[00:28:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T318950)', diff saved to https://phabricator.wikimedia.org/P36937 and previous config saved to /var/cache/conftool/dbconfig/20221028-002816-ladsgroup.json
[00:29:55] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (20) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, clouddumps1001, clouddumps1002, cp4040, phab1004, releases1002, releases2002, relforge1003, relforge1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[00:31:26] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage
[00:38:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:39:43] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1067 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:43:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P36938 and previous config saved to /var/cache/conftool/dbconfig/20221028-004322-ladsgroup.json
[00:48:44] <wikibugs>	 (PS1) Ssingh: cp4048: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850307 (https://phabricator.wikimedia.org/T317244)
[00:49:11] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:26] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4040.ulsfo.wmnet with OS buster
[00:58:13] <wikibugs>	 (CR) Ssingh: [C: +2] cp4048: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850307 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[00:58:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P36939 and previous config saved to /var/cache/conftool/dbconfig/20221028-005829-ladsgroup.json
[00:58:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS buster
[00:59:06] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4048.ulsfo.wmnet with OS buster
[01:00:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS buster
[01:09:51] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1067 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:13:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T318950)', diff saved to https://phabricator.wikimedia.org/P36940 and previous config saved to /var/cache/conftool/dbconfig/20221028-011335-ladsgroup.json
[01:13:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[01:13:41] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[01:13:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[01:13:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T318950)', diff saved to https://phabricator.wikimedia.org/P36941 and previous config saved to /var/cache/conftool/dbconfig/20221028-011357-ladsgroup.json
[01:13:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:15:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T318950)', diff saved to https://phabricator.wikimedia.org/P36942 and previous config saved to /var/cache/conftool/dbconfig/20221028-011505-ladsgroup.json
[01:18:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:23:05] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2020 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:26:27] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage
[01:27:13] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[01:29:43] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[01:30:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P36943 and previous config saved to /var/cache/conftool/dbconfig/20221028-013011-ladsgroup.json
[01:30:15] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage
[01:33:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (16) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:12] <icinga-wm>	 PROBLEM - DNS on labstore1007.mgmt is CRITICAL: Domain labstore1007.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:41:30] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:54] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Hi @Volans - no problem, we can scrap the idea of having a "recycled status" in Netbox.  For everything that gets deleted in Netbox, is there any feature or anyt...
[01:45:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P36944 and previous config saved to /var/cache/conftool/dbconfig/20221028-014517-ladsgroup.json
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:28] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:48] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4048.ulsfo.wmnet with OS buster
[02:00:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T318950)', diff saved to https://phabricator.wikimedia.org/P36945 and previous config saved to /var/cache/conftool/dbconfig/20221028-020024-ladsgroup.json
[02:00:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[02:00:31] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[02:00:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[02:00:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[02:00:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[02:00:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[02:01:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[02:01:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T318950)', diff saved to https://phabricator.wikimedia.org/P36946 and previous config saved to /var/cache/conftool/dbconfig/20221028-020117-ladsgroup.json
[02:04:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T318950)', diff saved to https://phabricator.wikimedia.org/P36947 and previous config saved to /var/cache/conftool/dbconfig/20221028-020425-ladsgroup.json
[02:07:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:39] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:49] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:19:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P36948 and previous config saved to /var/cache/conftool/dbconfig/20221028-021932-ladsgroup.json
[02:34:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P36949 and previous config saved to /var/cache/conftool/dbconfig/20221028-023438-ladsgroup.json
[02:47:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[02:49:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T318950)', diff saved to https://phabricator.wikimedia.org/P36950 and previous config saved to /var/cache/conftool/dbconfig/20221028-024944-ladsgroup.json
[02:49:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[02:49:51] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[02:50:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[02:50:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T318950)', diff saved to https://phabricator.wikimedia.org/P36951 and previous config saved to /var/cache/conftool/dbconfig/20221028-025006-ladsgroup.json
[02:51:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T318950)', diff saved to https://phabricator.wikimedia.org/P36952 and previous config saved to /var/cache/conftool/dbconfig/20221028-025113-ladsgroup.json
[03:05:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[03:06:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P36953 and previous config saved to /var/cache/conftool/dbconfig/20221028-030620-ladsgroup.json
[03:12:33] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[03:14:05] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:17:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[03:17:17] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[03:21:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P36954 and previous config saved to /var/cache/conftool/dbconfig/20221028-032127-ladsgroup.json
[03:24:51] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[03:30:05] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:03] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T318950)', diff saved to https://phabricator.wikimedia.org/P36955 and previous config saved to /var/cache/conftool/dbconfig/20221028-033633-ladsgroup.json
[03:36:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[03:36:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[03:36:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T318950)', diff saved to https://phabricator.wikimedia.org/P36956 and previous config saved to /var/cache/conftool/dbconfig/20221028-033654-ladsgroup.json
[03:39:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T318950)', diff saved to https://phabricator.wikimedia.org/P36957 and previous config saved to /var/cache/conftool/dbconfig/20221028-033902-ladsgroup.json
[03:39:09] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[03:42:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[03:54:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P36958 and previous config saved to /var/cache/conftool/dbconfig/20221028-035409-ladsgroup.json
[04:00:10] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[04:09:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P36959 and previous config saved to /var/cache/conftool/dbconfig/20221028-040915-ladsgroup.json
[04:17:41] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:24:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T318950)', diff saved to https://phabricator.wikimedia.org/P36960 and previous config saved to /var/cache/conftool/dbconfig/20221028-042421-ladsgroup.json
[04:24:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[04:24:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[04:24:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T318950)', diff saved to https://phabricator.wikimedia.org/P36961 and previous config saved to /var/cache/conftool/dbconfig/20221028-042443-ladsgroup.json
[04:24:45] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[04:25:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T318950)', diff saved to https://phabricator.wikimedia.org/P36962 and previous config saved to /var/cache/conftool/dbconfig/20221028-042550-ladsgroup.json
[04:40:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P36963 and previous config saved to /var/cache/conftool/dbconfig/20221028-044057-ladsgroup.json
[04:52:31] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:53:25] <wikibugs>	 (PS2) KartikMistry: Enable Section Translation in Hawaiian, Pashto and Xhosa WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/845573 (https://phabricator.wikimedia.org/T317289)
[04:56:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P36964 and previous config saved to /var/cache/conftool/dbconfig/20221028-045603-ladsgroup.json
[04:57:31] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[05:06:22] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2022-10-27-102021-production [deployment-charts] - https://gerrit.wikimedia.org/r/850315 (https://phabricator.wikimedia.org/T225494)
[05:06:51] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:11:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T318950)', diff saved to https://phabricator.wikimedia.org/P36965 and previous config saved to /var/cache/conftool/dbconfig/20221028-051110-ladsgroup.json
[05:11:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[05:11:17] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[05:11:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[05:19:55] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[05:25:37] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[05:34:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:49:45] <icinga-wm>	 PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:07:41] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:18:13] <wikibugs>	 (PS2) Hashar: opensearch: make upgrade-phatality.sh stricter [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440)
[06:28:57] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:41:17] <wikibugs>	 (CR) Hashar: "Patchset 2 is a rebase I have send back to Gerrit by mistake." [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: Hashar)
[06:48:57] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:50:41] <icinga-wm>	 RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221028T0700)
[07:04:01] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/850233 (owner: Jbond)
[07:29:51] <icinga-wm>	 RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:43:18] <wikibugs>	 SRE, ContentTranslation, Machine-Learning-Team, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (Pginer-WMF) p:Triage→High
[07:49:05] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:49:07] <wikibugs>	 (CR) Kosta Harlan: [C: +2] [labs] GrowthExperiments: Use d3.js with new impact module [mediawiki-config] - https://gerrit.wikimedia.org/r/850098 (https://phabricator.wikimedia.org/T318854) (owner: Kosta Harlan)
[07:49:53] <wikibugs>	 (Merged) jenkins-bot: [labs] GrowthExperiments: Use d3.js with new impact module [mediawiki-config] - https://gerrit.wikimedia.org/r/850098 (https://phabricator.wikimedia.org/T318854) (owner: Kosta Harlan)
[07:50:50] <wikibugs>	 (CR) Jaime Nuche: [C: +1] "Thanks for updating this!" [puppet] - https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: Ahmon Dancy)
[07:56:01] <wikibugs>	 (PS1) Slyngshede: data.yaml: Move user mfossati from restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/850409 (https://phabricator.wikimedia.org/T321772)
[07:56:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[07:57:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[07:57:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[07:58:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:00:10] <jinxer-wm>	 (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[08:01:13] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.45 ms
[08:17:57] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:20:06] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:21:25] <wikibugs>	 (PS1) Cathal Mooney: Update IP for netflow1002 [homer/public] - https://gerrit.wikimedia.org/r/850411
[08:24:40] <wikibugs>	 (Abandoned) Vgutierrez: ATS: Limit NUMA nodes usage on ats-tls [puppet] - https://gerrit.wikimedia.org/r/666871 (https://phabricator.wikimedia.org/T274888) (owner: Vgutierrez)
[08:25:23] <wikibugs>	 (Abandoned) Vgutierrez: Backport several fixes scheduled for 9.1.3 [debs/trafficserver] - https://gerrit.wikimedia.org/r/820121 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez)
[08:26:29] <wikibugs>	 (Abandoned) Vgutierrez: ATS: Clean libhwloc5 pin [puppet] - https://gerrit.wikimedia.org/r/578179 (owner: Vgutierrez)
[08:26:34] <wikibugs>	 (Abandoned) Vgutierrez: ATS: Remove libhwloc5 pin [puppet] - https://gerrit.wikimedia.org/r/578180 (owner: Vgutierrez)
[08:26:52] <wikibugs>	 (Abandoned) Vgutierrez: Release 8.1.0-unreleased-1wm1 [debs/trafficserver] (8.1.x) - https://gerrit.wikimedia.org/r/591308 (owner: Vgutierrez)
[08:27:16] <wikibugs>	 (Abandoned) Vgutierrez: Revert PR #7465 [debs/trafficserver] - https://gerrit.wikimedia.org/r/821777 (owner: Vgutierrez)
[08:27:29] <wikibugs>	 (Abandoned) Vgutierrez: Release 8.0.7-rc0-1wm3asan [debs/trafficserver] - https://gerrit.wikimedia.org/r/590993 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez)
[08:28:29] <wikibugs>	 (Abandoned) Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) (owner: Vgutierrez)
[08:29:23] <icinga-wm>	 RECOVERY - Host netflow1002 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms
[08:29:23] <icinga-wm>	 PROBLEM - Check systemd state on netflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:08] <wikibugs>	 (CR) Vgutierrez: prometheus: Add ats header/body size total metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) (owner: BCornwall)
[08:34:55] <jinxer-wm>	 (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[08:35:56] <wikibugs>	 (CR) Vgutierrez: "looks good, please provide a VTC for this." [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall)
[08:37:59] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/850153 (owner: Jbond)
[08:39:09] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37825/console" [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: BBlack)
[08:40:16] <wikibugs>	 SRE, SRE-swift-storage, Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public on Wikimedia Commons - https://phabricator.wikimedia.org/T321869 (Peachey88)
[08:40:44] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) p:Triage→Low
[08:41:55] <wikibugs>	 (CR) Jelto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[08:42:31] <wikibugs>	 (CR) Vgutierrez: [V: +1] "looks good, but profile::cache::haproxy::unified_certs needs to be updated to let the certs actually be deployed on the cp servers." [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: BBlack)
[08:42:44] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: -1] Add digicert-2022 to available unified set [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: BBlack)
[08:42:45] <wikibugs>	 (PS1) Muehlenhoff: Make ganeti4006 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/850413 (https://phabricator.wikimedia.org/T317247)
[08:44:07] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) >>! @jhathaway  > Thanks Brian for bringing up some alternative ideas! >>>! @bking  >> I wonder if our energies might be better spent searching for >> alt...
[08:44:48] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Make ganeti4006 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/850413 (https://phabricator.wikimedia.org/T317247) (owner: Muehlenhoff)
[08:47:06] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) >>! @bking  > Thanks! Your perspective as a both a Puppet expert and relative n00b like me is very much appreciated.  I hope you (and everyone else) will...
[08:49:28] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) >>>>! @bking  >> I agree, it will be very time-consuming and painful to move off Puppet. But the current situation also seems painful and untenable.    >>...
[08:49:48] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37826/console" [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[08:49:57] <wikibugs>	 (CR) Vgutierrez: [C: +1] readme: Add general notes for testing deps [software/acme-chief] - https://gerrit.wikimedia.org/r/848512 (owner: BCornwall)
[08:50:47] <wikibugs>	 (CR) Vgutierrez: [C: +1] trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:51:23] <wikibugs>	 (CR) Vgutierrez: [C: +1] docs: Remove outdated github/travis badges [debs/pybal] - https://gerrit.wikimedia.org/r/817918 (owner: Krinkle)
[08:54:30] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Update IP for netflow1002 [homer/public] - https://gerrit.wikimedia.org/r/850411 (owner: Cathal Mooney)
[08:55:01] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on netflow1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:55:08] <wikibugs>	 (Merged) jenkins-bot: Update IP for netflow1002 [homer/public] - https://gerrit.wikimedia.org/r/850411 (owner: Cathal Mooney)
[08:55:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet
[08:56:23] <icinga-wm>	 RECOVERY - Check systemd state on mw2334 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:49] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[09:02:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[09:03:00] <wikibugs>	 (PS1) Vgutierrez: aptrepo: Add thirdparty/haproxy26 [puppet] - https://gerrit.wikimedia.org/r/850416 (https://phabricator.wikimedia.org/T321775)
[09:03:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet
[09:05:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster eqiad and group A
[09:05:17] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4006.ulsfo.wmnet to cluster eqiad and group A
[09:05:32] <wikibugs>	 (CR) Btullis: Update the spark and spark-operator images (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[09:05:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1
[09:05:58] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1
[09:10:16] <wikibugs>	 (PS1) Vgutierrez: cache::haproxy: Allow choosing between HAProxy 2.4 and 2.6 [puppet] - https://gerrit.wikimedia.org/r/850417 (https://phabricator.wikimedia.org/T321775)
[09:10:44] <wikibugs>	 (CR) Klausman: [V: +1 C: +2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37827/console" [puppet] - https://gerrit.wikimedia.org/r/850191 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[09:11:28] <wikibugs>	 (CR) CI reject: [V: -1] cache::haproxy: Allow choosing between HAProxy 2.4 and 2.6 [puppet] - https://gerrit.wikimedia.org/r/850417 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez)
[09:13:16] <wikibugs>	 (CR) JMeybohm: "This will require the operator service account to have privileges to create secrets in the cluster/namespace where is runs. But from what " [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[09:13:21] <wikibugs>	 (PS2) Vgutierrez: cache::haproxy: Allow choosing between HAProxy 2.4 and 2.6 [puppet] - https://gerrit.wikimedia.org/r/850417 (https://phabricator.wikimedia.org/T321775)
[09:13:33] <wikibugs>	 (PS1) Klausman: wikilabels: maybe get the tuning.conf source part right [puppet] - https://gerrit.wikimedia.org/r/850418 (https://phabricator.wikimedia.org/T307389)
[09:14:24] <wikibugs>	 (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37828/console" [puppet] - https://gerrit.wikimedia.org/r/850418 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[09:14:45] <wikibugs>	 (CR) Klausman: [V: +1 C: +2] wikilabels: maybe get the tuning.conf source part right [puppet] - https://gerrit.wikimedia.org/r/850418 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[09:15:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance
[09:15:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance
[09:17:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1
[09:18:05] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (MatthewVernon) [I'm not saying we should move to Ansible necessarily, but wanted to respond to something said up-thread :)]  I've used Ansible a fair amount in p...
[09:18:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[09:18:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[09:18:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[09:18:59] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1
[09:19:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[09:19:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36967 and previous config saved to /var/cache/conftool/dbconfig/20221028-091912-marostegui.json
[09:19:18] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[09:20:46] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37829/console" [puppet] - https://gerrit.wikimedia.org/r/850417 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez)
[09:21:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36968 and previous config saved to /var/cache/conftool/dbconfig/20221028-092125-marostegui.json
[09:23:28] <wikibugs>	 (PS1) Klausman: wikilabels: actually install Postgres [puppet] - https://gerrit.wikimedia.org/r/850419 (https://phabricator.wikimedia.org/T307389)
[09:25:40] <wikibugs>	 (CR) CI reject: [V: -1] wikilabels: actually install Postgres [puppet] - https://gerrit.wikimedia.org/r/850419 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[09:30:45] <wikibugs>	 (PS1) Vgutierrez: cache::haproxy: Switch to HAProxy 2.6 on concurrency tracking instances [puppet] - https://gerrit.wikimedia.org/r/850420 (https://phabricator.wikimedia.org/T321775)
[09:31:29] <wikibugs>	 (CR) Elukey: wikilabels: actually install Postgres (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850419 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[09:32:36] <wikibugs>	 (CR) Klausman: wikilabels: actually install Postgres (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850419 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[09:33:29] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37831/console" [puppet] - https://gerrit.wikimedia.org/r/850420 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez)
[09:34:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:36:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P36969 and previous config saved to /var/cache/conftool/dbconfig/20221028-093631-marostegui.json
[09:37:20] <wikibugs>	 (Abandoned) Klausman: wikilabels: actually install Postgres [puppet] - https://gerrit.wikimedia.org/r/850419 (https://phabricator.wikimedia.org/T307389) (owner: Klausman)
[09:40:20] <wikibugs>	 (CR) Jelto: [V: +1 C: +1] "lgtm. I left one suggestion in-line. We can also refactor this later if you both agree, this is not blocking." [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[09:41:39] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (Vgutierrez) Open→In progress
[09:45:12] <wikibugs>	 (CR) Btullis: Update the spark and spark-operator images (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[09:46:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow1002.eqiad.wmnet
[09:51:09] <wikibugs>	 (PS1) Ladsgroup: auto_schema: Drop --include-masters option [software] - https://gerrit.wikimedia.org/r/850421
[09:51:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P36970 and previous config saved to /var/cache/conftool/dbconfig/20221028-095138-marostegui.json
[09:52:21] <wikibugs>	 (CR) Marostegui: [C: +1] auto_schema: Drop --include-masters option [software] - https://gerrit.wikimedia.org/r/850421 (owner: Ladsgroup)
[09:52:41] <wikibugs>	 (CR) Ladsgroup: [C: +2] auto_schema: Drop --include-masters option [software] - https://gerrit.wikimedia.org/r/850421 (owner: Ladsgroup)
[09:53:36] <moritzm>	 !log drain ganeti4003 for eventual decom T317247
[09:53:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:42] <stashbot>	 T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247
[09:53:54] <wikibugs>	 (Merged) jenkins-bot: auto_schema: Drop --include-masters option [software] - https://gerrit.wikimedia.org/r/850421 (owner: Ladsgroup)
[09:56:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow1002.eqiad.wmnet
[09:56:41] <icinga-wm>	 PROBLEM - Check systemd state on netflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36971 and previous config saved to /var/cache/conftool/dbconfig/20221028-100644-marostegui.json
[10:06:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[10:06:51] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[10:07:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[10:07:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T321123)', diff saved to https://phabricator.wikimedia.org/P36972 and previous config saved to /var/cache/conftool/dbconfig/20221028-100706-marostegui.json
[10:09:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T321123)', diff saved to https://phabricator.wikimedia.org/P36973 and previous config saved to /var/cache/conftool/dbconfig/20221028-100918-marostegui.json
[10:11:28] <wikibugs>	 (PS1) Matthias Mullie: Enable ImageSuggestions on ca, no, fi & hiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064)
[10:12:26] <wikibugs>	 (PS2) Matthias Mullie: Enable ImageSuggestions on ca, no, fi & huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064)
[10:13:35] <wikibugs>	 (PS1) Matthias Mullie: Schedule image suggestions for ca, no, fi & huwiki [puppet] - https://gerrit.wikimedia.org/r/850446 (https://phabricator.wikimedia.org/T300064)
[10:13:59] <wikibugs>	 (CR) Hnowlan: [C: +1] "lgtm" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) (owner: Vlad.shapik)
[10:14:25] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[10:14:43] <wikibugs>	 (CR) Matthias Mullie: [C: -1] "Not to be merged until confirmed with the communities." [puppet] - https://gerrit.wikimedia.org/r/850446 (https://phabricator.wikimedia.org/T300064) (owner: Matthias Mullie)
[10:14:53] <wikibugs>	 (CR) Matthias Mullie: [C: -1] "Not to be merged until confirmed with the communities." [mediawiki-config] - https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064) (owner: Matthias Mullie)
[10:17:44] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:19:16] <icinga-wm>	 RECOVERY - Check systemd state on netflow1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:24:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P36974 and previous config saved to /var/cache/conftool/dbconfig/20221028-102425-marostegui.json
[10:26:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on puppetdb-test2001.codfw.wmnet with reason: puppetdb 7/bookworm tests
[10:27:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on puppetdb-test2001.codfw.wmnet with reason: puppetdb 7/bookworm tests
[10:39:24] <wikibugs>	 (PS1) JMeybohm: Make Kubernetes version configurable [puppet] - https://gerrit.wikimedia.org/r/850449
[10:39:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P36975 and previous config saved to /var/cache/conftool/dbconfig/20221028-103932-marostegui.json
[10:41:29] <wikibugs>	 (CR) CI reject: [V: -1] Make Kubernetes version configurable [puppet] - https://gerrit.wikimedia.org/r/850449 (owner: JMeybohm)
[10:42:49] <wikibugs>	 SRE, ContentTranslation, Machine-Learning-Team, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (Pginer-WMF)
[10:47:44] <wikibugs>	 (PS2) JMeybohm: Make Kubernetes version configurable [puppet] - https://gerrit.wikimedia.org/r/850449
[10:54:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T321123)', diff saved to https://phabricator.wikimedia.org/P36976 and previous config saved to /var/cache/conftool/dbconfig/20221028-105438-marostegui.json
[10:54:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[10:54:45] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[10:54:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[10:55:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[10:55:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[10:55:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36977 and previous config saved to /var/cache/conftool/dbconfig/20221028-105520-marostegui.json
[10:57:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36978 and previous config saved to /var/cache/conftool/dbconfig/20221028-105733-marostegui.json
[10:57:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti4003.ulsfo.wmnet with reason: Remove from cluster for eventual decom
[10:58:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti4003.ulsfo.wmnet with reason: Remove from cluster for eventual decom
[11:00:01] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37833/console" [puppet] - https://gerrit.wikimedia.org/r/849543 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[11:00:32] <wikibugs>	 (PS1) Muehlenhoff: Remove ganeti4003 from Puppet [puppet] - https://gerrit.wikimedia.org/r/850451 (https://phabricator.wikimedia.org/T317247)
[11:03:15] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37834/console" [puppet] - https://gerrit.wikimedia.org/r/850449 (owner: JMeybohm)
[11:03:57] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove ganeti4003 from Puppet [puppet] - https://gerrit.wikimedia.org/r/850451 (https://phabricator.wikimedia.org/T317247) (owner: Muehlenhoff)
[11:05:34] <logmsgbot>	 !log cparle@deploy1002 Started deploy [airflow-dags/platform_eng@c849762]: (no justification provided)
[11:05:49] <logmsgbot>	 !log cparle@deploy1002 Finished deploy [airflow-dags/platform_eng@c849762]: (no justification provided) (duration: 00m 15s)
[11:07:22] <wikibugs>	 (PS1) AikoChou: ml-services: update revert-risk's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/850452 (https://phabricator.wikimedia.org/T321594)
[11:07:53] <wikibugs>	 (PS3) JMeybohm: Make Kubernetes version configurable [puppet] - https://gerrit.wikimedia.org/r/850449 (https://phabricator.wikimedia.org/T307943)
[11:09:33] <wikibugs>	 (PS1) Hnowlan: requirements: add missing pycurl package [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/850453
[11:11:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4003.ulsfo.wmnet
[11:12:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P36979 and previous config saved to /var/cache/conftool/dbconfig/20221028-111240-marostegui.json
[11:12:59] <wikibugs>	 (CR) Klausman: [C: +2] ml-services: update revert-risk's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/850452 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[11:15:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[11:16:32] <wikibugs>	 (Merged) jenkins-bot: ml-services: update revert-risk's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/850452 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[11:17:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1029 as es1 master, es1030 as es2 master, es1031 as es3 master', diff saved to https://phabricator.wikimedia.org/P36980 and previous config saved to /var/cache/conftool/dbconfig/20221028-111707-marostegui.json
[11:18:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1026 es1027 es1028 for upgrade', diff saved to https://phabricator.wikimedia.org/P36981 and previous config saved to /var/cache/conftool/dbconfig/20221028-111805-root.json
[11:20:29] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:23:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36982 and previous config saved to /var/cache/conftool/dbconfig/20221028-112317-root.json
[11:26:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:26:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti4003.ulsfo.wmnet
[11:26:42] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti4003.ulsfo.wmnet` - ganeti4003.ulsfo.wmnet (**PASS**)...
[11:27:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P36983 and previous config saved to /var/cache/conftool/dbconfig/20221028-112746-marostegui.json
[11:27:50] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:28:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36984 and previous config saved to /var/cache/conftool/dbconfig/20221028-112818-root.json
[11:33:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36985 and previous config saved to /var/cache/conftool/dbconfig/20221028-113324-root.json
[11:33:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:38:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36986 and previous config saved to /var/cache/conftool/dbconfig/20221028-113822-root.json
[11:42:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36987 and previous config saved to /var/cache/conftool/dbconfig/20221028-114253-marostegui.json
[11:42:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[11:42:59] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[11:43:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[11:43:10] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:43:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36988 and previous config saved to /var/cache/conftool/dbconfig/20221028-114323-root.json
[11:43:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:43:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T321123)', diff saved to https://phabricator.wikimedia.org/P36989 and previous config saved to /var/cache/conftool/dbconfig/20221028-114332-marostegui.json
[11:45:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T321123)', diff saved to https://phabricator.wikimedia.org/P36990 and previous config saved to /var/cache/conftool/dbconfig/20221028-114544-marostegui.json
[11:48:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36991 and previous config saved to /var/cache/conftool/dbconfig/20221028-114829-root.json
[11:53:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36992 and previous config saved to /var/cache/conftool/dbconfig/20221028-115327-root.json
[11:54:02] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) > but we're unable to migrate off a version that has been EOL for nearly 2 years without external help.  Let me first start by saying that if there was so...
[11:58:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36993 and previous config saved to /var/cache/conftool/dbconfig/20221028-115828-root.json
[12:00:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P36994 and previous config saved to /var/cache/conftool/dbconfig/20221028-120050-marostegui.json
[12:03:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36995 and previous config saved to /var/cache/conftool/dbconfig/20221028-120334-root.json
[12:08:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36996 and previous config saved to /var/cache/conftool/dbconfig/20221028-120832-root.json
[12:09:56] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[12:13:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36997 and previous config saved to /var/cache/conftool/dbconfig/20221028-121333-root.json
[12:15:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P36998 and previous config saved to /var/cache/conftool/dbconfig/20221028-121557-marostegui.json
[12:18:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36999 and previous config saved to /var/cache/conftool/dbconfig/20221028-121839-root.json
[12:23:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37000 and previous config saved to /var/cache/conftool/dbconfig/20221028-122337-root.json
[12:28:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37001 and previous config saved to /var/cache/conftool/dbconfig/20221028-122838-root.json
[12:30:47] <wikibugs>	 (PS1) Slyngshede: Bitu IDM, initial checkin [software/bitu] - https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410)
[12:31:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T321123)', diff saved to https://phabricator.wikimedia.org/P37002 and previous config saved to /var/cache/conftool/dbconfig/20221028-123103-marostegui.json
[12:31:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[12:31:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[12:31:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T321123)', diff saved to https://phabricator.wikimedia.org/P37003 and previous config saved to /var/cache/conftool/dbconfig/20221028-123125-marostegui.json
[12:32:40] <wikibugs>	 (CR) Slyngshede: "More or less standard everything. License is add as a LICENSE file. Do we want/need SPDX headers in each file?" [software/bitu] - https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: Slyngshede)
[12:33:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T321123)', diff saved to https://phabricator.wikimedia.org/P37004 and previous config saved to /var/cache/conftool/dbconfig/20221028-123337-marostegui.json
[12:33:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37005 and previous config saved to /var/cache/conftool/dbconfig/20221028-123344-root.json
[12:36:41] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Initial Django project setup - https://phabricator.wikimedia.org/T319410 (SLyngshede-WMF) I think it's a good idea to borrow the docker-compose idea from Striker. We already know that we'll need the LDAP container.
[12:37:04] <wikibugs>	 (CR) Jelto: [V: +1] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: Jelto)
[12:38:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37006 and previous config saved to /var/cache/conftool/dbconfig/20221028-123842-root.json
[12:43:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37007 and previous config saved to /var/cache/conftool/dbconfig/20221028-124343-root.json
[12:48:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37008 and previous config saved to /var/cache/conftool/dbconfig/20221028-124845-marostegui.json
[12:48:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37009 and previous config saved to /var/cache/conftool/dbconfig/20221028-124849-root.json
[12:53:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37010 and previous config saved to /var/cache/conftool/dbconfig/20221028-125346-root.json
[12:55:41] <wikibugs>	 (PS1) Jbond: interface_primary: dont flush facts on facter4 [puppet] - https://gerrit.wikimedia.org/r/850467
[12:56:00] <wikibugs>	 (PS1) Muehlenhoff: miscweb: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850468 (https://phabricator.wikimedia.org/T308013)
[12:56:02] <wikibugs>	 (PS1) Muehlenhoff: thanos: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850469 (https://phabricator.wikimedia.org/T308013)
[12:56:04] <wikibugs>	 (PS1) Muehlenhoff: mail: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850470 (https://phabricator.wikimedia.org/T308013)
[12:56:06] <wikibugs>	 (PS1) Muehlenhoff: dumps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850471 (https://phabricator.wikimedia.org/T308013)
[12:56:08] <wikibugs>	 (PS1) Muehlenhoff: base: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850472 (https://phabricator.wikimedia.org/T308013)
[12:56:10] <wikibugs>	 (PS1) Muehlenhoff: installserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850473 (https://phabricator.wikimedia.org/T308013)
[12:56:12] <wikibugs>	 (PS1) Muehlenhoff: parsoid: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850474 (https://phabricator.wikimedia.org/T308013)
[12:56:14] <wikibugs>	 (PS1) Muehlenhoff: mediabackup: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850475 (https://phabricator.wikimedia.org/T308013)
[12:56:16] <wikibugs>	 (PS1) Muehlenhoff: ci: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850476 (https://phabricator.wikimedia.org/T308013)
[12:56:18] <wikibugs>	 (PS1) Muehlenhoff: hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850477 (https://phabricator.wikimedia.org/T308013)
[12:56:39] <wikibugs>	 (CR) CI reject: [V: -1] interface_primary: dont flush facts on facter4 [puppet] - https://gerrit.wikimedia.org/r/850467 (owner: Jbond)
[12:57:39] <wikibugs>	 (PS2) Jbond: interface_primary: dont flush facts on facter4 [puppet] - https://gerrit.wikimedia.org/r/850467
[12:58:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37011 and previous config saved to /var/cache/conftool/dbconfig/20221028-125848-root.json
[13:02:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[13:02:40] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Create mw-api-int helmfile deployment - https://phabricator.wikimedia.org/T321895 (Clement_Goubert)
[13:03:14] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Create mw-api-ext helmfile deployment - https://phabricator.wikimedia.org/T321896 (Clement_Goubert)
[13:03:41] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Create mw-jobrunner helmfile deployment - https://phabricator.wikimedia.org/T321897 (Clement_Goubert)
[13:03:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37012 and previous config saved to /var/cache/conftool/dbconfig/20221028-130352-marostegui.json
[13:04:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37013 and previous config saved to /var/cache/conftool/dbconfig/20221028-130400-root.json
[13:04:17] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Create mw-videoscaler helmfile deployment - https://phabricator.wikimedia.org/T321899 (Clement_Goubert)
[13:04:40] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Create mw-web helmfile deployment - https://phabricator.wikimedia.org/T321900 (Clement_Goubert)
[13:08:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37014 and previous config saved to /var/cache/conftool/dbconfig/20221028-130851-root.json
[13:12:19] <wikibugs>	 (CR) CDanis: [C: +1] cache::haproxy: Switch to HAProxy 2.6 on concurrency tracking instances [puppet] - https://gerrit.wikimedia.org/r/850420 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez)
[13:12:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to drbd
[13:13:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37015 and previous config saved to /var/cache/conftool/dbconfig/20221028-131353-root.json
[13:15:57] <wikibugs>	 (CR) Jbond: [C: +2] interface_primary: dont flush facts on facter4 [puppet] - https://gerrit.wikimedia.org/r/850467 (owner: Jbond)
[13:18:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T321123)', diff saved to https://phabricator.wikimedia.org/P37016 and previous config saved to /var/cache/conftool/dbconfig/20221028-131858-marostegui.json
[13:19:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[13:19:05] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[13:19:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37017 and previous config saved to /var/cache/conftool/dbconfig/20221028-131905-root.json
[13:19:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[13:19:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P37018 and previous config saved to /var/cache/conftool/dbconfig/20221028-131920-marostegui.json
[13:20:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P37019 and previous config saved to /var/cache/conftool/dbconfig/20221028-132032-marostegui.json
[13:22:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to drbd
[13:23:27] <wikibugs>	 (PS1) Vgutierrez: mtail: Provide trafficserver_backend_cache_result_code_client_ttfb [puppet] - https://gerrit.wikimedia.org/r/850481 (https://phabricator.wikimedia.org/T321484)
[13:26:01] <wikibugs>	 (CR) CI reject: [V: -1] mtail: Provide trafficserver_backend_cache_result_code_client_ttfb [puppet] - https://gerrit.wikimedia.org/r/850481 (https://phabricator.wikimedia.org/T321484) (owner: Vgutierrez)
[13:28:11] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-etcd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:30:02] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond)
[13:35:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37020 and previous config saved to /var/cache/conftool/dbconfig/20221028-133538-marostegui.json
[13:38:53] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Superset for Hibashaath - https://phabricator.wikimedia.org/T321903 (KCVelaga_WMF)
[13:39:56] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Superset for Hibashaath - https://phabricator.wikimedia.org/T321903 (TAndic) Approving this request as @HShaath-WMF's direct manager.
[13:47:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[13:50:29] <wikibugs>	 (CR) Ladsgroup: "I'd be happy to merge this once okay'ed with the community under the condition that I'll block adding any more wikis to this system. The a" [puppet] - https://gerrit.wikimedia.org/r/850446 (https://phabricator.wikimedia.org/T300064) (owner: Matthias Mullie)
[13:50:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37021 and previous config saved to /var/cache/conftool/dbconfig/20221028-135045-marostegui.json
[13:52:16] <wikibugs>	 SRE, Infrastructure-Foundations: Puppet should support VERSION_CODENAME to detect a distro - https://phabricator.wikimedia.org/T321906 (MoritzMuehlenhoff)
[13:58:31] <wikibugs>	 SRE, Infrastructure-Foundations: Puppet should support VERSION_CODENAME to detect a distro - https://phabricator.wikimedia.org/T321906 (MoritzMuehlenhoff)
[13:58:43] <wikibugs>	 (PS1) Ssingh: cp4052: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850483 (https://phabricator.wikimedia.org/T317244)
[14:00:18] <wikibugs>	 (CR) Andrew Bogott: [C: +1] global: replace labsproject by wmcs_project [puppet] - https://gerrit.wikimedia.org/r/849473 (owner: David Caro)
[14:04:10] <wikibugs>	 SRE, Traffic: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (Vgutierrez) Resolved→Open reopening this as I've found some issues:  metric names aren't consistent with existing ones, all the previous metrics are named using...
[14:05:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P37022 and previous config saved to /var/cache/conftool/dbconfig/20221028-140552-marostegui.json
[14:05:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[14:05:59] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[14:06:07] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[14:06:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T321123)', diff saved to https://phabricator.wikimedia.org/P37023 and previous config saved to /var/cache/conftool/dbconfig/20221028-140613-marostegui.json
[14:06:56] <wikibugs>	 (PS1) Muehlenhoff: Add support for bookworm to apt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/850484 (https://phabricator.wikimedia.org/T321783)
[14:07:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[14:08:29] <wikibugs>	 SRE, Traffic: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (Vgutierrez) oh, and we're seeing some errors like: ` Oct 27 02:06:46 cp4043 prometheus-ats-config[1787]: Traffic Server: failed to fetch proxy.config.net.max_connection...
[14:09:11] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T321123)', diff saved to https://phabricator.wikimedia.org/P37024 and previous config saved to /var/cache/conftool/dbconfig/20221028-140952-marostegui.json
[14:10:09] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Superset for <Hibashaath> - https://phabricator.wikimedia.org/T321902 (Zabe)
[14:10:11] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Superset for Hibashaath - https://phabricator.wikimedia.org/T321903 (Zabe)
[14:14:19] <wikibugs>	 (PS2) Vgutierrez: mtail: Provide trafficserver_backend_cache_result_code_client_ttfb [puppet] - https://gerrit.wikimedia.org/r/850481 (https://phabricator.wikimedia.org/T321484)
[14:15:55] <wikibugs>	 (PS3) Vgutierrez: mtail: Provide trafficserver_backend_cache_result_code_client_ttfb [puppet] - https://gerrit.wikimedia.org/r/850481 (https://phabricator.wikimedia.org/T321484)
[14:20:05] <wikibugs>	 (CR) Vgutierrez: [C: +2] mtail: Provide trafficserver_backend_cache_result_code_client_ttfb [puppet] - https://gerrit.wikimedia.org/r/850481 (https://phabricator.wikimedia.org/T321484) (owner: Vgutierrez)
[14:20:59] <wikibugs>	 (PS2) BBlack: Add digicert-2022 to available unified set [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328)
[14:21:01] <wikibugs>	 (PS2) BBlack: Switch drmrs, eqsin, esams to digicert-2022 [puppet] - https://gerrit.wikimedia.org/r/850287 (https://phabricator.wikimedia.org/T313328)
[14:21:17] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 13.87 ms
[14:21:24] <wikibugs>	 (CR) BBlack: Add digicert-2022 to available unified set (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: BBlack)
[14:22:58] <wikibugs>	 (CR) Jbond: [C: +2] C:statistics::rsyncd: use the nobody user explicitly [puppet] - https://gerrit.wikimedia.org/r/850233 (owner: Jbond)
[14:24:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37025 and previous config saved to /var/cache/conftool/dbconfig/20221028-142459-marostegui.json
[14:26:27] <wikibugs>	 (CR) BBlack: [C: +2] "PCC says no-op on cache nodes, as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/37838/" [puppet] - https://gerrit.wikimedia.org/r/849632 (owner: BBlack)
[14:29:49] <wikibugs>	 (CR) Thcipriani: [C: +1] data.yaml: Move user mfossati from restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/850409 (https://phabricator.wikimedia.org/T321772) (owner: Slyngshede)
[14:30:32] <wikibugs>	 (PS1) Muehlenhoff: Add a stub base file for bookworm [puppet] - https://gerrit.wikimedia.org/r/850487
[14:31:25] <wikibugs>	 (CR) Ladsgroup: Schedule image suggestions for ca, no, fi & huwiki (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850446 (https://phabricator.wikimedia.org/T300064) (owner: Matthias Mullie)
[14:32:40] <wikibugs>	 (CR) CI reject: [V: -1] Add a stub base file for bookworm [puppet] - https://gerrit.wikimedia.org/r/850487 (owner: Muehlenhoff)
[14:33:09] <wikibugs>	 SRE, Traffic, Performance-Team (Radar): Track TTFB per Cache Status Code in ATS - https://phabricator.wikimedia.org/T321484 (Vgutierrez) Open→Resolved a:Vgutierrez
[14:35:36] <wikibugs>	 (PS2) Herron: slo_dashboards: move slo definitions and defaults to files [grafana-grizzly] - https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749)
[14:36:10] <wikibugs>	 (PS2) Muehlenhoff: Add a stub base file for bookworm [puppet] - https://gerrit.wikimedia.org/r/850487
[14:37:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to plain
[14:38:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to plain
[14:38:14] <wikibugs>	 (CR) Herron: [C: +2] slo_dashboards: move slo definitions and defaults to files (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[14:38:16] <wikibugs>	 (CR) Herron: [V: +2 C: +2] slo_dashboards: move slo definitions and defaults to files [grafana-grizzly] - https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[14:40:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37026 and previous config saved to /var/cache/conftool/dbconfig/20221028-144005-marostegui.json
[14:40:11] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-etcd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:40:51] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:wmcs::instance: stop provisioning /etc/wmflabs-* on bookworm [puppet] - https://gerrit.wikimedia.org/r/840791 (owner: Majavah)
[14:42:45] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[14:43:53] <wikibugs>	 (CR) Jbond: [C: +1] dumps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850471 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:44:05] <wikibugs>	 (CR) Jbond: [C: +1] base: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850472 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:45:00] <wikibugs>	 (CR) Jbond: [V: +1] "@Daniel ill leave this for you to merge along with your change" [puppet] - https://gerrit.wikimedia.org/r/850153 (owner: Jbond)
[14:45:01] <wikibugs>	 SRE, Observability-Metrics: SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (herron) Open→Resolved I think this is resolvable at this point.  Please reopen if I am mistaken!
[14:48:05] <wikibugs>	 (PS1) Jbond: C:ldap::client::utils: migrate to debian::codename function [puppet] - https://gerrit.wikimedia.org/r/850498 (https://phabricator.wikimedia.org/T321906)
[14:48:07] <wikibugs>	 (PS1) Jbond: C:debian: add support for testing [puppet] - https://gerrit.wikimedia.org/r/850499 (https://phabricator.wikimedia.org/T321906)
[14:49:55] <wikibugs>	 (PS2) Muehlenhoff: hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850477 (https://phabricator.wikimedia.org/T308013)
[14:51:01] <wikibugs>	 (CR) Jbond: [V: +1] "fyi ill also leave my change to be merged along with this change by whoever merges" [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[14:51:44] <wikibugs>	 (CR) Herron: slo_dashboards: move to one SLO/SLI per dashboard (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[14:52:08] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (Jclark-ctr) @Jgreen  This still shows active in netbox. Please update to decommission when it's ready
[14:52:12] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (Jclark-ctr) @Jgreen  This still shows active in netbox. Please update to decommission when it's ready
[14:52:33] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/850484 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[14:52:35] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (Jclark-ctr) a:Cmjohnson→Jclark-ctr
[14:52:55] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (Jclark-ctr) a:Cmjohnson→Jclark-ctr
[14:53:11] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster
[14:53:16] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good, thanks" [puppet] - https://gerrit.wikimedia.org/r/850470 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[14:55:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T321123)', diff saved to https://phabricator.wikimedia.org/P37027 and previous config saved to /var/cache/conftool/dbconfig/20221028-145512-marostegui.json
[14:55:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[14:55:19] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[14:55:27] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM however im not sure we ever used this and wonder if we should just remove it?  i think i originally wanted t5o use it to detect manua" [puppet] - https://gerrit.wikimedia.org/r/850487 (owner: Muehlenhoff)
[14:55:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[14:55:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T321123)', diff saved to https://phabricator.wikimedia.org/P37028 and previous config saved to /var/cache/conftool/dbconfig/20221028-145533-marostegui.json
[14:56:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (Jclark-ctr) a:Cmjohnson→Jclark-ctr
[14:56:37] <wikibugs>	 SRE, serviceops, Continuous-Integration-Config, Release-Engineering-Team (Blocking 🧱), Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (thcipriani)
[14:56:47] <wikibugs>	 (PS1) Majavah: P:wmcs::metricsinfra: drop unused variable [puppet] - https://gerrit.wikimedia.org/r/850503
[14:56:51] <wikibugs>	 (PS1) Majavah: hieradata: replace metricsinfra prometheus01 [puppet] - https://gerrit.wikimedia.org/r/850504 (https://phabricator.wikimedia.org/T310799)
[14:57:10] <wikibugs>	 (CR) Jbond: [C: +2] C:ldap::client::utils: migrate to debian::codename function [puppet] - https://gerrit.wikimedia.org/r/850498 (https://phabricator.wikimedia.org/T321906) (owner: Jbond)
[14:57:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (Jclark-ctr) I am taking over this ticket  @nskaggs  what day of the week works best for you to do this move?
[14:57:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T321123)', diff saved to https://phabricator.wikimedia.org/P37029 and previous config saved to /var/cache/conftool/dbconfig/20221028-145746-marostegui.json
[14:58:36] <wikibugs>	 (CR) Andrew Bogott: [C: +2] hieradata: replace metricsinfra prometheus01 [puppet] - https://gerrit.wikimedia.org/r/850504 (https://phabricator.wikimedia.org/T310799) (owner: Majavah)
[14:59:24] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:wmcs::metricsinfra: drop unused variable [puppet] - https://gerrit.wikimedia.org/r/850503 (owner: Majavah)
[14:59:47] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:00:11] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] gitlab_runner: enable restrict_firewall for Shared Runners [puppet] - https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: Jelto)
[15:00:24] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (thcipriani) p:Medium→Low a:hashar
[15:01:14] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (thcipriani) @hashar is this waiting on review, or are y...
[15:03:09] <wikibugs>	 (PS1) Muehlenhoff: Stop installing the base packages list for now [puppet] - https://gerrit.wikimedia.org/r/850508
[15:03:12] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37841/console" [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: BBlack)
[15:05:08] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[15:05:55] <wikibugs>	 (CR) Muehlenhoff: Stop installing the base packages list for now (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850508 (owner: Muehlenhoff)
[15:09:02] <wikibugs>	 (CR) Jbond: [C: +1] Stop installing the base packages list for now (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850508 (owner: Muehlenhoff)
[15:10:57] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4045 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[15:12:17] <wikibugs>	 (CR) Jcrespo: [C: +2] mediabackup: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850475 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:12:24] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster
[15:12:26] <wikibugs>	 (PS2) Jcrespo: mediabackup: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850475 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[15:12:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37030 and previous config saved to /var/cache/conftool/dbconfig/20221028-151252-marostegui.json
[15:23:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr) @cmjohnson looked at druid10[09-11] bios has not been configured yet.   no ip address in set for idrac have you ran the...
[15:23:37] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4047 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[15:23:55] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4049 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[15:24:15] <wikibugs>	 (03PS6) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108)
[15:24:42] <icinga-wm>	 RECOVERY - Confd vcl based reload on cp4037 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish
[15:28:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37031 and previous config saved to /var/cache/conftool/dbconfig/20221028-152800-marostegui.json
[15:28:16] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Hibashaath - https://phabricator.wikimedia.org/T321902 (10KCVelaga_WMF)
[15:28:59] <wikibugs>	 (03PS1) 10Ssingh: Revert "Depool ulsfo for cp hosts hardware refresh" [dns] - 10https://gerrit.wikimedia.org/r/850436
[15:29:49] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Hghani - https://phabricator.wikimedia.org/T321910 (10CMyrick-WMF) Approving this request as @Hghani 's direct manager.
[15:32:12] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] hieradata: replace metricsinfra prometheus01 [puppet] - 10https://gerrit.wikimedia.org/r/850504 (https://phabricator.wikimedia.org/T310799) (owner: 10Majavah)
[15:34:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:35:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10nskaggs) >>! In T313445#8352940, @Jclark-ctr wrote: > I am taking over this ticket  @nskaggs  what day of the week works best for you to do this...
[15:40:54] <wikibugs>	 (03PS1) 10Jdlrobson: ReadingLists on beta cluster for authenticated users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850516 (https://phabricator.wikimedia.org/T317935)
[15:43:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T321123)', diff saved to https://phabricator.wikimedia.org/P37033 and previous config saved to /var/cache/conftool/dbconfig/20221028-154307-marostegui.json
[15:43:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[15:43:14] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[15:43:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[15:43:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T321123)', diff saved to https://phabricator.wikimedia.org/P37034 and previous config saved to /var/cache/conftool/dbconfig/20221028-154328-marostegui.json
[15:45:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T321123)', diff saved to https://phabricator.wikimedia.org/P37035 and previous config saved to /var/cache/conftool/dbconfig/20221028-154541-marostegui.json
[15:49:44] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[15:49:52] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp4052.ulsfo.wmnet with OS buster
[15:50:24] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster
[15:50:28] <wikibugs>	 (03PS1) 10CDanis: Remove expensive newconnrate logging & tweak concurrency [puppet] - 10https://gerrit.wikimedia.org/r/850517 (https://phabricator.wikimedia.org/T306580)
[15:50:30] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp4052.ulsfo.wmnet with OS buster executed with errors: - cp4052 (**FAI...
[15:50:54] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[15:53:16] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] "pcc lgtm https://puppet-compiler.wmflabs.org/pcc-worker1003/37842/cp3050.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/850517 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis)
[15:58:25] <icinga-wm>	 PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:00:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37036 and previous config saved to /var/cache/conftool/dbconfig/20221028-160047-marostegui.json
[16:02:17] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster
[16:03:24] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED
[16:04:04] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED
[16:04:18] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[16:04:44] <cjming>	 jouncebot: now
[16:04:44] <jouncebot>	 For the next 14 hour(s) and 55 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221028T0700)
[16:05:28] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:06:03] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED
[16:07:07] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED
[16:08:08] <thcipriani>	 cjming: for a labs only change to operations/mediawiki-config, merging on a Friday is fine. Just make sure it's fetched down to the deployment server so there's no surprises for deployers on Monday (running "scap backport <change>" will do this automagically now)
[16:08:42] <cjming>	 thcipriani: thanks - appreciate it
[16:10:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850516 (https://phabricator.wikimedia.org/T317935) (owner: 10Jdlrobson)
[16:11:33] <wikibugs>	 (03Merged) 10jenkins-bot: ReadingLists on beta cluster for authenticated users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850516 (https://phabricator.wikimedia.org/T317935) (owner: 10Jdlrobson)
[16:13:17] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[16:14:24] <cjming>	 !log deployed ReadingLists on beta cluster for authenticated users - https://gerrit.wikimedia.org/r/850516 (https://phabricator.wikimedia.org/T317935)
[16:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:45] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:15:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37037 and previous config saved to /var/cache/conftool/dbconfig/20221028-161555-marostegui.json
[16:16:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[16:17:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[16:17:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[16:18:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[16:21:01] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster
[16:22:21] <icinga-wm>	 PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:22:38] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:24:02] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:27:18] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[16:27:26] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4052.ulsfo.wmnet with OS buster
[16:28:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:29:11] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[16:29:19] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4052.ulsfo.wmnet with OS buster
[16:31:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T321123)', diff saved to https://phabricator.wikimedia.org/P37038 and previous config saved to /var/cache/conftool/dbconfig/20221028-163102-marostegui.json
[16:31:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[16:31:09] <stashbot>	 T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[16:31:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[16:31:33] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster
[16:35:36] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10RobH)
[16:38:06] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@62b4181]: testing scap since we are having problems with other instances
[16:38:11] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@62b4181]: testing scap since we are having problems with other instances (duration: 00m 04s)
[16:40:09] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] cp4052: update site.pp and related configs for cp (upload) role [puppet] - 10https://gerrit.wikimedia.org/r/850483 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh)
[16:42:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] thanos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850469 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[16:42:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Add support for bookworm to apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/850484 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff)
[16:46:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) @Jclark-ctr Monday 31st would be fine, or any other day except Tuesday.  If I understand correctly, we need to depool dbproxy1019 and wai...
[16:47:11] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2068 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:52:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[16:54:35] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:57:25] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[16:58:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:59:57] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2068 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:02:31] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[17:07:01] <logmsgbot>	 !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@c849762]: (no justification provided)
[17:07:12] <logmsgbot>	 !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@c849762]: (no justification provided) (duration: 00m 11s)
[17:09:40] <logmsgbot>	 !log cparle@deploy1002 Started deploy [airflow-dags/platform_eng@c849762]: (no justification provided)
[17:09:45] <logmsgbot>	 !log cparle@deploy1002 Finished deploy [airflow-dags/platform_eng@c849762]: (no justification provided) (duration: 00m 05s)
[17:10:35] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[17:12:59] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:12:59] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[17:28:09] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS buster
[17:30:22] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2068 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:31:06] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4052.ulsfo.wmnet,service=ats-be
[17:31:06] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4052.ulsfo.wmnet,service=ats-tls
[17:31:07] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4052.ulsfo.wmnet,service=varnish-fe
[17:31:09] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet
[17:33:06] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ssingh)
[17:33:25] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ssingh)
[17:35:20] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ssingh) Traffic update: all the new cp hosts in ulsfo are marked active and pooled. Rob: Feel free to mark this as resolved.  Thanks to @RobH, @Papaul, @BBlack, @c...
[17:36:14] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10RobH) 05Open→03Resolved
[17:37:31] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[17:42:31] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[17:46:28] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] Revert "Depool ulsfo for cp hosts hardware refresh" [dns] - 10https://gerrit.wikimedia.org/r/850436 (owner: 10Ssingh)
[17:47:31] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[17:50:07] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Ilooremeta - https://phabricator.wikimedia.org/T321918 (10KCVelaga_WMF) Approved from the team's side. @CMacholan can chime in if additional approval is necessary.
[17:52:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[17:54:46] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:55:20] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:00:10] <icinga-wm>	 RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:00:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:01:00] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:01:59] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] readme: Add general notes for testing deps [software/acme-chief] - 10https://gerrit.wikimedia.org/r/848512 (owner: 10BCornwall)
[18:08:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[18:08:32] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[18:11:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[18:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:28:10] <wikibugs>	 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q2), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10lmata)
[18:30:12] <wikibugs>	 10SRE, 10Icinga, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q2): PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (10lmata)
[18:30:17] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q2): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata)
[18:32:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[18:37:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] "I have hunted for a while but not found an official way to do this via neutron." [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah)
[18:44:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[18:52:31] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@2326f9c]: Import cirrus indexes to hdfs
[18:54:39] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@2326f9c]: Import cirrus indexes to hdfs (duration: 02m 07s)
[18:59:15] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:04:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "looks easy enough but still makes me wonder if it will need manual actions on each scap::master host when the repo remote changes" [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy)
[19:06:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "well ok, there are just 2 scap::masters in prod, so can do. not sure about cloud though" [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy)
[19:11:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] scap::master: Clone the scap repo from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy)
[19:14:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] git::clone: Append .git to clone url for gitlab source [puppet] - 10https://gerrit.wikimedia.org/r/850249 (owner: 10Ahmon Dancy)
[19:17:38] <mutante>	 !log contint* - changing source for scap repo to gitlab - gerrit:850246 T321847
[19:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:44] <stashbot>	 T321847: Update scap documentation and other references for new GitLab location - https://phabricator.wikimedia.org/T321847
[19:19:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "this was a complete noop on contint* servers. I did not make any manual changes to the repo/remote config. The real test will be when some" [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy)
[19:20:00] <dancy>	 mutante: Thanks!
[19:21:52] <mutante>	 no problem. so yea, the real test would be when there is an actual change in the scap repo
[19:22:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[19:22:18] <wikibugs>	 (03CR) 10Ahmon Dancy: scap::master: Clone the scap repo from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy)
[19:22:33] <mutante>	 also it's "ensure present" not "latest", so puppet does not pull
[19:22:42] <mutante>	 so whatever/whoever does the pull 
[19:23:17] <mutante>	 will see if it causes anything
[19:24:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] scap::master: Clone the scap repo from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy)
[19:27:53] <dancy>	 I'm not even sure what uses /srv/deployment/scap anymore.  
[19:28:15] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850468 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[19:28:35] <mutante>	 ok :)
[19:28:39] <mutante>	 good enough
[19:32:12] <dancy>	 Are https://gerrit.wikimedia.org/r/c/operations/puppet/+/850153/ and https://gerrit.wikimedia.org/r/c/operations/puppet/+/849699/ on your radar for today?
[19:36:17] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37844/parse2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/850474 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[19:36:30] <wikibugs>	 (03PS2) 10Dzahn: parsoid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850474 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[19:41:39] <mutante>	 yes, but I really know it when I get to it :)
[19:42:13] <dancy>	 haha. ok.  I'm going to go afk for a bit.  I will check back later.
[20:03:48] <wikibugs>	 (03PS1) 10Andrew Bogott: wmfkeystonehooks: fix a copy/paste error [puppet] - 10https://gerrit.wikimedia.org/r/850539 (https://phabricator.wikimedia.org/T288108)
[20:03:50] <wikibugs>	 (03PS1) 10Andrew Bogott: wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108)
[20:05:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott)
[20:07:06] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:08:36] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: fix a copy/paste error [puppet] - 10https://gerrit.wikimedia.org/r/850539 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott)
[20:09:54] <wikibugs>	 (03PS2) 10Andrew Bogott: wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108)
[20:10:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott)
[20:12:25] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] "ports to ports, ips to ips :-)" [puppet] - 10https://gerrit.wikimedia.org/r/850539 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott)
[20:14:03] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] admin: add appledora to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845033 (https://phabricator.wikimedia.org/T321086) (owner: 10Herron)
[20:16:05] <wikibugs>	 (03CR) 10Nskaggs: [C: 03+1] wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott)
[20:19:20] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.14 ms
[20:20:23] <wikibugs>	 (03PS3) 10Andrew Bogott: wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108)
[20:22:50] <wikibugs>	 (03PS1) 10Dzahn: devtools: set profile::gitlab::runner::registration_token: private [puppet] - 10https://gerrit.wikimedia.org/r/850541
[20:26:37] <wikibugs>	 (03PS1) 10Dzahn: devtools: add profile::phabricator::main::dumps_rsync_clients: [] [puppet] - 10https://gerrit.wikimedia.org/r/850542
[20:27:59] <wikibugs>	 (03PS2) 10Dzahn: devtools: add profile::phabricator::main::dumps_rsync_clients: [] [puppet] - 10https://gerrit.wikimedia.org/r/850542 (https://phabricator.wikimedia.org/T313360)
[20:28:07] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder-backup: apply backup api patch to cloudcontrol as well as backup server [puppet] - 10https://gerrit.wikimedia.org/r/850543
[20:30:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] devtools: add profile::phabricator::main::dumps_rsync_clients: [] [puppet] - 10https://gerrit.wikimedia.org/r/850542 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn)
[20:31:48] <wikibugs>	 (03PS2) 10Andrew Bogott: cinder-backup: apply backup api patch to cloudcontrol as well as backup server [puppet] - 10https://gerrit.wikimedia.org/r/850543
[20:32:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] dumps: switch kiwix download host to master.download.kiwix.org [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn)
[20:33:45] <wikibugs>	 (03PS3) 10Andrew Bogott: cinder-backup: apply backup api patch to cloudcontrol as well as backup server [puppet] - 10https://gerrit.wikimedia.org/r/850543
[20:34:30] <wikibugs>	 (03PS1) 10JHathaway: aux-k8s: dummy certs & tokens [labs/private] - 10https://gerrit.wikimedia.org/r/850545
[20:35:10] <wikibugs>	 (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1001/37848/" [puppet] - 10https://gerrit.wikimedia.org/r/850543 (owner: 10Andrew Bogott)
[20:35:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "/usr/local/bin/kiwix-rsync-cron.sh has been edited by puppet on clouddumps1001" [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn)
[20:35:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "@nskaggs: noteworthy on clouddumps1001: Notice: /Stage[main]/Ferm/Service[ferm]/ensure: ensure changed 'stopped' to 'running' (corrective)" [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn)
[20:37:55] <mutante>	 !log clouddumps1001 - puppet run after merging gerrit:848441 for kiwix, changed ferm status from "stopped" to "running". manually ran 'sudo systemctl start kiwix-mirror-update' T57503
[20:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:05] <stashbot>	 T57503: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503
[20:38:38] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: dummy certs & tokens [labs/private] - https://gerrit.wikimedia.org/r/850545 (owner: JHathaway)
[20:38:41] <wikibugs>	 (CR) JHathaway: [V: +2 C: +2] aux-k8s: dummy certs & tokens [labs/private] - https://gerrit.wikimedia.org/r/850545 (owner: JHathaway)
[20:39:18] <wikibugs>	 (CR) Dzahn: [C: +2] "manually started the dumps service. saw no problems: Main PID: 2688062 (code=exited, status=0/SUCCESS)" [puppet] - https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) (owner: Dzahn)
[20:39:51] <wikibugs>	 (PS3) Dzahn: dumps: add sister projects to kiwix dumps rsync [puppet] - https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503)
[20:40:31] <wikibugs>	 (CR) Dzahn: [C: +2] dumps: add sister projects to kiwix dumps rsync [puppet] - https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) (owner: Dzahn)
[20:42:39] <wikibugs>	 (PS1) JHathaway: aux-k8s: ctrl & wrkr roles [puppet] - https://gerrit.wikimedia.org/r/850586 (https://phabricator.wikimedia.org/T321120)
[20:42:50] <mutante>	 !log clouddumps* - deployed gerrit:848444 - as kind of expected it fails - most likely the project dirs are not automatically created before rsync runs the first time - T57503
[20:42:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:56] <wikibugs>	 (PS2) JHathaway: aux-k8s: ctrl & wrkr roles [puppet] - https://gerrit.wikimedia.org/r/850586 (https://phabricator.wikimedia.org/T321120)
[20:44:00] <wikibugs>	 (CR) Dzahn: [C: +2] "starting the service manually after this change fails - most likely because the project base dirs are not created automatically before rsy" [puppet] - https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) (owner: Dzahn)
[20:45:10] <wikibugs>	 (PS4) Andrew Bogott: cinder-backup: apply backup api patch to cloudcontrol as well as backup server [puppet] - https://gerrit.wikimedia.org/r/850543 (https://phabricator.wikimedia.org/T321948)
[20:45:15] <wikibugs>	 (CR) Dzahn: [C: +2] "eh..no..it's: kiwix-rsync-cron.sh: line 58: $: syntax error: operand expected (error" [puppet] - https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) (owner: Dzahn)
[20:45:40] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/850586 (https://phabricator.wikimedia.org/T321120) (owner: JHathaway)
[20:45:44] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:51:32] <wikibugs>	 (CR) Nskaggs: [C: +1] cinder-backup: apply backup api patch to cloudcontrol as well as backup server [puppet] - https://gerrit.wikimedia.org/r/850543 (https://phabricator.wikimedia.org/T321948) (owner: Andrew Bogott)
[20:57:56] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[20:59:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:03:16] <icinga-wm>	 PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:04:47] <wikibugs>	 (PS1) Dzahn: dumps: fix syntax error in kiwix-rsync-cron.sh [puppet] - https://gerrit.wikimedia.org/r/850588 (https://phabricator.wikimedia.org/T57503)
[21:05:51] <wikibugs>	 (CR) Dzahn: [C: +2] dumps: fix syntax error in kiwix-rsync-cron.sh [puppet] - https://gerrit.wikimedia.org/r/850588 (https://phabricator.wikimedia.org/T57503) (owner: Dzahn)
[21:06:13] <dancy>	 mutante: I'm lurking again
[21:10:12] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: ctrl & wrkr roles [puppet] - https://gerrit.wikimedia.org/r/850586 (https://phabricator.wikimedia.org/T321120) (owner: JHathaway)
[21:12:38] <wikibugs>	 SRE, Cloud-Services, Datasets-General-or-Unknown, Patch-For-Review, and 2 others: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Dzahn) @Kelson @nskaggs sync in progress ^
[21:13:25] <wikibugs>	 SRE, Cloud-Services, Datasets-General-or-Unknown, Patch-For-Review, and 2 others: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Dzahn) I deployed the changes above, a little bugfix follow-up, started the sync service manually.  actual command now running on c...
[21:16:55] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "thanks! https://puppet-compiler.wmflabs.org/pcc-worker1001/37850/registry2003.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/850153 (owner: Jbond)
[21:18:49] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "noop confirmed on registry1003/registry2003" [puppet] - https://gerrit.wikimedia.org/r/850153 (owner: Jbond)
[21:19:01] <wikibugs>	 (PS10) Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629)
[21:21:03] <wikibugs>	 (CR) Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[21:25:06] <wikibugs>	 (CR) Dzahn: "oh.. so.. the gitlab_runners are IPs and the contint hosts are DNS names. so the code is " @resolve((10.64.16.105 10.64.32.184 10.64.48.14" [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[21:26:46] <icinga-wm>	 RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:27:28] <wikibugs>	 (PS1) JHathaway: aux-k8s: partman config for aux-k8s-ctrl [puppet] - https://gerrit.wikimedia.org/r/850590 (https://phabricator.wikimedia.org/T321137)
[21:27:53] <wikibugs>	 (CR) Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[21:28:06] <wikibugs>	 (CR) Dzahn: [C: +2] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[21:29:27] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: partman config for aux-k8s-ctrl [puppet] - https://gerrit.wikimedia.org/r/850590 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[21:30:19] <wikibugs>	 (CR) Dzahn: [C: +2] "deployed on doc1002, ferm was restarted (noop on doc2001)" [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[21:30:58] <mutante>	 dancy: iptables and rsyncd config should now allow gitlab-runners in addition to contint*. done
[21:31:12] <dancy>	 Thank you! I'll try to test now
[21:31:24] <mutante>	 it mixed IPs and host names and put a "resolve" around it.. but that did not seem to be a problem
[21:31:46] <mutante>	 also mixed IP and host names in 'hosts allow' of rsyncd
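(editor's note) A minimal sketch of why a mixed list of IP literals and host names can be harmless to a resolve step: literals pass through unchanged and only names are looked up. This is an illustration in Python, not ferm's or rsyncd's actual implementation; the function name `resolve_mixed` and the injectable `resolver` parameter are hypothetical.

```python
import ipaddress
import socket


def resolve_mixed(entries, resolver=socket.gethostbyname_ex):
    """Return a sorted list of IP strings for a mix of IP literals and host names.

    Hypothetical helper for illustration only.
    """
    addresses = set()
    for entry in entries:
        try:
            # IP literals are accepted as-is (normalised by the ipaddress module).
            addresses.add(str(ipaddress.ip_address(entry)))
        except ValueError:
            # Anything that is not an IP literal is treated as a host name.
            _name, _aliases, ips = resolver(entry)
            addresses.update(ips)
    # Plain string sort, not numeric ordering of addresses.
    return sorted(addresses)
```

Passing a fake `resolver` keeps the sketch testable without DNS access.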
[21:33:04] <wikibugs>	 (CR) Dzahn: [C: +2] doc: add parameters for gitlab_runner and contint hosts, allow them (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[21:36:05] <wikibugs>	 (PS1) JHathaway: aux-k8s: dhcp config for aux-k8s-ctrl [puppet] - https://gerrit.wikimedia.org/r/850591 (https://phabricator.wikimedia.org/T321137)
[21:37:24] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: dhcp config for aux-k8s-ctrl [puppet] - https://gerrit.wikimedia.org/r/850591 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[21:39:10] <wikibugs>	 (PS1) Andrew Bogott: add wmcs-securitygroup-backfill [puppet] - https://gerrit.wikimedia.org/r/850592 (https://phabricator.wikimedia.org/T288108)
[21:40:12] <mutante>	 dancy: thanks. it's doc1002 (not 1001 or 200x) fwiw
[21:40:59] <wikibugs>	 (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37854/cloudcontrol1005.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/850592 (https://phabricator.wikimedia.org/T288108) (owner: Andrew Bogott)
[21:43:12] <dancy>	 mutante: I've run out of time.  I'll verify on Monday.
[21:43:27] <mutante>	 dancy: alright, good weekend!
[21:47:51] <wikibugs>	 (PS1) Dzahn: ci: move list of contint and zuul hosts to hierdata/common.yaml [puppet] - https://gerrit.wikimedia.org/r/850593
[21:48:09] <wikibugs>	 (PS2) Dzahn: ci: move list of contint and zuul hosts to hieradata/common.yaml [puppet] - https://gerrit.wikimedia.org/r/850593
[21:48:35] <dancy>	 You too!
[21:48:50] <wikibugs>	 (CR) Dzahn: [C: +2] doc: add parameters for gitlab_runner and contint hosts, allow them (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[21:50:28] <wikibugs>	 (CR) CI reject: [V: -1] ci: move list of contint and zuul hosts to hieradata/common.yaml [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[21:51:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:52:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:56:31] <wikibugs>	 (PS1) JHathaway: aux-k8s: fix typo in pool :( [puppet] - https://gerrit.wikimedia.org/r/850595 (https://phabricator.wikimedia.org/T321137)
[21:57:13] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: fix typo in pool :( [puppet] - https://gerrit.wikimedia.org/r/850595 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[22:01:11] <mutante>	 bash: rsync: command not found <-- this would be a problem for rsyncing.
[22:04:36] <wikibugs>	 (PS1) Dzahn: gitlab::runner: install rsync package [puppet] - https://gerrit.wikimedia.org/r/850597 (https://phabricator.wikimedia.org/T321629)
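(editor's note) The "rsync: command not found" failure above is the kind of problem a small pre-flight check can surface before a job runs. A minimal sketch in Python, assuming the check runs on the host that will call rsync; the helper name `missing_commands` is hypothetical and is not part of the patch in gerrit:850597.

```python
import shutil


def missing_commands(commands):
    """Return the entries of `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    # Report any required tool that is absent on this host.
    for cmd in missing_commands(["rsync"]):
        print(f"required command not found: {cmd}")
```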
[22:06:37] <wikibugs>	 (CR) Dzahn: [C: -1] "parameter 'contint_hosts' expects an Array value, got String" [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[22:07:32] <wikibugs>	 (CR) Dzahn: [C: -1] "shouldn't the data type stay the same when I simply move things around?" [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
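(editor's note) The log does not say why the moved value arrived as a String. As a rough illustration of what such a type check does, here is a Python sketch that mimics the wording of the error; it is not Puppet's type system, and `expect_array` is a hypothetical name. It shows that a single host written as a bare string fails where a one-element list passes.

```python
def expect_array(name, value):
    """Return `value` if it is a list, else raise TypeError with a Puppet-like message."""
    if not isinstance(value, list):
        # Label strings the way the logged error does; other types use their Python name.
        kind = "String" if isinstance(value, str) else type(value).__name__
        raise TypeError(f"parameter '{name}' expects an Array value, got {kind}")
    return value
```

For example, `expect_array("contint_hosts", ["host.example"])` returns the list, while `expect_array("contint_hosts", "host.example")` raises.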
[22:13:06] <wikibugs>	 SRE, Data-Engineering-Operations, Data-Engineering-Planning, Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (Dzahn) nice, works for me. thanks @BTullis
[22:20:47] <wikibugs>	 (PS1) JHathaway: Revert "aux-k8s: fix typo in pool :(" [puppet] - https://gerrit.wikimedia.org/r/850568
[22:23:05] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (Dzahn) @fgiunchedi I think both would be fine, either just don't worry about the duplicate part. I don't see it as a big problem. Or follow the sugg...
[22:23:20] <wikibugs>	 (CR) JHathaway: [C: +2] Revert "aux-k8s: fix typo in pool :(" [puppet] - https://gerrit.wikimedia.org/r/850568 (owner: JHathaway)
[22:33:47] <wikibugs>	 (PS1) JHathaway: aux-k8s: disable lvs [puppet] - https://gerrit.wikimedia.org/r/850604 (https://phabricator.wikimedia.org/T321120)
[22:34:42] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: disable lvs [puppet] - https://gerrit.wikimedia.org/r/850604 (https://phabricator.wikimedia.org/T321120) (owner: JHathaway)
[22:53:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:53:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[22:53:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on aux-k8s-ctrl1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-ctrl1002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:58:10] <wikibugs>	 SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Dzahn) see https://dumps.wikimedia.org/kiwix/zim/ now
[22:58:46] <wikibugs>	 SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Dzahn) Open→In progress
[22:58:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Dzahn)
[22:58:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:59:05] <wikibugs>	 SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Dzahn) a:Dzahn
[23:00:55] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[23:22:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient