[00:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [00:00:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS buster [00:09:41] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:10:29] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [00:11:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P36934 and previous config saved to /var/cache/conftool/dbconfig/20221028-001124-ladsgroup.json [00:22:55] PROBLEM - Check systemd state on ms-be1067 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:59] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:26:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage [00:26:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T318950)', diff saved to https://phabricator.wikimedia.org/P36935 and previous config saved to /var/cache/conftool/dbconfig/20221028-002631-ladsgroup.json [00:26:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [00:26:37] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [00:26:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with 
reason: Maintenance [00:26:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [00:27:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [00:27:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T318950)', diff saved to https://phabricator.wikimedia.org/P36936 and previous config saved to /var/cache/conftool/dbconfig/20221028-002708-ladsgroup.json [00:28:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T318950)', diff saved to https://phabricator.wikimedia.org/P36937 and previous config saved to /var/cache/conftool/dbconfig/20221028-002816-ladsgroup.json [00:29:55] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (20) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, clouddumps1001, clouddumps1002, cp4040, phab1004, releases1002, releases2002, relforge1003, relforge1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [00:31:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage [00:38:59] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:39:43] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1067 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:43:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to 
https://phabricator.wikimedia.org/P36938 and previous config saved to /var/cache/conftool/dbconfig/20221028-004322-ladsgroup.json [00:48:44] (PS1) Ssingh: cp4048: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850307 (https://phabricator.wikimedia.org/T317244) [00:49:11] RECOVERY - Check systemd state on ms-be1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4040.ulsfo.wmnet with OS buster [00:58:13] (CR) Ssingh: [C: +2] cp4048: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850307 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh) [00:58:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P36939 and previous config saved to /var/cache/conftool/dbconfig/20221028-005829-ladsgroup.json [00:58:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS buster [00:59:06] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4048.ulsfo.wmnet with OS buster [01:00:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS buster [01:09:51] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1067 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [01:13:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T318950)', diff saved to https://phabricator.wikimedia.org/P36940 and previous config saved to /var/cache/conftool/dbconfig/20221028-011335-ladsgroup.json [01:13:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: 
Maintenance [01:13:41] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [01:13:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [01:13:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T318950)', diff saved to https://phabricator.wikimedia.org/P36941 and previous config saved to /var/cache/conftool/dbconfig/20221028-011357-ladsgroup.json [01:13:59] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:15:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T318950)', diff saved to https://phabricator.wikimedia.org/P36942 and previous config saved to /var/cache/conftool/dbconfig/20221028-011505-ladsgroup.json [01:18:59] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:23:05] PROBLEM - Check systemd state on kubernetes2020 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:26:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage [01:27:13] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [01:29:43] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [01:30:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P36943 and previous config saved to /var/cache/conftool/dbconfig/20221028-013011-ladsgroup.json [01:30:15] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage [01:33:59] (KubernetesAPILatency) firing: (16) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:12] PROBLEM - DNS on labstore1007.mgmt is CRITICAL: Domain labstore1007.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:41:30] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm 
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:54] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Hi @Volans - no problem, we can scrap the idea of having a "recycled status" in Netbox. For everything that gets deleted in Netbox, is there any feature or anyt... [01:45:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P36944 and previous config saved to /var/cache/conftool/dbconfig/20221028-014517-ladsgroup.json [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:28] RECOVERY - Check systemd state on kubernetes2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4048.ulsfo.wmnet with OS buster [02:00:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T318950)', diff saved to https://phabricator.wikimedia.org/P36945 and previous config saved to /var/cache/conftool/dbconfig/20221028-020024-ladsgroup.json [02:00:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [02:00:31] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [02:00:39] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [02:00:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [02:00:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [02:00:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [02:01:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [02:01:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T318950)', diff saved to https://phabricator.wikimedia.org/P36946 and previous config saved to /var/cache/conftool/dbconfig/20221028-020117-ladsgroup.json [02:04:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T318950)', diff saved to https://phabricator.wikimedia.org/P36947 and previous config saved to /var/cache/conftool/dbconfig/20221028-020425-ladsgroup.json [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:39] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:49] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2020 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [02:19:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to 
https://phabricator.wikimedia.org/P36948 and previous config saved to /var/cache/conftool/dbconfig/20221028-021932-ladsgroup.json [02:34:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P36949 and previous config saved to /var/cache/conftool/dbconfig/20221028-023438-ladsgroup.json [02:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [02:49:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T318950)', diff saved to https://phabricator.wikimedia.org/P36950 and previous config saved to /var/cache/conftool/dbconfig/20221028-024944-ladsgroup.json [02:49:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [02:49:51] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [02:50:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [02:50:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T318950)', diff saved to https://phabricator.wikimedia.org/P36951 and previous config saved to /var/cache/conftool/dbconfig/20221028-025006-ladsgroup.json [02:51:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T318950)', diff saved to https://phabricator.wikimedia.org/P36952 and previous config saved to /var/cache/conftool/dbconfig/20221028-025113-ladsgroup.json [03:05:57] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [03:06:21] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P36953 and previous config saved to /var/cache/conftool/dbconfig/20221028-030620-ladsgroup.json [03:12:33] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [03:14:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:17:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [03:17:17] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [03:21:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P36954 and previous config saved to /var/cache/conftool/dbconfig/20221028-032127-ladsgroup.json [03:24:51] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [03:30:05] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:36:03] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:36:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T318950)', diff saved to https://phabricator.wikimedia.org/P36955 and previous config saved to /var/cache/conftool/dbconfig/20221028-033633-ladsgroup.json [03:36:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1186.eqiad.wmnet with reason: Maintenance [03:36:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance [03:36:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T318950)', diff saved to https://phabricator.wikimedia.org/P36956 and previous config saved to /var/cache/conftool/dbconfig/20221028-033654-ladsgroup.json [03:39:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T318950)', diff saved to https://phabricator.wikimedia.org/P36957 and previous config saved to /var/cache/conftool/dbconfig/20221028-033902-ladsgroup.json [03:39:09] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [03:42:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [03:54:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P36958 and previous config saved to /var/cache/conftool/dbconfig/20221028-035409-ladsgroup.json [04:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [04:09:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P36959 and previous config saved to /var/cache/conftool/dbconfig/20221028-040915-ladsgroup.json [04:17:41] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:24:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T318950)', diff saved to https://phabricator.wikimedia.org/P36960 and previous config saved to /var/cache/conftool/dbconfig/20221028-042421-ladsgroup.json [04:24:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance [04:24:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance [04:24:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T318950)', diff saved to https://phabricator.wikimedia.org/P36961 and previous config saved to /var/cache/conftool/dbconfig/20221028-042443-ladsgroup.json [04:24:45] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [04:25:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T318950)', diff saved to https://phabricator.wikimedia.org/P36962 and previous config saved to /var/cache/conftool/dbconfig/20221028-042550-ladsgroup.json [04:40:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P36963 and previous config saved to /var/cache/conftool/dbconfig/20221028-044057-ladsgroup.json [04:52:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [04:53:25] (PS2) KartikMistry: Enable Section Translation in Hawaiian, Pashto and Xhosa WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/845573 
(https://phabricator.wikimedia.org/T317289) [04:56:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P36964 and previous config saved to /var/cache/conftool/dbconfig/20221028-045603-ladsgroup.json [04:57:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [05:06:22] (PS1) KartikMistry: Update cxserver to 2022-10-27-102021-production [deployment-charts] - https://gerrit.wikimedia.org/r/850315 (https://phabricator.wikimedia.org/T225494) [05:06:51] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:11:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T318950)', diff saved to https://phabricator.wikimedia.org/P36965 and previous config saved to /var/cache/conftool/dbconfig/20221028-051110-ladsgroup.json [05:11:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [05:11:17] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [05:11:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [05:19:55] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [05:25:37] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [05:34:14] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:49:45] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:07:41] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:18:13] (PS2) Hashar: opensearch: make upgrade-phatality.sh stricter [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) [06:28:57] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:41:17] (CR) Hashar: "Patchset 2 is a rebase I have send back to Gerrit by mistake." [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: Hashar) [06:48:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:50:41] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221028T0700) [07:04:01] (CR) Muehlenhoff: [C: +1] "Looks good!" 
[puppet] - https://gerrit.wikimedia.org/r/850233 (owner: Jbond) [07:29:51] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:43:18] SRE, ContentTranslation, Machine-Learning-Team, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (Pginer-WMF) p:Triage→High [07:49:05] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [07:49:07] (CR) Kosta Harlan: [C: +2] [labs] GrowthExperiments: Use d3.js with new impact module [mediawiki-config] - https://gerrit.wikimedia.org/r/850098 (https://phabricator.wikimedia.org/T318854) (owner: Kosta Harlan) [07:49:53] (Merged) jenkins-bot: [labs] GrowthExperiments: Use d3.js with new impact module [mediawiki-config] - https://gerrit.wikimedia.org/r/850098 (https://phabricator.wikimedia.org/T318854) (owner: Kosta Harlan) [07:50:50] (CR) Jaime Nuche: [C: +1] "Thanks for updating this!" 
[puppet] - https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: Ahmon Dancy) [07:56:01] (PS1) Slyngshede: data.yaml: Move user mfossati from restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/850409 (https://phabricator.wikimedia.org/T321772) [07:56:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [07:57:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [07:57:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [07:58:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [08:01:13] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.45 ms [08:17:57] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [08:20:06] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:21:25] (PS1) Cathal Mooney: Update IP for netflow1002 [homer/public] - https://gerrit.wikimedia.org/r/850411 [08:24:40] (Abandoned) Vgutierrez: ATS: Limit NUMA nodes usage on ats-tls [puppet] - https://gerrit.wikimedia.org/r/666871 (https://phabricator.wikimedia.org/T274888) (owner: Vgutierrez) [08:25:23] (Abandoned) Vgutierrez: Backport several fixes scheduled for 9.1.3 [debs/trafficserver] - https://gerrit.wikimedia.org/r/820121 (https://phabricator.wikimedia.org/T309651) (owner: Vgutierrez) [08:26:29] (Abandoned) Vgutierrez: ATS: Clean libhwloc5 pin [puppet] - https://gerrit.wikimedia.org/r/578179 (owner: Vgutierrez) [08:26:34] (Abandoned) Vgutierrez: ATS: Remove libhwloc5 pin [puppet] - https://gerrit.wikimedia.org/r/578180 (owner: Vgutierrez) [08:26:52] (Abandoned) Vgutierrez: Release 
8.1.0-unreleased-1wm1 [debs/trafficserver] (8.1.x) - https://gerrit.wikimedia.org/r/591308 (owner: Vgutierrez) [08:27:16] (Abandoned) Vgutierrez: Revert PR #7465 [debs/trafficserver] - https://gerrit.wikimedia.org/r/821777 (owner: Vgutierrez) [08:27:29] (Abandoned) Vgutierrez: Release 8.0.7-rc0-1wm3asan [debs/trafficserver] - https://gerrit.wikimedia.org/r/590993 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez) [08:28:29] (Abandoned) Vgutierrez: ATS: Provide http to https redirection logic in lua [puppet] - https://gerrit.wikimedia.org/r/603447 (https://phabricator.wikimedia.org/T254235) (owner: Vgutierrez) [08:29:23] RECOVERY - Host netflow1002 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [08:29:23] PROBLEM - Check systemd state on netflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:08] (CR) Vgutierrez: prometheus: Add ats header/body size total metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) (owner: BCornwall) [08:34:55] (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [08:35:56] (CR) Vgutierrez: "looks good, please provide a VTC for this." 
[puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall) [08:37:59] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/850153 (owner: Jbond) [08:39:09] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37825/console" [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: BBlack) [08:40:16] SRE, SRE-swift-storage, Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public on Wikimedia Commons - https://phabricator.wikimedia.org/T321869 (Peachey88) [08:40:44] Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) p:Triage→Low [08:41:55] (CR) Jelto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn) [08:42:31] (CR) Vgutierrez: [V: +1] "looks good, but profile::cache::haproxy::unified_certs needs to be updated to let the certs actually be deployed on the cp servers." [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: BBlack) [08:42:44] (CR) Vgutierrez: [V: +1 C: -1] Add digicert-2022 to available unified set [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: BBlack) [08:42:45] (PS1) Muehlenhoff: Make ganeti4006 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/850413 (https://phabricator.wikimedia.org/T317247) [08:44:07] Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) >>! @jhathaway > Thanks Brian for bringing up some alternative ideas! >>>! @bking >> I wonder if our energies might be better spent searching for >> alt... 
[08:44:48] (CR) Muehlenhoff: [C: +2] Make ganeti4006 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/850413 (https://phabricator.wikimedia.org/T317247) (owner: Muehlenhoff) [08:47:06] Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) >>! @bking > Thanks! Your perspective as a both a Puppet expert and relative n00b like me is very much appreciated. I hope you (and everyone else) will... [08:49:28] Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) >>>>! @bking >> I agree, it will be very time-consuming and painful to move off Puppet. But the current situation also seems painful and untenable. >>... [08:49:48] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37826/console" [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn) [08:49:57] (CR) Vgutierrez: [C: +1] readme: Add general notes for testing deps [software/acme-chief] - https://gerrit.wikimedia.org/r/848512 (owner: BCornwall) [08:50:47] (CR) Vgutierrez: [C: +1] trafficserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840141 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [08:51:23] (CR) Vgutierrez: [C: +1] docs: Remove outdated github/travis badges [debs/pybal] - https://gerrit.wikimedia.org/r/817918 (owner: Krinkle) [08:54:30] (CR) Cathal Mooney: [C: +2] Update IP for netflow1002 [homer/public] - https://gerrit.wikimedia.org/r/850411 (owner: Cathal Mooney) [08:55:01] RECOVERY - Check whether ferm is active by checking the default input chain on netflow1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:55:08] (Merged) 
jenkins-bot: Update IP for netflow1002 [homer/public] - https://gerrit.wikimedia.org/r/850411 (owner: Cathal Mooney) [08:55:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [08:56:23] RECOVERY - Check systemd state on mw2334 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:49] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff) [09:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [09:03:00] (PS1) Vgutierrez: aptrepo: Add thirdparty/haproxy26 [puppet] - https://gerrit.wikimedia.org/r/850416 (https://phabricator.wikimedia.org/T321775) [09:03:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [09:05:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster eqiad and group A [09:05:17] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4006.ulsfo.wmnet to cluster eqiad and group A [09:05:32] (CR) Btullis: Update the spark and spark-operator images (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis) [09:05:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1 [09:05:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and 
group 1 [09:10:16] (PS1) Vgutierrez: cache::haproxy: Allow choosing between HAProxy 2.4 and 2.6 [puppet] - https://gerrit.wikimedia.org/r/850417 (https://phabricator.wikimedia.org/T321775) [09:10:44] (CR) Klausman: [V: +1 C: +2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37827/console" [puppet] - https://gerrit.wikimedia.org/r/850191 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [09:11:28] (CR) CI reject: [V: -1] cache::haproxy: Allow choosing between HAProxy 2.4 and 2.6 [puppet] - https://gerrit.wikimedia.org/r/850417 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez) [09:13:16] (CR) JMeybohm: "This will require the operator service account to have privileges to create secrets in the cluster/namespace where is runs. But from what " [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis) [09:13:21] (PS2) Vgutierrez: cache::haproxy: Allow choosing between HAProxy 2.4 and 2.6 [puppet] - https://gerrit.wikimedia.org/r/850417 (https://phabricator.wikimedia.org/T321775) [09:13:33] (PS1) Klausman: wikilabels: maybe get the tuning.conf source part right [puppet] - https://gerrit.wikimedia.org/r/850418 (https://phabricator.wikimedia.org/T307389) [09:14:24] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37828/console" [puppet] - https://gerrit.wikimedia.org/r/850418 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [09:14:45] (CR) Klausman: [V: +1 C: +2] wikilabels: maybe get the tuning.conf source part right [puppet] - https://gerrit.wikimedia.org/r/850418 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [09:15:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance [09:15:53] 
!log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance [09:17:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1 [09:18:05] Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (MatthewVernon) [I'm not saying we should move to Ansible necessarily, but wanted to respond to something said up-thread :)] I've used Ansible a fair amount in p... [09:18:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:18:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:18:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [09:18:59] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1 [09:19:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [09:19:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36967 and previous config saved to /var/cache/conftool/dbconfig/20221028-091912-marostegui.json [09:19:18] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [09:20:46] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37829/console" [puppet] - https://gerrit.wikimedia.org/r/850417 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez) [09:21:25] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1105:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36968 and previous config saved to /var/cache/conftool/dbconfig/20221028-092125-marostegui.json [09:23:28] (PS1) Klausman: wikilabels: actually install Postgres [puppet] - https://gerrit.wikimedia.org/r/850419 (https://phabricator.wikimedia.org/T307389) [09:25:40] (CR) CI reject: [V: -1] wikilabels: actually install Postgres [puppet] - https://gerrit.wikimedia.org/r/850419 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [09:30:45] (PS1) Vgutierrez: cache::haproxy: Switch to HAProxy 2.6 on concurrency tracking instances [puppet] - https://gerrit.wikimedia.org/r/850420 (https://phabricator.wikimedia.org/T321775) [09:31:29] (CR) Elukey: wikilabels: actually install Postgres (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850419 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [09:32:36] (CR) Klausman: wikilabels: actually install Postgres (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850419 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [09:33:29] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37831/console" [puppet] - https://gerrit.wikimedia.org/r/850420 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez) [09:34:14] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:36:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P36969 and previous config saved to /var/cache/conftool/dbconfig/20221028-093631-marostegui.json [09:37:20] (Abandoned) Klausman: wikilabels: actually install Postgres 
[puppet] - https://gerrit.wikimedia.org/r/850419 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [09:40:20] (CR) Jelto: [V: +1 C: +1] "lgtm. I left one suggestion in-line. We can also refactor this later if you both agree, this is not blocking." [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn) [09:41:39] SRE, Traffic, Patch-For-Review: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (Vgutierrez) Open→In progress [09:45:12] (CR) Btullis: Update the spark and spark-operator images (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis) [09:46:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow1002.eqiad.wmnet [09:51:09] (PS1) Ladsgroup: auto_schema: Drop --include-masters option [software] - https://gerrit.wikimedia.org/r/850421 [09:51:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P36970 and previous config saved to /var/cache/conftool/dbconfig/20221028-095138-marostegui.json [09:52:21] (CR) Marostegui: [C: +1] auto_schema: Drop --include-masters option [software] - https://gerrit.wikimedia.org/r/850421 (owner: Ladsgroup) [09:52:41] (CR) Ladsgroup: [C: +2] auto_schema: Drop --include-masters option [software] - https://gerrit.wikimedia.org/r/850421 (owner: Ladsgroup) [09:53:36] !log drain ganeti4003 for eventual decom T317247 [09:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:42] T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 [09:53:54] (Merged) jenkins-bot: auto_schema: Drop --include-masters option [software] - https://gerrit.wikimedia.org/r/850421 (owner: Ladsgroup) [09:56:21] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow1002.eqiad.wmnet [09:56:41] PROBLEM - Check systemd state on netflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36971 and previous config saved to /var/cache/conftool/dbconfig/20221028-100644-marostegui.json [10:06:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:06:51] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:07:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:07:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T321123)', diff saved to https://phabricator.wikimedia.org/P36972 and previous config saved to /var/cache/conftool/dbconfig/20221028-100706-marostegui.json [10:09:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T321123)', diff saved to https://phabricator.wikimedia.org/P36973 and previous config saved to /var/cache/conftool/dbconfig/20221028-100918-marostegui.json [10:11:28] (PS1) Matthias Mullie: Enable ImageSuggestions on ca, no, fi & hiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064) [10:12:26] (PS2) Matthias Mullie: Enable ImageSuggestions on ca, no, fi & huwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064) [10:13:35] (PS1) Matthias Mullie: Schedule image suggestions for ca, no, fi & huwiki [puppet] - https://gerrit.wikimedia.org/r/850446 
(https://phabricator.wikimedia.org/T300064) [10:13:59] (CR) Hnowlan: [C: +1] "lgtm" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) (owner: Vlad.shapik) [10:14:25] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:14:43] (CR) Matthias Mullie: [C: -1] "Not to be merged until confirmed with the communities." [puppet] - https://gerrit.wikimedia.org/r/850446 (https://phabricator.wikimedia.org/T300064) (owner: Matthias Mullie) [10:14:53] (CR) Matthias Mullie: [C: -1] "Not to be merged until confirmed with the communities." [mediawiki-config] - https://gerrit.wikimedia.org/r/850425 (https://phabricator.wikimedia.org/T300064) (owner: Matthias Mullie) [10:17:44] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:19:16] RECOVERY - Check systemd state on netflow1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P36974 and previous config saved to /var/cache/conftool/dbconfig/20221028-102425-marostegui.json [10:26:57] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on puppetdb-test2001.codfw.wmnet with reason: puppetdb 7/bookworm tests [10:27:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on puppetdb-test2001.codfw.wmnet with reason: puppetdb 7/bookworm tests [10:39:24] (PS1) JMeybohm: Make Kubernetes version configurable [puppet] - https://gerrit.wikimedia.org/r/850449 [10:39:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P36975 and previous config saved to /var/cache/conftool/dbconfig/20221028-103932-marostegui.json [10:41:29] (CR) CI reject: [V: -1] 
Make Kubernetes version configurable [puppet] - https://gerrit.wikimedia.org/r/850449 (owner: JMeybohm) [10:42:49] SRE, ContentTranslation, Machine-Learning-Team, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (Pginer-WMF) [10:47:44] (PS2) JMeybohm: Make Kubernetes version configurable [puppet] - https://gerrit.wikimedia.org/r/850449 [10:54:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T321123)', diff saved to https://phabricator.wikimedia.org/P36976 and previous config saved to /var/cache/conftool/dbconfig/20221028-105438-marostegui.json [10:54:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:54:45] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:54:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:55:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:55:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:55:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36977 and previous config saved to /var/cache/conftool/dbconfig/20221028-105520-marostegui.json [10:57:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36978 and previous config saved to /var/cache/conftool/dbconfig/20221028-105733-marostegui.json [10:57:53] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti4003.ulsfo.wmnet with reason: Remove from 
cluster for eventual decom [10:58:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti4003.ulsfo.wmnet with reason: Remove from cluster for eventual decom [11:00:01] (CR) JMeybohm: [V: +1 C: +2] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37833/console" [puppet] - https://gerrit.wikimedia.org/r/849543 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [11:00:32] (PS1) Muehlenhoff: Remove ganeti4003 from Puppet [puppet] - https://gerrit.wikimedia.org/r/850451 (https://phabricator.wikimedia.org/T317247) [11:03:15] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37834/console" [puppet] - https://gerrit.wikimedia.org/r/850449 (owner: JMeybohm) [11:03:57] (CR) Muehlenhoff: [C: +2] Remove ganeti4003 from Puppet [puppet] - https://gerrit.wikimedia.org/r/850451 (https://phabricator.wikimedia.org/T317247) (owner: Muehlenhoff) [11:05:34] !log cparle@deploy1002 Started deploy [airflow-dags/platform_eng@c849762]: (no justification provided) [11:05:49] !log cparle@deploy1002 Finished deploy [airflow-dags/platform_eng@c849762]: (no justification provided) (duration: 00m 15s) [11:07:22] (PS1) AikoChou: ml-services: update revert-risk's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/850452 (https://phabricator.wikimedia.org/T321594) [11:07:53] (PS3) JMeybohm: Make Kubernetes version configurable [puppet] - https://gerrit.wikimedia.org/r/850449 (https://phabricator.wikimedia.org/T307943) [11:09:33] (PS1) Hnowlan: requirements: add missing pycurl package [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/850453 [11:11:03] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4003.ulsfo.wmnet [11:12:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1146:3312', diff saved to https://phabricator.wikimedia.org/P36979 and previous config saved to /var/cache/conftool/dbconfig/20221028-111240-marostegui.json [11:12:59] (CR) Klausman: [C: +2] ml-services: update revert-risk's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/850452 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou) [11:15:34] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:16:32] (Merged) jenkins-bot: ml-services: update revert-risk's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/850452 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou) [11:17:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1029 as es1 master, es1030 as es2 master, es1031 as es3 master', diff saved to https://phabricator.wikimedia.org/P36980 and previous config saved to /var/cache/conftool/dbconfig/20221028-111707-marostegui.json [11:18:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1026 es1027 es1028 for upgrade', diff saved to https://phabricator.wikimedia.org/P36981 and previous config saved to /var/cache/conftool/dbconfig/20221028-111805-root.json [11:20:29] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[11:23:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36982 and previous config saved to /var/cache/conftool/dbconfig/20221028-112317-root.json [11:26:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:26:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti4003.ulsfo.wmnet [11:26:42] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti4003.ulsfo.wmnet` - ganeti4003.ulsfo.wmnet (**PASS**)... [11:27:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P36983 and previous config saved to /var/cache/conftool/dbconfig/20221028-112746-marostegui.json [11:27:50] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:28:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36984 and previous config saved to /var/cache/conftool/dbconfig/20221028-112818-root.json [11:33:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36985 and previous config saved to /var/cache/conftool/dbconfig/20221028-113324-root.json [11:33:59] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:38:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 3%: After upgrade', 
diff saved to https://phabricator.wikimedia.org/P36986 and previous config saved to /var/cache/conftool/dbconfig/20221028-113822-root.json [11:42:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36987 and previous config saved to /var/cache/conftool/dbconfig/20221028-114253-marostegui.json [11:42:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [11:42:59] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [11:43:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [11:43:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:43:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36988 and previous config saved to /var/cache/conftool/dbconfig/20221028-114323-root.json [11:43:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:43:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T321123)', diff saved to https://phabricator.wikimedia.org/P36989 and previous config saved to /var/cache/conftool/dbconfig/20221028-114332-marostegui.json [11:45:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T321123)', diff saved to https://phabricator.wikimedia.org/P36990 and previous config saved to /var/cache/conftool/dbconfig/20221028-114544-marostegui.json [11:48:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 3%: After upgrade', diff 
saved to https://phabricator.wikimedia.org/P36991 and previous config saved to /var/cache/conftool/dbconfig/20221028-114829-root.json [11:53:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36992 and previous config saved to /var/cache/conftool/dbconfig/20221028-115327-root.json [11:54:02] Puppet, Infrastructure-Foundations: Consider migrating alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (jbond) > but we're unable to migrate off a version that has been EOL for nearly 2 years without external help. Let me first start by saying that if there was so... [11:58:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36993 and previous config saved to /var/cache/conftool/dbconfig/20221028-115828-root.json [12:00:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P36994 and previous config saved to /var/cache/conftool/dbconfig/20221028-120050-marostegui.json [12:03:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36995 and previous config saved to /var/cache/conftool/dbconfig/20221028-120334-root.json [12:08:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36996 and previous config saved to /var/cache/conftool/dbconfig/20221028-120832-root.json [12:09:56] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff) [12:13:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36997 and previous config 
saved to /var/cache/conftool/dbconfig/20221028-121333-root.json [12:15:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P36998 and previous config saved to /var/cache/conftool/dbconfig/20221028-121557-marostegui.json [12:18:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36999 and previous config saved to /var/cache/conftool/dbconfig/20221028-121839-root.json [12:23:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37000 and previous config saved to /var/cache/conftool/dbconfig/20221028-122337-root.json [12:28:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37001 and previous config saved to /var/cache/conftool/dbconfig/20221028-122838-root.json [12:30:47] (03PS1) 10Slyngshede: Bitu IDM, initial checkin [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) [12:31:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T321123)', diff saved to https://phabricator.wikimedia.org/P37002 and previous config saved to /var/cache/conftool/dbconfig/20221028-123103-marostegui.json [12:31:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:31:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:31:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T321123)', diff saved to https://phabricator.wikimedia.org/P37003 and previous config saved to /var/cache/conftool/dbconfig/20221028-123125-marostegui.json [12:32:40] (03CR) 10Slyngshede: 
"More or less standard everything. License is add as a LICENSE file. Do we want/need SPDX headers in each file?" [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: 10Slyngshede) [12:33:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T321123)', diff saved to https://phabricator.wikimedia.org/P37004 and previous config saved to /var/cache/conftool/dbconfig/20221028-123337-marostegui.json [12:33:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37005 and previous config saved to /var/cache/conftool/dbconfig/20221028-123344-root.json [12:36:41] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Initial Django project setup - https://phabricator.wikimedia.org/T319410 (10SLyngshede-WMF) I think it's a good idea to borrow the docker-compose idea from Striker. We already know that we'll need the LDAP container. [12:37:04] (03CR) 10Jelto: [V: 03+1] "This change is ready for review." 
[puppet] - https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: Jelto) [12:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37006 and previous config saved to /var/cache/conftool/dbconfig/20221028-123842-root.json [12:43:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37007 and previous config saved to /var/cache/conftool/dbconfig/20221028-124343-root.json [12:48:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37008 and previous config saved to /var/cache/conftool/dbconfig/20221028-124845-marostegui.json [12:48:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37009 and previous config saved to /var/cache/conftool/dbconfig/20221028-124849-root.json [12:53:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37010 and previous config saved to /var/cache/conftool/dbconfig/20221028-125346-root.json [12:55:41] (PS1) Jbond: interface_primary: dont flush facts on facter4 [puppet] - https://gerrit.wikimedia.org/r/850467 [12:56:00] (PS1) Muehlenhoff: miscweb: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850468 (https://phabricator.wikimedia.org/T308013) [12:56:02] (PS1) Muehlenhoff: thanos: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850469 (https://phabricator.wikimedia.org/T308013) [12:56:04] (PS1) Muehlenhoff: mail: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850470 (https://phabricator.wikimedia.org/T308013) [12:56:06] (PS1) Muehlenhoff: dumps: Add SPDX headers [puppet] - 
https://gerrit.wikimedia.org/r/850471 (https://phabricator.wikimedia.org/T308013) [12:56:08] (PS1) Muehlenhoff: base: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850472 (https://phabricator.wikimedia.org/T308013) [12:56:10] (PS1) Muehlenhoff: installserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850473 (https://phabricator.wikimedia.org/T308013) [12:56:12] (PS1) Muehlenhoff: parsoid: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850474 (https://phabricator.wikimedia.org/T308013) [12:56:14] (PS1) Muehlenhoff: mediabackup: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850475 (https://phabricator.wikimedia.org/T308013) [12:56:16] (PS1) Muehlenhoff: ci: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850476 (https://phabricator.wikimedia.org/T308013) [12:56:18] (PS1) Muehlenhoff: hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850477 (https://phabricator.wikimedia.org/T308013) [12:56:39] (CR) CI reject: [V: -1] interface_primary: dont flush facts on facter4 [puppet] - https://gerrit.wikimedia.org/r/850467 (owner: Jbond) [12:57:39] (PS2) Jbond: interface_primary: dont flush facts on facter4 [puppet] - https://gerrit.wikimedia.org/r/850467 [12:58:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37011 and previous config saved to /var/cache/conftool/dbconfig/20221028-125848-root.json [13:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [13:02:40] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: 
Create mw-api-int helmfile deployment - https://phabricator.wikimedia.org/T321895 (10Clement_Goubert) [13:03:14] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-api-ext helmfile deployment - https://phabricator.wikimedia.org/T321896 (10Clement_Goubert) [13:03:41] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-jobrunner helmfile deployment - https://phabricator.wikimedia.org/T321897 (10Clement_Goubert) [13:03:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37012 and previous config saved to /var/cache/conftool/dbconfig/20221028-130352-marostegui.json [13:04:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37013 and previous config saved to /var/cache/conftool/dbconfig/20221028-130400-root.json [13:04:17] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-videoscaler helmfile deployment - https://phabricator.wikimedia.org/T321899 (10Clement_Goubert) [13:04:40] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-web helmfile deployment - https://phabricator.wikimedia.org/T321900 (10Clement_Goubert) [13:08:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37014 and previous config saved to /var/cache/conftool/dbconfig/20221028-130851-root.json [13:12:19] (03CR) 10CDanis: [C: 03+1] cache::haproxy: Switch to HAProxy 2.6 on concurrency tracking instances [puppet] - 10https://gerrit.wikimedia.org/r/850420 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [13:12:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to drbd [13:13:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: After upgrade', diff saved to 
https://phabricator.wikimedia.org/P37015 and previous config saved to /var/cache/conftool/dbconfig/20221028-131353-root.json [13:15:57] (03CR) 10Jbond: [C: 03+2] interface_primary: dont flush facts on facter4 [puppet] - 10https://gerrit.wikimedia.org/r/850467 (owner: 10Jbond) [13:18:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T321123)', diff saved to https://phabricator.wikimedia.org/P37016 and previous config saved to /var/cache/conftool/dbconfig/20221028-131858-marostegui.json [13:19:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:19:05] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [13:19:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P37017 and previous config saved to /var/cache/conftool/dbconfig/20221028-131905-root.json [13:19:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:19:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P37018 and previous config saved to /var/cache/conftool/dbconfig/20221028-131920-marostegui.json [13:20:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P37019 and previous config saved to /var/cache/conftool/dbconfig/20221028-132032-marostegui.json [13:22:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to drbd [13:23:27] (03PS1) 10Vgutierrez: mtail: Provide trafficserver_backend_cache_result_code_client_ttfb [puppet] - 10https://gerrit.wikimedia.org/r/850481 
(https://phabricator.wikimedia.org/T321484) [13:26:01] (03CR) 10CI reject: [V: 04-1] mtail: Provide trafficserver_backend_cache_result_code_client_ttfb [puppet] - 10https://gerrit.wikimedia.org/r/850481 (https://phabricator.wikimedia.org/T321484) (owner: 10Vgutierrez) [13:28:11] PROBLEM - Check systemd state on dse-k8s-etcd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:02] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (10jbond) [13:35:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37020 and previous config saved to /var/cache/conftool/dbconfig/20221028-133538-marostegui.json [13:38:53] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Hibashaath - https://phabricator.wikimedia.org/T321903 (10KCVelaga_WMF) [13:39:56] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Hibashaath - https://phabricator.wikimedia.org/T321903 (10TAndic) Approving this request as @HShaath-WMF's direct manager. [13:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [13:50:29] (03CR) 10Ladsgroup: "I'd be happy to merge this once okay'ed with the community under the condition that I'll block adding any more wikis to this system. 
The a" [puppet] - 10https://gerrit.wikimedia.org/r/850446 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [13:50:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37021 and previous config saved to /var/cache/conftool/dbconfig/20221028-135045-marostegui.json [13:52:16] 10SRE, 10Infrastructure-Foundations: Puppet should support VERSION_CODENAME to detect a distro - https://phabricator.wikimedia.org/T321906 (10MoritzMuehlenhoff) [13:58:31] 10SRE, 10Infrastructure-Foundations: Puppet should support VERSION_CODENAME to detect a distro - https://phabricator.wikimedia.org/T321906 (10MoritzMuehlenhoff) [13:58:43] (03PS1) 10Ssingh: cp4052: update site.pp and related configs for cp (upload) role [puppet] - 10https://gerrit.wikimedia.org/r/850483 (https://phabricator.wikimedia.org/T317244) [14:00:18] (03CR) 10Andrew Bogott: [C: 03+1] global: replace labsproject by wmcs_project [puppet] - 10https://gerrit.wikimedia.org/r/849473 (owner: 10David Caro) [14:04:10] 10SRE, 10Traffic: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10Vgutierrez) 05Resolved→03Open reopening this as I've found some issues: metric names aren't consistent with existing ones, all the previous metrics are named using... 
[14:05:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P37022 and previous config saved to /var/cache/conftool/dbconfig/20221028-140552-marostegui.json [14:05:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:05:59] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [14:06:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:06:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T321123)', diff saved to https://phabricator.wikimedia.org/P37023 and previous config saved to /var/cache/conftool/dbconfig/20221028-140613-marostegui.json [14:06:56] (03PS1) 10Muehlenhoff: Add support for bookworm to apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/850484 (https://phabricator.wikimedia.org/T321783) [14:07:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:08:29] 10SRE, 10Traffic: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10Vgutierrez) oh, and we're seeing some errors like: ` Oct 27 02:06:46 cp4043 prometheus-ats-config[1787]: Traffic Server: failed to fetch proxy.config.net.max_connection... 
[14:09:11] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:09:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T321123)', diff saved to https://phabricator.wikimedia.org/P37024 and previous config saved to /var/cache/conftool/dbconfig/20221028-140952-marostegui.json [14:10:09] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for - https://phabricator.wikimedia.org/T321902 (10Zabe) [14:10:11] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Hibashaath - https://phabricator.wikimedia.org/T321903 (10Zabe) [14:14:19] (03PS2) 10Vgutierrez: mtail: Provide trafficserver_backend_cache_result_code_client_ttfb [puppet] - 10https://gerrit.wikimedia.org/r/850481 (https://phabricator.wikimedia.org/T321484) [14:15:55] (03PS3) 10Vgutierrez: mtail: Provide trafficserver_backend_cache_result_code_client_ttfb [puppet] - 10https://gerrit.wikimedia.org/r/850481 (https://phabricator.wikimedia.org/T321484) [14:20:05] (03CR) 10Vgutierrez: [C: 03+2] mtail: Provide trafficserver_backend_cache_result_code_client_ttfb [puppet] - 10https://gerrit.wikimedia.org/r/850481 (https://phabricator.wikimedia.org/T321484) (owner: 10Vgutierrez) [14:20:59] (03PS2) 10BBlack: Add digicert-2022 to available unified set [puppet] - 10https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) [14:21:01] (03PS2) 10BBlack: Switch drmrs, eqsin, esams to digicert-2022 [puppet] - 10https://gerrit.wikimedia.org/r/850287 (https://phabricator.wikimedia.org/T313328) [14:21:17] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 13.87 ms [14:21:24] (03CR) 10BBlack: Add digicert-2022 to available unified set (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: 10BBlack) [14:22:58] (03CR) 10Jbond: [C: 03+2] C:statistics::rsyncd: use the nobody user explicitly [puppet] - 10https://gerrit.wikimedia.org/r/850233 (owner: 10Jbond) 
[14:24:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37025 and previous config saved to /var/cache/conftool/dbconfig/20221028-142459-marostegui.json [14:26:27] (03CR) 10BBlack: [C: 03+2] "PCC says no-op on cache nodes, as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/37838/" [puppet] - 10https://gerrit.wikimedia.org/r/849632 (owner: 10BBlack) [14:29:49] (03CR) 10Thcipriani: [C: 03+1] data.yaml: Move user mfossati from restricted to deployment [puppet] - 10https://gerrit.wikimedia.org/r/850409 (https://phabricator.wikimedia.org/T321772) (owner: 10Slyngshede) [14:30:32] (03PS1) 10Muehlenhoff: Add a stub base file for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/850487 [14:31:25] (03CR) 10Ladsgroup: Schedule image suggestions for ca, no, fi & huwiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850446 (https://phabricator.wikimedia.org/T300064) (owner: 10Matthias Mullie) [14:32:40] (03CR) 10CI reject: [V: 04-1] Add a stub base file for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/850487 (owner: 10Muehlenhoff) [14:33:09] 10SRE, 10Traffic, 10Performance-Team (Radar): Track TTFB per Cache Status Code in ATS - https://phabricator.wikimedia.org/T321484 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [14:35:36] (03PS2) 10Herron: slo_dashboards: move slo definitions and defaults to files [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) [14:36:10] (03PS2) 10Muehlenhoff: Add a stub base file for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/850487 [14:37:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to plain [14:38:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1002.eqiad.wmnet to plain [14:38:14] (03CR) 10Herron: [C: 03+2] slo_dashboards: 
move slo definitions and defaults to files (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [14:38:16] (03CR) 10Herron: [V: 03+2 C: 03+2] slo_dashboards: move slo definitions and defaults to files [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [14:40:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37026 and previous config saved to /var/cache/conftool/dbconfig/20221028-144005-marostegui.json [14:40:11] RECOVERY - Check systemd state on dse-k8s-etcd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:51] (03CR) 10Andrew Bogott: [C: 03+2] P:wmcs::instance: stop provisioning /etc/wmflabs-* on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/840791 (owner: 10Majavah) [14:42:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [14:43:53] (03CR) 10Jbond: [C: 03+1] dumps: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850471 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:44:05] (03CR) 10Jbond: [C: 03+1] base: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850472 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:45:00] (03CR) 10Jbond: [V: 03+1] "@Daniel ill leave this for you to merge along with your change" [puppet] - 10https://gerrit.wikimedia.org/r/850153 (owner: 10Jbond) [14:45:01] 10SRE, 10Observability-Metrics: SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10herron) 05Open→03Resolved I think this is resolvable at this point. Please reopen if I am mistaken! 
[14:48:05] (03PS1) 10Jbond: C:ldap::client::utils: migrate to debian::codename function [puppet] - 10https://gerrit.wikimedia.org/r/850498 (https://phabricator.wikimedia.org/T321906) [14:48:07] (03PS1) 10Jbond: C:debian: add support for testing [puppet] - 10https://gerrit.wikimedia.org/r/850499 (https://phabricator.wikimedia.org/T321906) [14:49:55] (03PS2) 10Muehlenhoff: hadoop: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850477 (https://phabricator.wikimedia.org/T308013) [14:51:01] (03CR) 10Jbond: [V: 03+1] "fyi ill also leave my change to be merged along with this change by whoever merges" [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [14:51:44] (03CR) 10Herron: slo_dashboards: move to one SLO/SLI per dashboard (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [14:52:08] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (10Jclark-ctr) @Jgreen This still shows active in netbox. Please update to decommission when it's ready [14:52:12] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (10Jclark-ctr) @Jgreen This still shows active in netbox. 
Please update to decommission when it's ready [14:52:33] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/850484 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [14:52:35] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [14:52:55] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [14:53:11] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster [14:53:16] (03CR) 10JHathaway: [C: 03+1] "looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/850470 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:55:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T321123)', diff saved to https://phabricator.wikimedia.org/P37027 and previous config saved to /var/cache/conftool/dbconfig/20221028-145512-marostegui.json [14:55:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance [14:55:19] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [14:55:27] (03CR) 10Jbond: [C: 03+1] "LGTM however im not sure we ever used this and wonder if we should just remove it? 
i think i originally wanted to use it to detect manua" [puppet] - 10https://gerrit.wikimedia.org/r/850487 (owner: 10Muehlenhoff) [14:55:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance [14:55:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T321123)', diff saved to https://phabricator.wikimedia.org/P37028 and previous config saved to /var/cache/conftool/dbconfig/20221028-145533-marostegui.json [14:56:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [14:56:37] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (Blocking 🧱), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10thcipriani) [14:56:47] (03PS1) 10Majavah: P:wmcs::metricsinfra: drop unused variable [puppet] - 10https://gerrit.wikimedia.org/r/850503 [14:56:51] (03PS1) 10Majavah: hieradata: replace metricsinfra prometheus01 [puppet] - 10https://gerrit.wikimedia.org/r/850504 (https://phabricator.wikimedia.org/T310799) [14:57:10] (03CR) 10Jbond: [C: 03+2] C:ldap::client::utils: migrate to debian::codename function [puppet] - 10https://gerrit.wikimedia.org/r/850498 (https://phabricator.wikimedia.org/T321906) (owner: 10Jbond) [14:57:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Jclark-ctr) I am taking over this ticket @nskaggs what day of the week works best for you to do this move? 
[14:57:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T321123)', diff saved to https://phabricator.wikimedia.org/P37029 and previous config saved to /var/cache/conftool/dbconfig/20221028-145746-marostegui.json [14:58:36] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: replace metricsinfra prometheus01 [puppet] - 10https://gerrit.wikimedia.org/r/850504 (https://phabricator.wikimedia.org/T310799) (owner: 10Majavah) [14:59:24] (03CR) 10Andrew Bogott: [C: 03+2] P:wmcs::metricsinfra: drop unused variable [puppet] - 10https://gerrit.wikimedia.org/r/850503 (owner: 10Majavah) [14:59:47] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:00:11] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: enable restrict_firewall for Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/849499 (https://phabricator.wikimedia.org/T317341) (owner: 10Jelto) [15:00:24] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (10thcipriani) p:05Medium→03Low a:03hashar [15:01:14] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Priority Backlog 📥): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (10thcipriani) @hashar is this waiting on review, or are y... 
[15:03:09] (03PS1) 10Muehlenhoff: Stop installing the base packages list for now [puppet] - 10https://gerrit.wikimedia.org/r/850508 [15:03:12] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37841/console" [puppet] - 10https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: 10BBlack) [15:05:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [15:05:55] (03CR) 10Muehlenhoff: Stop installing the base packages list for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850508 (owner: 10Muehlenhoff) [15:09:02] (03CR) 10Jbond: [C: 03+1] Stop installing the base packages list for now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850508 (owner: 10Muehlenhoff) [15:10:57] RECOVERY - Confd vcl based reload on cp4045 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [15:12:17] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850475 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:12:24] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster [15:12:26] (03PS2) 10Jcrespo: mediabackup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850475 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:12:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37030 and previous config saved to /var/cache/conftool/dbconfig/20221028-151252-marostegui.json [15:23:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) @cmjohnson looked at druid10[09-11] bios has not been 
configured yet. no ip address in set for idrac have you ran the... [15:23:37] RECOVERY - Confd vcl based reload on cp4047 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [15:23:55] RECOVERY - Confd vcl based reload on cp4049 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [15:24:15] (03PS6) 10Majavah: openstack::keystone: provision new security group rules for metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) [15:24:42] RECOVERY - Confd vcl based reload on cp4037 is OK: reload-vcl successfully ran 0h, 1 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [15:28:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37031 and previous config saved to /var/cache/conftool/dbconfig/20221028-152800-marostegui.json [15:28:16] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Hibashaath - https://phabricator.wikimedia.org/T321902 (10KCVelaga_WMF) [15:28:59] (03PS1) 10Ssingh: Revert "Depool ulsfo for cp hosts hardware refresh" [dns] - 10https://gerrit.wikimedia.org/r/850436 [15:29:49] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Hghani - https://phabricator.wikimedia.org/T321910 (10CMyrick-WMF) Approving this request as @Hghani 's direct manager. 
[15:32:12] (03CR) 10David Caro: [C: 03+1] hieradata: replace metricsinfra prometheus01 [puppet] - 10https://gerrit.wikimedia.org/r/850504 (https://phabricator.wikimedia.org/T310799) (owner: 10Majavah) [15:34:14] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:35:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10nskaggs) >>! In T313445#8352940, @Jclark-ctr wrote: > I am taking over this ticket @nskaggs what day of the week works best for you to do this... [15:40:54] (03PS1) 10Jdlrobson: ReadingLists on beta cluster for authenticated users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850516 (https://phabricator.wikimedia.org/T317935) [15:43:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T321123)', diff saved to https://phabricator.wikimedia.org/P37033 and previous config saved to /var/cache/conftool/dbconfig/20221028-154307-marostegui.json [15:43:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance [15:43:14] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [15:43:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance [15:43:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T321123)', diff saved to https://phabricator.wikimedia.org/P37034 and previous config saved to /var/cache/conftool/dbconfig/20221028-154328-marostegui.json [15:45:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T321123)', diff saved to 
https://phabricator.wikimedia.org/P37035 and previous config saved to /var/cache/conftool/dbconfig/20221028-154541-marostegui.json [15:49:44] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [15:49:52] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp4052.ulsfo.wmnet with OS buster [15:50:24] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster [15:50:28] (03PS1) 10CDanis: Remove expensive newconnrate logging & tweak concurrency [puppet] - 10https://gerrit.wikimedia.org/r/850517 (https://phabricator.wikimedia.org/T306580) [15:50:30] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp4052.ulsfo.wmnet with OS buster executed with errors: - cp4052 (**FAI... 
[15:50:54] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [15:53:16] (03CR) 10CDanis: [C: 03+2] "pcc lgtm https://puppet-compiler.wmflabs.org/pcc-worker1003/37842/cp3050.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/850517 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [15:58:25] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:00:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37036 and previous config saved to /var/cache/conftool/dbconfig/20221028-160047-marostegui.json [16:02:17] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster [16:03:24] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED [16:04:04] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4009.mgmt.ulsfo.wmnet with reboot policy FORCED [16:04:18] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:04:44] jouncebot: now [16:04:44] For the next 14 hour(s) and 55 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221028T0700) [16:05:28] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:06:03] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED [16:07:07] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4010.mgmt.ulsfo.wmnet with reboot policy FORCED [16:08:08] cjming: for a labs only change to operations/mediawiki-config, merging on a Friday is fine. 
Just make sure it's fetched down to the deployment server so there's no surprises for deployers on Monday (running "scap backport" will do this automagically now) [16:08:42] thcipriani: thanks - appreciate it [16:10:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850516 (https://phabricator.wikimedia.org/T317935) (owner: 10Jdlrobson) [16:11:33] (03Merged) 10jenkins-bot: ReadingLists on beta cluster for authenticated users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850516 (https://phabricator.wikimedia.org/T317935) (owner: 10Jdlrobson) [16:13:17] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [16:14:24] !log deployed ReadingLists on beta cluster for authenticated users - https://gerrit.wikimedia.org/r/850516 (https://phabricator.wikimedia.org/T317935) [16:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:15:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37037 and previous config saved to /var/cache/conftool/dbconfig/20221028-161555-marostegui.json [16:16:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:17:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:17:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:18:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:21:01] !log robh@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster [16:22:21] PROBLEM - SSH on mw1334.mgmt is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:22:38] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:24:02] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:27:18] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [16:27:26] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4052.ulsfo.wmnet with OS buster [16:28:59] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:29:11] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [16:29:19] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4052.ulsfo.wmnet with OS buster [16:31:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T321123)', diff saved to https://phabricator.wikimedia.org/P37038 and previous config saved to /var/cache/conftool/dbconfig/20221028-163102-marostegui.json [16:31:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:31:09] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [16:31:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:31:33] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [16:35:36] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - 
https://phabricator.wikimedia.org/T317244 (10RobH) [16:38:06] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@62b4181]: testing scap since we are having problems with other instances [16:38:11] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@62b4181]: testing scap since we are having problems with other instances (duration: 00m 04s) [16:40:09] (03CR) 10Ssingh: [C: 03+2] cp4052: update site.pp and related configs for cp (upload) role [puppet] - 10https://gerrit.wikimedia.org/r/850483 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [16:42:24] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850469 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [16:42:38] (03CR) 10Filippo Giunchedi: [C: 03+1] Add support for bookworm to apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/850484 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [16:46:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10fnegri) @Jclark-ctr Monday 31st would be fine, or any other day except Tuesday. If I understand correctly, we need to depool dbproxy1019 and wai... 
[16:47:11] PROBLEM - Check systemd state on ms-be2068 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:54:35] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:57:25] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [16:58:59] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:59:57] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2068 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:02:31] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [17:07:01] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@c849762]: (no justification provided) [17:07:12] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@c849762]: (no justification provided) (duration: 00m 11s) [17:09:40] !log cparle@deploy1002 Started deploy [airflow-dags/platform_eng@c849762]: (no justification provided) [17:09:45] !log cparle@deploy1002 Finished deploy [airflow-dags/platform_eng@c849762]: (no justification 
provided) (duration: 00m 05s) [17:10:35] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [17:12:59] RECOVERY - Check systemd state on ms-be2068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:59] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:28:09] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS buster [17:30:22] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2068 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:31:06] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4052.ulsfo.wmnet,service=ats-be [17:31:06] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4052.ulsfo.wmnet,service=ats-tls [17:31:07] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp4052.ulsfo.wmnet,service=varnish-fe [17:31:09] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet [17:33:06] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - 
https://phabricator.wikimedia.org/T317244 (10ssingh) [17:33:25] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ssingh) [17:35:20] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ssingh) Traffic update: all the new cp hosts in ulsfo are marked active and pooled. Rob: Feel free to mark this as resolved. Thanks to @RobH, @Papaul, @BBlack, @c... [17:36:14] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10RobH) 05Open→03Resolved [17:37:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [17:42:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [17:46:28] (03CR) 10BBlack: [C: 03+1] Revert "Depool ulsfo for cp hosts hardware refresh" [dns] - 10https://gerrit.wikimedia.org/r/850436 (owner: 10Ssingh) [17:47:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [17:50:07] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Ilooremeta - https://phabricator.wikimedia.org/T321918 (10KCVelaga_WMF) Approved from the team's side. @CMacholan can chime in if additional approval is necessary. [17:52:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [17:54:46] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:55:20] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:00:10] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:24] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:01:00] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:01:59] (03CR) 10BCornwall: [C: 03+2] readme: Add general notes for testing deps [software/acme-chief] - 10https://gerrit.wikimedia.org/r/848512 (owner: 10BCornwall) [18:08:31] !log 
bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [18:08:32] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [18:11:01] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [18:20:45] (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:28:10] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q2), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10lmata) [18:30:12] 10SRE, 10Icinga, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q2): PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL - https://phabricator.wikimedia.org/T311926 (10lmata) [18:30:17] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q2): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata) [18:32:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:37:09] (03CR) 10Andrew Bogott: [C: 03+2] "I have hunted for a while but not found an official way to do this via neutron." 
[puppet] - 10https://gerrit.wikimedia.org/r/806418 (https://phabricator.wikimedia.org/T288108) (owner: 10Majavah) [18:44:26] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:52:31] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@2326f9c]: Import cirrus indexes to hdfs [18:54:39] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@2326f9c]: Import cirrus indexes to hdfs (duration: 02m 07s) [18:59:15] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:04:49] (03CR) 10Dzahn: [C: 03+1] "looks easy enough but still makes me wonder if it will need manual actions on each scap::master host when the repo remote changes" [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy) [19:06:16] (03CR) 10Dzahn: [C: 03+1] "well ok, there are just 2 scap::masters in prod, so can do. 
not sure about cloud though" [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy) [19:11:49] (03CR) 10Dzahn: [C: 03+2] scap::master: Clone the scap repo from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy) [19:14:54] (03CR) 10Dzahn: [C: 03+2] git::clone: Append .git to clone url for gitlab source [puppet] - 10https://gerrit.wikimedia.org/r/850249 (owner: 10Ahmon Dancy) [19:17:38] !log contint* - changing source for scap repo to gitlab - gerrit:850246 T321847 [19:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:44] T321847: Update scap documentation and other references for new GitLab location - https://phabricator.wikimedia.org/T321847 [19:19:28] (03CR) 10Dzahn: [C: 03+2] "this was a complete noop on contint* servers. I did not make any manual changes to the repo/remote config. The real test will be when some" [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy) [19:20:00] mutante: Thanks! [19:21:52] no problem. 
so yea, the real test would be when there is an actual change in the scap repo [19:22:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [19:22:18] (03CR) 10Ahmon Dancy: scap::master: Clone the scap repo from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy) [19:22:33] also it's "ensure present" not "latest", so puppet does not pull [19:22:42] so whatever/whoever does the pull [19:23:17] will see if it causes anything [19:24:32] (03CR) 10Dzahn: [C: 03+2] scap::master: Clone the scap repo from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) (owner: 10Ahmon Dancy) [19:27:53] I'm not even sure what uses /srv/deployment/scap anymore. [19:28:15] (03CR) 10Dzahn: [C: 03+2] miscweb: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850468 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [19:28:35] ok :) [19:28:39] good enough [19:32:12] Are https://gerrit.wikimedia.org/r/c/operations/puppet/+/850153/ and https://gerrit.wikimedia.org/r/c/operations/puppet/+/849699/ on your radar for today? 
[19:36:17] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37844/parse2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/850474 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [19:36:30] (03PS2) 10Dzahn: parsoid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/850474 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [19:41:39] yes, but I really know it when I get to it :) [19:42:13] haha. ok. I'm going to go afk for a bit. I will check back later. [20:03:48] (03PS1) 10Andrew Bogott: wmfkeystonehooks: fix a copy/paste error [puppet] - 10https://gerrit.wikimedia.org/r/850539 (https://phabricator.wikimedia.org/T288108) [20:03:50] (03PS1) 10Andrew Bogott: wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) [20:05:13] (03CR) 10CI reject: [V: 04-1] wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott) [20:07:06] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:08:36] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: fix a copy/paste error [puppet] - 10https://gerrit.wikimedia.org/r/850539 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott) [20:09:54] (03PS2) 10Andrew Bogott: wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) [20:10:38] (03CR) 10CI reject: [V: 04-1] wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott) [20:12:25] (03CR) 10Nskaggs: [C: 03+1] "ports to ports, ips to ips :-)" [puppet] - 10https://gerrit.wikimedia.org/r/850539 
(https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott) [20:14:03] (03CR) 10Dzahn: [C: 03+1] admin: add appledora to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845033 (https://phabricator.wikimedia.org/T321086) (owner: 10Herron) [20:16:05] (03CR) 10Nskaggs: [C: 03+1] wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott) [20:19:20] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.14 ms [20:20:23] (03PS3) 10Andrew Bogott: wmfkeystonehooks: convert a couple of config opts to StrOpt [puppet] - 10https://gerrit.wikimedia.org/r/850540 (https://phabricator.wikimedia.org/T288108) [20:22:50] (03PS1) 10Dzahn: devtools: set profile::gitlab::runner::registration_token: private [puppet] - 10https://gerrit.wikimedia.org/r/850541 [20:26:37] (03PS1) 10Dzahn: devtools: add profile::phabricator::main::dumps_rsync_clients: [] [puppet] - 10https://gerrit.wikimedia.org/r/850542 [20:27:59] (03PS2) 10Dzahn: devtools: add profile::phabricator::main::dumps_rsync_clients: [] [puppet] - 10https://gerrit.wikimedia.org/r/850542 (https://phabricator.wikimedia.org/T313360) [20:28:07] (03PS1) 10Andrew Bogott: cinder-backup: apply backup api patch to cloudcontrol as well as backup server [puppet] - 10https://gerrit.wikimedia.org/r/850543 [20:30:48] (03CR) 10Dzahn: [C: 03+2] devtools: add profile::phabricator::main::dumps_rsync_clients: [] [puppet] - 10https://gerrit.wikimedia.org/r/850542 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [20:31:48] (03PS2) 10Andrew Bogott: cinder-backup: apply backup api patch to cloudcontrol as well as backup server [puppet] - 10https://gerrit.wikimedia.org/r/850543 [20:32:40] (03CR) 10Dzahn: [C: 03+2] dumps: switch kiwix download host to master.download.kiwix.org [puppet] - 10https://gerrit.wikimedia.org/r/848441 
(https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [20:33:45] (03PS3) 10Andrew Bogott: cinder-backup: apply backup api patch to cloudcontrol as well as backup server [puppet] - 10https://gerrit.wikimedia.org/r/850543 [20:34:30] (03PS1) 10JHathaway: aux-k8s: dummy certs & tokens [labs/private] - 10https://gerrit.wikimedia.org/r/850545 [20:35:10] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1001/37848/" [puppet] - 10https://gerrit.wikimedia.org/r/850543 (owner: 10Andrew Bogott) [20:35:14] (03CR) 10Dzahn: [C: 03+2] "/usr/local/bin/kiwix-rsync-cron.sh has been edited by puppet on clouddumps1001" [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [20:35:24] (03CR) 10Dzahn: [C: 03+2] "@nskaggs: noteworthy on clouddumps1001: Notice: /Stage[main]/Ferm/Service[ferm]/ensure: ensure changed 'stopped' to 'running' (corrective)" [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [20:37:55] !log clouddumps1001 - puppet run after merging gerrit:848441 for kiwix, changed ferm status from "stopped" to "running". manually ran 'sudo systemctl start kiwix-mirror-update' T57503 [20:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:05] T57503: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 [20:38:38] (03CR) 10JHathaway: [C: 03+2] aux-k8s: dummy certs & tokens [labs/private] - 10https://gerrit.wikimedia.org/r/850545 (owner: 10JHathaway) [20:38:41] (03CR) 10JHathaway: [V: 03+2 C: 03+2] aux-k8s: dummy certs & tokens [labs/private] - 10https://gerrit.wikimedia.org/r/850545 (owner: 10JHathaway) [20:39:18] (03CR) 10Dzahn: [C: 03+2] "manually started the dumps service. 
saw no problems: Main PID: 2688062 (code=exited, status=0/SUCCESS)" [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [20:39:51] (03PS3) 10Dzahn: dumps: add sister projects to kiwix dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) [20:40:31] (03CR) 10Dzahn: [C: 03+2] dumps: add sister projects to kiwix dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [20:42:39] (03PS1) 10JHathaway: aux-k8s: ctrl & wrkr roles [puppet] - 10https://gerrit.wikimedia.org/r/850586 (https://phabricator.wikimedia.org/T321120) [20:42:50] !log clouddumps* - deployed gerrit:848444 - as kind of expected it fails - most likely the project dirs are not automatically created before rsync runs the first time - T57503 [20:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:56] (03PS2) 10JHathaway: aux-k8s: ctrl & wrkr roles [puppet] - 10https://gerrit.wikimedia.org/r/850586 (https://phabricator.wikimedia.org/T321120) [20:44:00] (03CR) 10Dzahn: [C: 03+2] "starting the service manually after this change fails - most likely because the project base dirs are not created automatically before rsy" [puppet] - 10https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [20:45:10] (03PS4) 10Andrew Bogott: cinder-backup: apply backup api patch to cloudcontrol as well as backup server [puppet] - 10https://gerrit.wikimedia.org/r/850543 (https://phabricator.wikimedia.org/T321948) [20:45:15] (03CR) 10Dzahn: [C: 03+2] "eh..no..it's: kiwix-rsync-cron.sh: line 58: $: syntax error: operand expected (error" [puppet] - 10https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [20:45:40] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/850586 
(https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [20:45:44] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:51:32] (03CR) 10Nskaggs: [C: 03+1] cinder-backup: apply backup api patch to cloudcontrol as well as backup server [puppet] - 10https://gerrit.wikimedia.org/r/850543 (https://phabricator.wikimedia.org/T321948) (owner: 10Andrew Bogott) [20:57:56] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [20:59:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:03:16] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:04:47] (03PS1) 10Dzahn: dumps: fix syntax error in kiwix-rsync-cron.sh [puppet] - 10https://gerrit.wikimedia.org/r/850588 (https://phabricator.wikimedia.org/T57503) [21:05:51] (03CR) 10Dzahn: [C: 03+2] dumps: fix syntax error in kiwix-rsync-cron.sh [puppet] - 10https://gerrit.wikimedia.org/r/850588 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [21:06:13] mutante: I'm lurking again [21:10:12] (03CR) 10JHathaway: [C: 03+2] aux-k8s: ctrl & wrkr roles [puppet] - 10https://gerrit.wikimedia.org/r/850586 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [21:12:38] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10Patch-For-Review, and 2 others: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Dzahn) @Kelson @nskaggs sync in progress ^ [21:13:25] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10Patch-For-Review, and 2 others: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Dzahn) I deployed the changes above, a little bugfix follow-up, started the sync service manually. 
actual command now running on c... [21:16:55] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "thanks! https://puppet-compiler.wmflabs.org/pcc-worker1001/37850/registry2003.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/850153 (owner: 10Jbond) [21:18:49] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed on registry1003/registry2003" [puppet] - 10https://gerrit.wikimedia.org/r/850153 (owner: 10Jbond) [21:19:01] (03PS10) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) [21:21:03] (03CR) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [21:25:06] (03CR) 10Dzahn: "oh.. so.. the gitlab_runners are IPs and the contint hosts are DNS names. so the code is " @resolve((10.64.16.105 10.64.32.184 10.64.48.14" [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [21:26:46] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:27:28] (03PS1) 10JHathaway: aux-k8s: partman config for aux-k8s-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/850590 (https://phabricator.wikimedia.org/T321137) [21:27:53] (03CR) 10Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [21:28:06] (03CR) 10Dzahn: [C: 03+2] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [21:29:27] (03CR) 10JHathaway: [C: 03+2] aux-k8s: partman config for aux-k8s-ctrl [puppet] - 
10https://gerrit.wikimedia.org/r/850590 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [21:30:19] (03CR) 10Dzahn: [C: 03+2] "deployed on doc1002, ferm was restarted (noop on doc2001)" [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [21:30:58] dancy: iptables and rsyncd config should now allow gitlab-runners in addition to contint*. done [21:31:12] Thank you! I'll try to test now [21:31:24] it mixed IPs and host names and put an "resolve" around it.. but it did not seem to be a problem to do that [21:31:46] also mixed IP and host names in 'hosts allow' of rsyncd [21:33:04] (03CR) 10Dzahn: [C: 03+2] doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [21:36:05] (03PS1) 10JHathaway: aux-k8s: dhcp config for aux-k8s-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/850591 (https://phabricator.wikimedia.org/T321137) [21:37:24] (03CR) 10JHathaway: [C: 03+2] aux-k8s: dhcp config for aux-k8s-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/850591 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [21:39:10] (03PS1) 10Andrew Bogott: add wmcs-securitygroup-backfill [puppet] - 10https://gerrit.wikimedia.org/r/850592 (https://phabricator.wikimedia.org/T288108) [21:40:12] dancy: thanks. it's doc1002 (not 1001 or 200x) fwiw [21:40:59] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37854/cloudcontrol1005.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/850592 (https://phabricator.wikimedia.org/T288108) (owner: 10Andrew Bogott) [21:43:12] mutante: I've run out of time. I'll verify on Monday. [21:43:27] dancy: alright, good weekend! 
[21:47:51] (03PS1) 10Dzahn: ci: move list of contint and zuul hosts to hierdata/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/850593 [21:48:09] (03PS2) 10Dzahn: ci: move list of contint and zuul hosts to hieradata/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/850593 [21:48:35] You too! [21:48:50] (03CR) 10Dzahn: [C: 03+2] doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [21:50:28] (03CR) 10CI reject: [V: 04-1] ci: move list of contint and zuul hosts to hieradata/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [21:51:14] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:52:56] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:56:31] (03PS1) 10JHathaway: aux-k8s: fix typo in pool :( [puppet] - 10https://gerrit.wikimedia.org/r/850595 (https://phabricator.wikimedia.org/T321137) [21:57:13] (03CR) 10JHathaway: [C: 03+2] aux-k8s: fix typo in pool :( [puppet] - 10https://gerrit.wikimedia.org/r/850595 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [22:01:11] bash: rsync: command not found <-- this would be a problem for rsyncing. 
[22:04:36] (03PS1) 10Dzahn: gitlab::runner: install rsync package [puppet] - 10https://gerrit.wikimedia.org/r/850597 (https://phabricator.wikimedia.org/T321629) [22:06:37] (03CR) 10Dzahn: [C: 04-1] "parameter 'contint_hosts' expects an Array value, got String" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [22:07:32] (03CR) 10Dzahn: [C: 04-1] "shouldn't the data type stay the same when I simply move things around?" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [22:13:06] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10Dzahn) nice, works for me. thanks @BTullis [22:20:47] (03PS1) 10JHathaway: Revert "aux-k8s: fix typo in pool :(" [puppet] - 10https://gerrit.wikimedia.org/r/850568 [22:23:05] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10Dzahn) @fgiunchedi I think both would be fine, either just don't worry about the duplicate part. I don't see it as a big problem. Or follow the sugg... 
[22:23:20] (03CR) 10JHathaway: [C: 03+2] Revert "aux-k8s: fix typo in pool :(" [puppet] - 10https://gerrit.wikimedia.org/r/850568 (owner: 10JHathaway) [22:33:47] (03PS1) 10JHathaway: aux-k8s: disable lvs [puppet] - 10https://gerrit.wikimedia.org/r/850604 (https://phabricator.wikimedia.org/T321120) [22:34:42] (03CR) 10JHathaway: [C: 03+2] aux-k8s: disable lvs [puppet] - 10https://gerrit.wikimedia.org/r/850604 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [22:53:45] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:53:58] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:53:58] (KubernetesRsyslogDown) firing: rsyslog on aux-k8s-ctrl1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-ctrl1002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:58:10] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM, 10cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Dzahn) see https://dumps.wikimedia.org/kiwix/zim/ now [22:58:46] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM, 10cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Dzahn) 05Open→03In progress [22:58:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - 
https://phabricator.wikimedia.org/T302981 (10Dzahn) [22:58:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:59:05] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM, 10cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Dzahn) a:03Dzahn [23:00:55] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:22:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient