[00:02:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T318605)', diff saved to https://phabricator.wikimedia.org/P40024 and previous config saved to /var/cache/conftool/dbconfig/20221117-000215-ladsgroup.json [00:02:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [00:02:21] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [00:02:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [00:02:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T318605)', diff saved to https://phabricator.wikimedia.org/P40025 and previous config saved to /var/cache/conftool/dbconfig/20221117-000236-ladsgroup.json [00:13:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T323214)', diff saved to https://phabricator.wikimedia.org/P40026 and previous config saved to /var/cache/conftool/dbconfig/20221117-001348-ladsgroup.json [00:13:54] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [00:27:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:27:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [00:28:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [00:28:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T323214)', diff saved to 
https://phabricator.wikimedia.org/P40027 and previous config saved to /var/cache/conftool/dbconfig/20221117-002818-ladsgroup.json [00:28:29] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [00:28:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P40028 and previous config saved to /var/cache/conftool/dbconfig/20221117-002854-ladsgroup.json [00:32:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:37:41] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:39:46] (PS6) Andrea Denisse: netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) [00:41:28] (CR) Andrea Denisse: [C: +2] "Auto approving because the previous approvals were lost after rebasing." [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [00:42:49] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:39] (PS10) Andrea Denisse: netmon: Open LibreNMS port for netmon2002. 
[puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [00:44:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P40029 and previous config saved to /var/cache/conftool/dbconfig/20221117-004400-ladsgroup.json [00:45:48] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38268/console" [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [00:48:39] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:59:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T323214)', diff saved to https://phabricator.wikimedia.org/P40030 and previous config saved to /var/cache/conftool/dbconfig/20221117-005907-ladsgroup.json [00:59:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [00:59:13] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [00:59:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [00:59:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T323214)', diff saved to https://phabricator.wikimedia.org/P40031 and previous config saved to /var/cache/conftool/dbconfig/20221117-005929-ladsgroup.json [01:14:57] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:34:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T318605)', diff saved to 
https://phabricator.wikimedia.org/P40032 and previous config saved to /var/cache/conftool/dbconfig/20221117-013454-ladsgroup.json [01:34:59] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [01:36:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40033 and previous config saved to /var/cache/conftool/dbconfig/20221117-013611-ladsgroup.json [01:36:18] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [01:37:39] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:45] (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:34] (PS8) Stang: Update Wikipedia icons to SVG format [mediawiki-config] - https://gerrit.wikimedia.org/r/788892 (https://phabricator.wikimedia.org/T279645) [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:35] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:50:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P40034 and previous config saved to /var/cache/conftool/dbconfig/20221117-015000-ladsgroup.json 
[01:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P40035 and previous config saved to /var/cache/conftool/dbconfig/20221117-015118-ladsgroup.json [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:55:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:02:42] (PS9) Stang: Update Wikipedia icons to SVG format [mediawiki-config] - https://gerrit.wikimedia.org/r/788892 (https://phabricator.wikimedia.org/T279645) [02:03:31] SRE, ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (Papaul) [02:04:47] (PS10) Stang: Update Wikipedia icons to SVG format [mediawiki-config] - https://gerrit.wikimedia.org/r/788892 (https://phabricator.wikimedia.org/T279645) [02:05:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P40036 and previous config saved to /var/cache/conftool/dbconfig/20221117-020507-ladsgroup.json [02:06:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P40037 and previous config saved to /var/cache/conftool/dbconfig/20221117-020624-ladsgroup.json [02:07:45] (JobUnavailable) firing: (10) Reduced availability 
for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T323214)', diff saved to https://phabricator.wikimedia.org/P40038 and previous config saved to /var/cache/conftool/dbconfig/20221117-020953-ladsgroup.json [02:09:59] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [02:13:07] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:01] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T318605)', diff saved to https://phabricator.wikimedia.org/P40039 and previous config saved to /var/cache/conftool/dbconfig/20221117-022013-ladsgroup.json [02:20:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [02:20:19] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:20:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [02:21:33] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1105:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40040 and previous config saved to /var/cache/conftool/dbconfig/20221117-022131-ladsgroup.json [02:21:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [02:21:37] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [02:21:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [02:21:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T323214)', diff saved to https://phabricator.wikimedia.org/P40041 and previous config saved to /var/cache/conftool/dbconfig/20221117-022153-ladsgroup.json [02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P40042 and previous config saved to /var/cache/conftool/dbconfig/20221117-022500-ladsgroup.json [02:26:54] (CR) Ssingh: [C: +1] Update check_fresh_files_in_dir for python3 [puppet] - https://gerrit.wikimedia.org/r/857623 (https://phabricator.wikimedia.org/T321309) (owner: BBlack) [02:40:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P40043 and previous config saved to /var/cache/conftool/dbconfig/20221117-024006-ladsgroup.json [02:52:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T323214)', diff saved to https://phabricator.wikimedia.org/P40044 and previous 
config saved to /var/cache/conftool/dbconfig/20221117-025250-ladsgroup.json [02:52:56] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [02:55:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T323214)', diff saved to https://phabricator.wikimedia.org/P40045 and previous config saved to /var/cache/conftool/dbconfig/20221117-025513-ladsgroup.json [02:55:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [02:55:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [02:55:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [02:55:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [02:55:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T323214)', diff saved to https://phabricator.wikimedia.org/P40046 and previous config saved to /var/cache/conftool/dbconfig/20221117-025549-ladsgroup.json [03:07:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P40047 and previous config saved to /var/cache/conftool/dbconfig/20221117-030757-ladsgroup.json [03:11:07] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:23:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to 
https://phabricator.wikimedia.org/P40048 and previous config saved to /var/cache/conftool/dbconfig/20221117-032303-ladsgroup.json [03:25:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T323214)', diff saved to https://phabricator.wikimedia.org/P40049 and previous config saved to /var/cache/conftool/dbconfig/20221117-032555-ladsgroup.json [03:26:01] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [03:35:52] (PS1) PipelineBot: apple-search: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/857063 [03:38:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T323214)', diff saved to https://phabricator.wikimedia.org/P40050 and previous config saved to /var/cache/conftool/dbconfig/20221117-033810-ladsgroup.json [03:38:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [03:38:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [03:38:16] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [03:41:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P40051 and previous config saved to /var/cache/conftool/dbconfig/20221117-034102-ladsgroup.json [03:46:51] SRE, LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (Abhas) Hi Jaime, Thank you so much! I have already completed #1. Could I please request you to disable the other Phab account (Username: AbhasT)? 
[03:56:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P40052 and previous config saved to /var/cache/conftool/dbconfig/20221117-035609-ladsgroup.json [04:06:02] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:11:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T323214)', diff saved to https://phabricator.wikimedia.org/P40053 and previous config saved to /var/cache/conftool/dbconfig/20221117-041115-ladsgroup.json [04:11:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [04:11:21] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [04:11:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [04:11:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40054 and previous config saved to /var/cache/conftool/dbconfig/20221117-041137-ladsgroup.json [04:30:35] (PS11) Vgutierrez: varnish: Generate a DP subkey daily [puppet] - https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [04:32:28] (CR) Vgutierrez: "Thanks for your review volans" [puppet] - https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: Vgutierrez) [04:33:03] (CR) CI reject: [V: -1] varnish: Generate a DP subkey daily [puppet] - https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: Vgutierrez) [04:34:03] (PS12) Vgutierrez: varnish: Generate a DP subkey daily [puppet] - https://gerrit.wikimedia.org/r/857748 
(https://phabricator.wikimedia.org/T315676) [04:35:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [04:35:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [04:35:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40055 and previous config saved to /var/cache/conftool/dbconfig/20221117-043542-ladsgroup.json [04:35:47] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [04:35:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:37:04] (CR) Vgutierrez: [C: +1] "nice! 
LGTM" [puppet] - https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [04:38:57] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:40:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:43:25] SRE, Traffic: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (Vgutierrez) Open→Resolved a:Vgutierrez THP has been disabled globally as a result of this task with https://gerrit.wikimedia.org/r/857686. A rolling restart has been performed to apply this change: ` vgu... 
[04:47:57] (CR) Vgutierrez: Varnish analytics: support differential privacy (3 comments) [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [04:48:40] (CR) Vgutierrez: Varnish analytics: support differential privacy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [04:55:06] (PS23) Vgutierrez: Varnish analytics: support differential privacy [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [05:04:07] (CR) Vgutierrez: [C: +1] "LGTM (this check is currently used exclusively on cp instances and all of them have python3 available)" [puppet] - https://gerrit.wikimedia.org/r/857623 (https://phabricator.wikimedia.org/T321309) (owner: BBlack) [05:13:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40056 and previous config saved to /var/cache/conftool/dbconfig/20221117-051357-ladsgroup.json [05:14:02] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [05:28:37] SRE, Traffic-Icebox: Create dashboard showing aggregate data transfer rates per DC/cluster - https://phabricator.wikimedia.org/T284304 (Vgutierrez) nice work @BCornwall. Current version looks good, if you allow me a small nitpick, we got a small inconsistency between labels on Varnish panels VS HAProxy/A... [05:29:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P40057 and previous config saved to /var/cache/conftool/dbconfig/20221117-052903-ladsgroup.json [05:30:24] Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (Joe) >>! 
In T321874#8400580, @bking wrote: >>>! In T321874#8399960, @jhathaway wrote: >>> How would this be different under Ansible? >>> >>> * I could render the template... [05:31:09] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:36:22] Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (Joe) @bking I've seen others already pointed you to our tools to help your puppet workflow, but if you want to enhance your productivity using puppet, I'm sure both me and... [05:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40058 and previous config saved to /var/cache/conftool/dbconfig/20221117-054045-ladsgroup.json [05:40:50] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [05:44:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P40059 and previous config saved to /var/cache/conftool/dbconfig/20221117-054409-ladsgroup.json [05:55:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P40060 and previous config saved to /var/cache/conftool/dbconfig/20221117-055551-ladsgroup.json [05:59:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40061 and previous config saved to /var/cache/conftool/dbconfig/20221117-055916-ladsgroup.json [05:59:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [05:59:22] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - 
https://phabricator.wikimedia.org/T323214 [05:59:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [05:59:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T323214)', diff saved to https://phabricator.wikimedia.org/P40062 and previous config saved to /var/cache/conftool/dbconfig/20221117-055938-ladsgroup.json [06:10:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P40063 and previous config saved to /var/cache/conftool/dbconfig/20221117-061058-ladsgroup.json [06:15:11] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:25:01] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40064 and previous config saved to /var/cache/conftool/dbconfig/20221117-062604-ladsgroup.json [06:26:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [06:26:10] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [06:26:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [06:26:22] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:26:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:26:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T323214)', diff saved to https://phabricator.wikimedia.org/P40065 and previous config saved to /var/cache/conftool/dbconfig/20221117-062643-ladsgroup.json [06:32:13] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:31] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:45:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:04] kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T0700). 
[07:02:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T323214)', diff saved to https://phabricator.wikimedia.org/P40066 and previous config saved to /var/cache/conftool/dbconfig/20221117-070202-ladsgroup.json [07:02:08] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [07:10:09] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P40067 and previous config saved to /var/cache/conftool/dbconfig/20221117-071708-ladsgroup.json [07:18:36] (CR) Giuseppe Lavagetto: [C: +1] apple-search: Switch lvs state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert) [07:19:04] (CR) Giuseppe Lavagetto: [C: +1] apple-search: Remove service from lb and backend [puppet] - https://gerrit.wikimedia.org/r/857691 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert) [07:19:34] (CR) Giuseppe Lavagetto: [C: +1] "Please add a followup patch to remove data from conftool as well" [puppet] - https://gerrit.wikimedia.org/r/857706 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert) [07:21:58] (CR) Giuseppe Lavagetto: [V: +1 C: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38269/console" [puppet] - https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert) [07:24:05] (CR) Giuseppe Lavagetto: [C: +1] "merging this patch will remove the DNS discovery state files from the dns servers:" [puppet] - https://gerrit.wikimedia.org/r/852210 (https://phabricator.wikimedia.org/T316296) (owner: Clément 
Goubert) [07:25:35] (CR) Giuseppe Lavagetto: [C: -1] apple-search: Remove DNS records (2 comments) [dns] - https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: Clément Goubert) [07:28:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T323214)', diff saved to https://phabricator.wikimedia.org/P40068 and previous config saved to /var/cache/conftool/dbconfig/20221117-072832-ladsgroup.json [07:28:38] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [07:32:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P40069 and previous config saved to /var/cache/conftool/dbconfig/20221117-073215-ladsgroup.json [07:36:49] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:43:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P40070 and previous config saved to /var/cache/conftool/dbconfig/20221117-074339-ladsgroup.json [07:47:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T323214)', diff saved to https://phabricator.wikimedia.org/P40071 and previous config saved to /var/cache/conftool/dbconfig/20221117-074721-ladsgroup.json [07:47:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [07:47:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [07:47:27] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [07:47:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db2170:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40073 and previous config saved to /var/cache/conftool/dbconfig/20221117-074732-ladsgroup.json [07:47:55] !log restart kube-apiserver on ml-serve-ctrl2002 - high LIST latencies for knative, attempt to clear them out [07:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P40074 and previous config saved to /var/cache/conftool/dbconfig/20221117-075845-ladsgroup.json [07:59:13] (03PS5) 10Giuseppe Lavagetto: Add rake task to perform basic conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/855668 [07:59:15] (03PS3) 10Giuseppe Lavagetto: apertium: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856945 [08:00:04] Amir1, apergos, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T0800). [08:00:14] morning! there are no trainees signed up for the window, and no patches scheduled for deployment either. [08:00:38] so.... see y'all next time! [08:11:10] (03CR) 10Filippo Giunchedi: "On balance I think the current approach of webserver redirect is more maintainable, unless there are significant drawbacks I'm missing?" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [08:13:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T323214)', diff saved to https://phabricator.wikimedia.org/P40075 and previous config saved to /var/cache/conftool/dbconfig/20221117-081352-ladsgroup.json [08:13:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [08:13:57] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [08:14:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [08:14:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T323214)', diff saved to https://phabricator.wikimedia.org/P40076 and previous config saved to /var/cache/conftool/dbconfig/20221117-081413-ladsgroup.json [08:14:41] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:21:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd2002.codfw.wmnet to drbd [08:22:12] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10ops-monitoring-bot) VM kubestagetcd2002.codfw.wmnet switching disk type to drbd [08:23:07] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:25:38] (03CR) 10Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [08:25:46] (03CR) 10Arturo Borrero Gonzalez: ceph: osd: 
introduce support for single NIC setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [08:31:40] (03PS6) 10Giuseppe Lavagetto: Add rake task to perform basic conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/855668 [08:31:42] (03PS4) 10Giuseppe Lavagetto: apertium: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856945 [08:31:44] (03PS2) 10Giuseppe Lavagetto: api-gateway: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856950 [08:31:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd2002.codfw.wmnet to drbd [08:32:09] PROBLEM - Host kubestagetcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:13] RECOVERY - Host kubestagetcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.53 ms [08:33:24] (03PS1) 10Jelto: gitlab_runner: make one Shared Runner canary [puppet] - 10https://gerrit.wikimedia.org/r/858188 [08:34:06] (03CR) 10David Caro: ceph: osd: introduce support for single NIC setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [08:34:08] (03PS2) 10Jelto: gitlab_runner: make one Shared Runner canary [puppet] - 10https://gerrit.wikimedia.org/r/858188 [08:34:10] (03CR) 10Filippo Giunchedi: [C: 03+2] "This actually made puppet fail on pki-intermediate.pki.eqiad1.wikimedia.cloud though I'm confused as to why:" [puppet] - 10https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [08:37:10] (03CR) 10Jelto: "We added this ad-hoc yesterday while testing. It would be nice if we can address one of the Shared Runners explicitly with a special tag. 
" [puppet] - 10https://gerrit.wikimedia.org/r/858188 (owner: 10Jelto) [08:43:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T323214)', diff saved to https://phabricator.wikimedia.org/P40078 and previous config saved to /var/cache/conftool/dbconfig/20221117-084321-ladsgroup.json [08:43:27] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [08:47:39] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:50:15] !log draining ganeti1019 for eventual reimage T311687 [08:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:20] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [08:51:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40079 and previous config saved to /var/cache/conftool/dbconfig/20221117-085108-ladsgroup.json [08:51:13] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [08:53:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/857756 (owner: 10Slyngshede) [08:54:52] (03CR) 10David Caro: [C: 03+1] ceph: osd: introduce support for single NIC setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [08:55:08] !log krinkle@deploy1002 Started deploy [integration/docroot@de83506]: (no justification provided) [08:55:48] !log krinkle@deploy1002 Finished deploy 
[integration/docroot@de83506]: (no justification provided) (duration: 00m 39s) [08:58:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P40080 and previous config saved to /var/cache/conftool/dbconfig/20221117-085828-ladsgroup.json [08:58:35] I am going to upgrade Gerrit [09:00:04] hashar: Dear deployers, time to do the Gerrit 3.5 upgrade deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T0900). [09:00:11] (03PS1) 10Elukey: benthos: add snappy compression to kafka output settings for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/858191 (https://phabricator.wikimedia.org/T319214) [09:01:22] (03CR) 10Slyngshede: [C: 03+2] If bug in configuration parser. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/857756 (owner: 10Slyngshede) [09:01:24] (03CR) 10Filippo Giunchedi: [C: 03+1] benthos: add snappy compression to kafka output settings for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/858191 (https://phabricator.wikimedia.org/T319214) (owner: 10Elukey) [09:01:33] (03CR) 10Slyngshede: [V: 03+2] If bug in configuration parser. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/857756 (owner: 10Slyngshede) [09:01:35] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] If bug in configuration parser. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/857756 (owner: 10Slyngshede) [09:01:48] hashar: do you have https://xkcd.com/303/ for "Gerrit is being upgraded"? 
[09:02:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd2002.codfw.wmnet to plain [09:02:40] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10ops-monitoring-bot) VM kubestagetcd2002.codfw.wmnet switching disk type to plain [09:02:42] (03CR) 10Elukey: [C: 03+2] benthos: add snappy compression to kafka output settings for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/858191 (https://phabricator.wikimedia.org/T319214) (owner: 10Elukey) [09:02:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd2002.codfw.wmnet to plain [09:03:22] vgutierrez: it is more like https://xkcd.com/2217/ :D [09:03:23] 10SRE, 10ops-codfw: Broken disk on ganeti2013 - https://phabricator.wikimedia.org/T323220 (10MoritzMuehlenhoff) >>! In T323220#8400891, @Dzahn wrote: > possibly duplicate of automatically generated T323222 Indeed, thanks. Merging. [09:04:20] !log hashar@deploy1002 Started deploy [gerrit/gerrit@39d9f06]: Gerrit to 3.5.4 on gerrit2002 [09:04:30] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@39d9f06]: Gerrit to 3.5.4 on gerrit2002 (duration: 00m 10s) [09:05:59] (03PS2) 10Vgutierrez: monitoring: Update check_fresh_files_in_dir for python3 [puppet] - 10https://gerrit.wikimedia.org/r/857623 (https://phabricator.wikimedia.org/T321309) (owner: 10BBlack) [09:06:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P40081 and previous config saved to /var/cache/conftool/dbconfig/20221117-090615-ladsgroup.json [09:07:13] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10MoritzMuehlenhoff) The server can be taken down for troubleshooting anytime, I removed it from active service. I saw kernel messages on the console pointing to a broken /dev/sdc. 
I realise the server is out of warra... [09:07:56] !log Bringing back Gerrit on gerrit2002 [09:07:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:03] PROBLEM - ganeti-noded running on ganeti2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:09:22] !log Upgrading Gerrit primary instance [09:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:01] PROBLEM - ganeti-mond running on ganeti2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [09:10:19] PROBLEM - ganeti-confd running on ganeti2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [09:10:56] !log hashar@deploy1002 Started deploy [gerrit/gerrit@39d9f06]: Gerrit to 3.5.4 on gerrit1001 [09:11:04] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@39d9f06]: Gerrit to 3.5.4 on gerrit1001 (duration: 00m 08s) [09:12:38] !log Bringing back primary Gerrit on gerrit1001 [09:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:13:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P40082 
and previous config saved to /var/cache/conftool/dbconfig/20221117-091334-ladsgroup.json [09:13:55] (03PS1) 10Giuseppe Lavagetto: blubberoid: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/858206 [09:14:33] Gerrit is back [09:16:10] hashar: yep, but it isn't working as expected at least here [09:16:25] I'm just getting header/footer in https://gerrit.wikimedia.org/r/dashboard/self [09:16:30] no content at all [09:16:58] same for https://gerrit.wikimedia.org/r/q/status:open+-is:wip [09:17:07] even after manually closing my session [09:17:35] <_joe_> jouncebot: now and next [09:17:35] For the next 0 hour(s) and 42 minute(s): Gerrit 3.5 upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T0900) [09:17:37] vgutierrez: it's working for me (but I had to force a reload, refresh browser cache) [09:17:39] fixed after opening a new session... [09:17:45] yeah, caching issues [09:17:47] vgutierrez: that might be a cache issue [09:18:09] lot of the UI is changed and I guess some javascript bits might be cached in the browser [09:18:17] <_joe_> hashar: can you let me know when you're 100% done? 
[09:18:28] I have tried with an incognito window in a different server while logged in and the UI seems to work [09:18:29] (03CR) 10Vgutierrez: [C: 03+2] monitoring: Update check_fresh_files_in_dir for python3 [puppet] - 10https://gerrit.wikimedia.org/r/857623 (https://phabricator.wikimedia.org/T321309) (owner: 10BBlack) [09:18:29] <_joe_> jnuche and I would need to run some tests [09:18:49] _joe_: it is upgraded now I have to check various things and metrics [09:18:52] <_joe_> I didn't need to reload FWIW [09:19:01] yeah I haven't had to reload either [09:19:01] <_joe_> hashar: it's ok, just let me know when you're done [09:19:39] (03CR) 10Clément Goubert: apple-search: Remove DNS records (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [09:19:41] (03PS8) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) [09:20:34] (03CR) 10CI reject: [V: 04-1] apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [09:20:39] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:20:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet [09:21:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P40083 and previous config saved to /var/cache/conftool/dbconfig/20221117-092121-ladsgroup.json [09:21:22] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) > Could I please request you to disable the other Phab account (Username: AbhasT)? 
Just to make sure I am doing the right thing, I believe you mean to request disabling @Atripathi, right? T... [09:21:33] Oh, woops, wasn´t supposed to use gerrit? [09:21:38] Because it works for me rn :P [09:21:57] it is back yes ;) I am merely checking various metrics [09:26:20] (03PS9) 10Clément Goubert: apple-search: Remove DNS records [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) [09:27:12] (03CR) 10Clément Goubert: apple-search: Remove DNS records (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/852208 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [09:27:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet [09:28:10] (03CR) 10Clément Goubert: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/858207 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [09:28:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T323214)', diff saved to https://phabricator.wikimedia.org/P40084 and previous config saved to /var/cache/conftool/dbconfig/20221117-092841-ladsgroup.json [09:28:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:28:46] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [09:28:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:29:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40085 and previous config saved to /var/cache/conftool/dbconfig/20221117-092902-ladsgroup.json [09:29:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet [09:31:19] PROBLEM - Host ml-etcd2003 is 
DOWN: PING CRITICAL - Packet loss = 100% [09:31:22] ^ expected due to ganeti2016 reboot [09:31:27] PROBLEM - Host kubestagetcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [09:35:29] RECOVERY - Host kubestagetcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms [09:35:31] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.29 ms [09:36:11] _joe_: I think Gerrit is good. It is still reindexing every single change though, which will take an additional 20 minutes or so ( https://grafana.wikimedia.org/d/Zh_ncGsWk/queues-upstream?orgId=1&refresh=1m&viewPanel=18&from=now-1h&to=now ) [09:36:23] so maybe some search queries might not yield the expected output [09:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40086 and previous config saved to /var/cache/conftool/dbconfig/20221117-093628-ladsgroup.json [09:36:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [09:36:33] hashar: when I do any write action (submit +2, rebase, remove +2 vote) I get a modal error 500 Internal Server Error [09:36:34] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [09:36:39] but then when I refresh, it actually goes through [09:36:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [09:36:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T323214)', diff saved to https://phabricator.wikimedia.org/P40087 and previous config saved to /var/cache/conftool/dbconfig/20221117-093650-ladsgroup.json [09:36:55] ah good point, I haven't checked the Gerrit error log yet [09:37:39] looks like it fails somewhere between recording the +2 and triggering events because CI isn't running in response to +2 [09:37:43] e.g. 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/701336 [09:37:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet [09:37:58] the +2 comment is there but it actually got a 500 err when I submitted it [09:38:10] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10Abhas) Yes, I meant delete user @Atripathi and the LDAP username associated with it. I will use @Abhas here. [09:38:30] leaving any comment yields HTTP 500 [09:38:42] ah [09:38:53] _joe_: so well it is not working :`\ [09:39:19] <_joe_> hashar: ok, we'll hold our horses [09:40:04] Caused by: java.io.IOException: No space left on device [09:40:06] what the hell [09:40:19] `/` is full [09:41:36] Comments, editing a commit message through web edit, anything involving writes from the UI apparently [09:42:47] ah well if the disk's full that'd explain it [09:42:55] hashar: /var/lib/gerrit2/review_site/cache is 17G, not sure why it is on / [09:42:59] !log Cleaning gerrit1001.wikimedia.org `/` partition [09:42:59] instead of /srv [09:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:30] yup we should probably move the cache to the /srv partition [09:45:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet [09:47:43] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [09:48:14] <_joe_> vgutierrez: case in point ^^ [09:48:24] <_joe_> the rps in codfw for POST is like 4 rps [09:48:28] <_joe_> and it flaps [09:49:00] 
there is some lock being held on the gerrit_file_diff disk cache org.h2.jdbc.JdbcSQLException: Timeout trying to lock table "DATA"; SQL statement: [09:49:04] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Atripathi - https://phabricator.wikimedia.org/T323207 (10jcrespo) Thank you for bearing with me! Old account is now disabled on Phabricator and everywhere else. Further requests should go much smoother! Sorry for the complications. Have a nice day and enjo... [09:50:47] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [09:52:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet [09:52:21] I am stopping gerrit to flush that lock [09:53:14] ack [09:54:03] the full disk on `/` clearly did not help :-\ [09:55:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet [09:56:27] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1532 bytes in 0.006 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:56:30] !log Stopped Gerrit and running offline reindexing [09:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:41] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:53] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:11] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1532 too small - 1532 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:57:19] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:24] I swear I have disabled monitoring for gerrit1001/gerrit2002 [09:57:29] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:17] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:19] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:45] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:45] (JobUnavailable) firing: (3) Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:59:47] 33% done [10:02:05] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:45] (JobUnavailable) firing: (4) 
Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:03:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet [10:04:59] 72% done [10:06:05] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01136 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:06:09] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:28] I am pretty sure all those alarms are due to units/processes trying to reach out to gerrit [10:07:18] I'd say so too yeah [10:07:51] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:26] Yep, they're all git pulls of some sort [10:09:01] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [10:09:34] <_joe_> uhm [10:09:58] That one's not though [10:10:10] indeed not [10:10:26] I ack it to avoid escalation [10:10:53] already done by Joe :) [10:12:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet [10:13:55] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:07] PROBLEM - Host kubestagetcd2001 is DOWN: PING CRITICAL - Packet loss = 100% 
[10:16:31] ^ expected due to ganeti2019 reboot [10:17:14] Gerrit reindexing is 80% done [10:19:01] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [10:19:43] !log gerrit1001: removed 5G of 2019's thread dumps in `/srv/home-cobalt.wikimedia.org/thcipriani/threaddumps` [10:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:03] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet [10:20:27] RECOVERY - Host kubestagetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 31.94 ms [10:20:30] !log installing gnutls28 security updates on Buster [10:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:21] it is still reindexing 684k out of 847k changes [10:23:59] hashar: I sent a "stand-by" email to wikitech-l btw [10:24:11] thank you! 
[10:24:44] I think everything boils down to `/` being filled [10:24:58] I will file some tasks to clear up disk and relocate data to `/srv` [10:24:59] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:25:55] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:55] !log pool ats-be@cp2042 [10:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:17] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40089 and previous config saved to /var/cache/conftool/dbconfig/20221117-103153-ladsgroup.json [10:31:58] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [10:33:05] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:25] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:36:07] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:01] RECOVERY - Check 
systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:41:25] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:53] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:00] I am tempted to stop the full reindexing and restart gerrit so at least people can send changes and fetch [10:43:09] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:26] it is taking a while (currently at 83% or 709k / 847k changes) [10:43:55] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:40] !log restarting apache/FPM on mw canaries to pick up gnutls security updates [10:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:28] fascinating to know that gerrit down means ~3% of the fleet fails puppet [10:46:52] I'm looking at https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&from=now-1h&to=now [10:47:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P40090 and previous config saved to /var/cache/conftool/dbconfig/20221117-104659-ladsgroup.json [10:47:17] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:29] <_joe_> godog: git::clone :) [10:48:25] 
indeed [10:49:10] I guess next time there is a full reindex needed, I will do the upgrade over the week-end [10:49:12] :-\ [10:52:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T323214)', diff saved to https://phabricator.wikimedia.org/P40091 and previous config saved to /var/cache/conftool/dbconfig/20221117-105254-ladsgroup.json [10:52:55] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:00] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [10:55:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet [10:57:04] hashar: The last 20% seem to be taking as long as the rest of the reindexing [10:57:09] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:16] They took the 80/20 rule a bit too literally [10:57:59] it is at 9% [10:58:02] it is at 99% [10:58:16] I think it's because a good chunk of them already got reindexed while Gerrit 3.5 was only up for like 30/40 minutes [10:58:31] Right [10:59:25] it has completed the `changes` reindexing, 203 failed and I filed a task for them [10:59:34] it is still out of disk space :( [10:59:55] PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit [11:00:04] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1100).
[11:00:34] quite an aggressive jouncebot today [11:01:17] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 58823 bytes in 0.133 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:01:35] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:43] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:53] RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.5.4 (APACHE-SSHD-2.8.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [11:01:57] hmm gerrit is back! [11:02:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P40092 and previous config saved to /var/cache/conftool/dbconfig/20221117-110206-ladsgroup.json [11:02:19] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.039 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:02:27] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:34] vgutierrez: I'd wait on hashar's go before doing anything because it apparently still had disk space issues [11:02:41] yeah it is messy [11:02:45] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:47] I really have to relocate all those files out of / [11:02:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet [11:02:59] RECOVERY - Check systemd state on chartmuseum2001 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:45] (JobUnavailable) resolved: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:04:23] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:16] <_joe_> hashar: do you need SRE help? [11:06:36] at some point for sure since I will have to relocate files to the larger partition /srv [11:06:40] which is itself too small :-\ [11:06:59] <_joe_> no I mean right now, what's the status? [11:07:05] it is backup [11:07:08] the reindexing has completed [11:07:12] <_joe_> ok [11:07:15] <_joe_> so no more issues? [11:07:27] I am watching it [11:07:49] <_joe_> I see disk usage spiking [11:07:50] root's got like 2GB left [11:07:59] yeah that is the problem [11:08:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P40093 and previous config saved to /var/cache/conftool/dbconfig/20221117-110801-ladsgroup.json [11:08:13] <_joe_> hashar: if reindexing is over I'd expect it to stabilize? 
[11:08:55] the H2 database caches are in /var/lib/gerrit2/review_site/cache/ and currently occupies 26G out of a 46G partition [11:09:05] yes that is my expectation as well [11:09:12] <_joe_> it isn't though [11:09:47] <_joe_> we're going to run out of space soon [11:10:08] yes [11:10:12] there's 80G free in the vg [11:10:24] oh [11:10:28] ah but / isn't on the lvm [11:10:33] so yes nevermind, sorry [11:10:33] <_joe_> godog: yes [11:10:45] <_joe_> we need to move this dir to the VG [11:11:15] <_joe_> git_file_diff.h2.db is 8 gb and gerrit_file_diff.h2.db is 12 gb [11:11:34] so I guess create a `/srv/gerrit/cache` directory owned by gerrit2. Move the content of `/var/lib/gerrit2/review_site/cache` to it and leave a symlink [11:11:37] ACKNOWLEDGEMENT - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T318659 - Adding this server to the list of other servers exhibiting this behaviour. https://wikitech.wikimedia.org/wiki [11:11:37] %23Monitoring [11:11:50] <_joe_> hashar: do we strictly need to move it? [11:11:59] <_joe_> if it's a cache, can't we start from scratch? 
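The figures quoted above (free space on `/`, and which cache files are largest) can be gathered with a short script. This is only a minimal sketch using the Python standard library, not the tooling used in the channel; the helper names `free_gib` and `largest_files` are hypothetical, and the only detail taken from the log is the cache directory path.

```python
import os
import shutil


def free_gib(path):
    """Return free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def largest_files(root, top=5):
    """Return up to `top` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.lstat(full).st_size, full))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    # Cache directory as quoted in the log; adjust for other hosts.
    cache_dir = "/var/lib/gerrit2/review_site/cache"
    print(f"free on /: {free_gib('/'):.1f} GiB")
    for size, path in largest_files(cache_dir):
        print(f"{size / 2**30:6.1f} GiB  {path}")
```

Run before an upgrade, a check like this would show whether the filesystem holding the caches has room for a full reindex.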
[11:12:02] then they will no more be on the / partition [11:12:06] ah [11:12:11] yeah I guess I can flush it [11:12:19] <_joe_> anyways [11:12:36] or limit their size maybe [11:12:37] We can just create a new partition on the vg, copy the files over and mount on /var/lib/gerrit2 [11:12:48] <_joe_> claime: that's what I wanted to do yes [11:12:49] But that may not be the usual way y'all do things :p [11:13:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet [11:13:23] <_joe_> claime: ok, you can do it on the live server, then work on the parted recipe [11:13:37] <_joe_> live serverS, you should start from 2001 [11:13:49] <_joe_> 2002, sorry [11:14:16] _joe_: ack, you'll have to show me the parted recipe stuff [11:14:22] <_joe_> ah 2002 has root on lvs [11:14:23] FWIW either works, I personally would go the /srv route and there's advantage of not having yet another partman recipe [11:14:26] <_joe_> rotfl [11:14:32] yes that's the standard recipe [11:14:43] <_joe_> godog: yeah I think the standard recipe is ok then [11:15:02] And has a 73GB root so [11:15:16] yeah gerrit1001 isn't on that unfortunately yet, it'll be of course eventually [11:15:19] <_joe_> yeah clearly different hw generations, installed at different times [11:15:29] Failover to 2002, reimage 1001 ? [11:15:45] Or is that a pita ? 
[11:15:58] <_joe_> +1 from me but I'd also ask jelto ot weigh in :) [11:16:19] I'll step back but feel free to ping if needed [11:16:33] (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:17:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T323214)', diff saved to https://phabricator.wikimedia.org/P40094 and previous config saved to /var/cache/conftool/dbconfig/20221117-111712-ladsgroup.json [11:17:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:17:18] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [11:17:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:17:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T323214)', diff saved to https://phabricator.wikimedia.org/P40095 and previous config saved to /var/cache/conftool/dbconfig/20221117-111745-ladsgroup.json [11:17:46] <_joe_> hashar: should we go with the failover of gerrit? [11:17:56] (03PS2) 10Effie Mouzeli: profile::maps: remove chgrp_log [puppet] - 10https://gerrit.wikimedia.org/r/857697 (owner: 10Hnowlan) [11:18:15] I haven't done that in a while :\ [11:18:21] (03CR) 10Jcrespo: "Apologies for the delay- this was initially going to be a focus of the current quarter, but other things got in the way, as usual. 
I still" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/819109 (https://phabricator.wikimedia.org/T274463) (owner: 10Jcrespo) [11:19:15] given /srv has 72G free, I think it is easier to move the caches there [11:19:44] Easier but we end up in symlink-land [11:19:47] PROBLEM - Disk space on gerrit1001 is CRITICAL: DISK CRITICAL - free space: / 1321 MB (2% inode=91%): /tmp 1321 MB (2% inode=91%): /var/tmp 1321 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gerrit1001&var-datasource=eqiad+prometheus/ops [11:20:01] hashar: ^^ :) [11:20:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet [11:21:19] <_joe_> claime: let's go with creating another volume [11:21:23] <_joe_> as a stopgap [11:21:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:21:37] ack [11:21:37] <_joe_> then serviceops collab can decide how to proceed [11:21:49] <_joe_> will you do the honour? [11:21:56] (03PS7) 10Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) [11:22:01] on it, only for 1001 right ? [11:22:06] <_joe_> else I can create the LV [11:22:17] <_joe_> but we'll need to stop gerrit for the final copy of the data ofc [11:22:20] yes [11:22:36] 50G ok for the new part ? 
[11:22:42] <_joe_> yes [11:22:54] the caches hold 25G right now [11:23:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P40096 and previous config saved to /var/cache/conftool/dbconfig/20221117-112307-ladsgroup.json [11:23:24] <_joe_> /var/lib/gerrit2 is 34G [11:23:33] and I kind of suspect we have been carrying those cache files since forever and they have a lot of empty space [11:23:51] <_joe_> hashar: that's why I asked about flushing them [11:24:27] (03PS1) 10Jbond: Revert "hieradata: move multirootca standard settings to profile" [puppet] - 10https://gerrit.wikimedia.org/r/858266 [11:24:36] Just finished reading the scrollback, is there a reason we cannot flush the cache? [11:25:59] <_joe_> sobanski: well I'm not 100% sure it can be flushed without consequences [11:26:07] Ah [11:26:08] <_joe_> so I have a proposal [11:26:10] create [11:26:14] (03PS2) 10Jbond: Revert "hieradata: move multirootca standard settings to profile" [puppet] - 10https://gerrit.wikimedia.org/r/858266 [11:26:15] <_joe_> claime: ok [11:26:17] I think the flush cache command only flushes the in-memory cache [11:26:18] <_joe_> my proposal is [11:26:20] created, mounted on /mnt, rsyncing [11:26:31] <_joe_> we first rsync without the cache dir [11:26:46] Hmm I'm rsyncing all /var/lib/gerrit2 rn [11:26:55] naaa [11:27:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [11:27:46] the two big ones would be /var/lib/gerrit2/review_site/index (the Lucene index) and /var/lib/gerrit2/review_site/cache (H2 database persistent caches) [11:27:51] <_joe_> yeah I can confirm that it should be safe to flush the cache [11:28:05] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004941 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:28:05] which take 7G and 26G respectively [11:28:08]
<_joe_> the h2 caches definitely need to be flushed [11:28:31] review_site/cache/gerrit_file_diff.h2.db is 12G [11:28:37] <_joe_> https://gerrit-review.googlesource.com/Documentation/pgm-init.html init --delete-caches it looks like [11:28:39] I think we have been carrying those h2 database file since forever since everytime we upgrade or change host we rsync all files [11:29:26] <_joe_> hashar: yeah ok, for now let's not cause more disruption [11:29:30] <_joe_> the copy is going on fast [11:29:37] "Note that re-creation of these caches may be expensive" [11:29:53] <_joe_> we'll tell you when it's time to stop gerrit [11:29:56] Unsurprisingly [11:29:56] <_joe_> and to restart it [11:30:08] (03PS3) 10Jbond: Revert "hieradata: move multirootca standard settings to profile" [puppet] - 10https://gerrit.wikimedia.org/r/858266 [11:30:14] _joe_: is there a puppet fstab patch to make or we ? [11:30:18] w/e* [11:30:19] claime: you should only copy the cache directory [11:30:23] a [11:30:27] <_joe_> claime: puppet doesn't manage fstab :P [11:30:31] hashar: Honestly it's too late [11:30:37] (03CR) 10Jbond: [C: 03+2] Revert "hieradata: move multirootca standard settings to profile" [puppet] - 10https://gerrit.wikimedia.org/r/858266 (owner: 10Jbond) [11:30:55] PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:02] <_joe_> uh sigh [11:31:04] <_joe_> jayme: ^^ [11:31:06] I'm already rsyncing all of /var/lib/gerrit2 and it's about done [11:31:15] (03CR) 10Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:31:18] <_joe_> so once it's done [11:31:22] <_joe_> turn off gerrit [11:31:25] <_joe_> rsync again [11:31:29] <_joe_> remount [11:31:33] (PuppetFailure) resolved: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:31:44] ah you want to mount the whole of /var/lib/gerrit2 ? [11:31:44] <_joe_> well wipe out /var/lib/gerrit2 on / first [11:32:41] rsync done [11:32:44] ^ expected due to ganeti2024 reboot [11:32:45] fstab ready [11:32:45] <_joe_> ok [11:32:52] <_joe_> hashar: please turn off gerrit [11:32:55] hashar: you can stop gerrit [11:33:12] <_joe_> and tell us once it's done :) [11:33:15] _joe_: could we only mount the cache directory after all [11:33:24] it's too laaaaaaaaaaaaaaaaaaate [11:33:25] Submitted before I finished the sentence [11:33:25] disabling puppet and gerrit [11:33:49] I've already synced up the whole gerrit2 dir [11:33:58] <_joe_> yeah and it's frankly a better choice [11:34:03] I have stopped it and `lsof /var/lib/gerrit2` reports nothing [11:34:07] ack [11:34:07] nvm, we'll copy the rest back if we need to [11:34:12] rsyncing [11:34:12] <_joe_> ok [11:34:20] <3 [11:34:33] <_joe_> it's going to take time as most db cache files will have changed [11:34:52] yeah [11:34:56] a bit [11:35:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet [11:35:42] <_joe_> claime: once that's done, we need to remove data from the on-root /var/lib/gerrit [11:35:44] <_joe_> mount [11:35:48] <_joe_> run puppet [11:35:49] yes [11:36:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:36:03] RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 31.81 ms [11:36:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:36:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T318955)', diff saved to https://phabricator.wikimedia.org/P40097 and previous config saved to
/var/cache/conftool/dbconfig/20221117-113621-ladsgroup.json [11:36:26] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [11:36:33] hashar: do we have a tracking task for this issue or should I create one? [11:36:39] I am around if recoveries are needed- remember we have hourly backups [11:36:40] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) These servers are racked waiting to be imaged not sure if @papaul or @robh can assist imaging these and getting them handed over [11:36:45] jynus: ack [11:36:53] sobanski: I have been using the Gerrit 3.5 upgrade task https://phabricator.wikimedia.org/T307334 [11:37:11] PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit [11:37:14] well /mnt is full :\ [11:37:39] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1532 bytes in 0.007 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:37:50] That's just rsync being a dick [11:37:55] PROBLEM - gerrit process on gerrit1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [11:37:55] <_joe_> yes [11:38:06] sobanski: I guess we can get a child one for the disk space issue [11:38:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T323214)', diff saved to https://phabricator.wikimedia.org/P40098 and previous config saved to /var/cache/conftool/dbconfig/20221117-113814-ladsgroup.json [11:38:19] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - 
https://phabricator.wikimedia.org/T323214 [11:38:23] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:25] hashar: Doing that right now [11:38:27] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:31] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1532 too small - 1532 bytes in 0.008 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:38:33] sobanski: thank you! [11:38:35] PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The following units failed: gerrit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:57] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:45] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:45] (JobUnavailable) firing: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:39:58] <_joe_> ok so [11:40:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318955)', diff saved to https://phabricator.wikimedia.org/P40099 and previous config saved to /var/cache/conftool/dbconfig/20221117-114013-ladsgroup.json [11:40:21] 
<_joe_> claime: rsync is done right? [11:40:44] yeah sorry connection hiccup [11:40:54] <_joe_> should I proceed? [11:41:05] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:41:05] I'm on it [11:41:14] <_joe_> 👍 [11:41:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [11:41:45] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:51] ARGH the rm jitters [11:42:51] Who's on /mnt [11:43:02] <_joe_> claime: sorry, me :P [11:43:10] <_joe_> unmounted [11:43:13] Thanks *huffs* [11:43:45] <_joe_> now run-puppet-agent --force [11:43:55] that starts Gerrit [11:43:56] <_joe_> that should start gerrit up [11:44:50] <_joe_> claime: ^^ [11:44:55] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:03] ok it's got a -f [11:45:21] damn wrapper scripts I haven't got the hang off yet [11:45:27] puppet running [11:45:47] RECOVERY - gerrit process on gerrit1001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-11-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [11:45:49] Notice: /Stage[main]/Gerrit/Systemd::Service[gerrit]/Service[gerrit]/ensure: ensure changed 'stopped' to 'running' (corrective) [11:45:51] <_joe_> hashar: once it's back up and stable [11:45:56] gerrit started [11:46:01] <_joe_> you should update wikitech-l [11:46:07] <_joe_> there was some confusion earlier [11:46:14] /dev/mapper/gerrit1001--vg-gerrit 49G 34G 13G 73% /var/lib/gerrit2 [11:46:21] RECOVERY - Check systemd state on 
releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:22] Loaded for me now [11:46:25] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 53356 bytes in 0.066 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:46:29] RECOVERY - Check systemd state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:32] thanks ! [11:46:34] Free PE / Size 7655 / 29.90 GiB [11:46:40] If needed we can still stretch a bit [11:46:44] <_joe_> so we can grow it if needed yes [11:46:51] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:55] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:01] RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.5.4 (APACHE-SSHD-2.8.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [11:47:09] Follow up task created: https://phabricator.wikimedia.org/T323262 [11:47:29] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.037 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [11:47:38] If everything's good for y'all, I'm gonna go have a coffee and a smoke [11:47:41] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:41] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:44] !log sukhe@puppetmaster1001 conftool action : 
set/pooled=yes; selector: name=cp5032.eqsin.wmnet [11:47:52] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5032.eqsin.wmnet [11:48:03] claime, _joe_: thanks! [11:48:11] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [11:48:15] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:25] sobanski: np ;) [11:48:37] (03CR) 10Giuseppe Lavagetto: "Please see the inline question." [puppet] - 10https://gerrit.wikimedia.org/r/857793 (owner: 10Ladsgroup) [11:48:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [11:48:55] <_joe_> the new gerrit UI, uh sigh [11:49:24] <_joe_> now global comments are even more akin to comments in code [11:49:45] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01087 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:49:45] (JobUnavailable) resolved: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:50:05] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [11:50:26] (03CR) 10Clément Goubert: [V: 03+1] apple-search: Remove service from service::catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857706 (https://phabricator.wikimedia.org/T316296) (owner: 10Clément Goubert) [11:50:44] !log jmm@cumin2002 START - 
Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-codfw [11:51:54] what's the current status of gerrit- is it believed to be stable now? [11:52:37] (03CR) 10Ladsgroup: mediawiki: Get rid of extract2.php rewrites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857793 (owner: 10Ladsgroup) [11:52:42] or is maintenance (e.g. index update) still ongoing (to update topic)? [11:53:18] ^ hashar [11:55:02] (03CR) 10Giuseppe Lavagetto: mediawiki: Get rid of extract2.php rewrites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857793 (owner: 10Ladsgroup) [11:55:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40100 and previous config saved to /var/cache/conftool/dbconfig/20221117-115520-ladsgroup.json [11:55:57] jynus: it should be back [11:56:01] I am back to monitoring it [11:56:08] and I filed a placeholder incident report https://wikitech.wikimedia.org/wiki/Incidents/2022-11-17_Gerrit_3.5_upgrade [11:56:08] (03CR) 10Ladsgroup: mediawiki: Get rid of extract2.php rewrites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857793 (owner: 10Ladsgroup) [11:56:13] thanks [11:56:40] _joe_: claime: thank you very much! [11:56:54] hashar: np <3 [11:57:54] hashar: for an internal service I believe just updating our workmates was necessary, but I am not going to prevent you from writing a report if you want!
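The stopgap discussed above (new logical volume, first rsync while Gerrit runs, stop Gerrit, final rsync, remount on /var/lib/gerrit2) can be sketched as an ordered command plan. This is a hedged illustration only: the log does not show the exact commands that were run, so the filesystem type (ext4), the rsync options, and the `systemctl` call are assumptions, and `migration_plan` is a hypothetical helper. The volume group name `gerrit1001-vg` and volume name `gerrit` are inferred from the `/dev/mapper/gerrit1001--vg-gerrit` device shown in the log; the 50G size and the `/mnt` staging mount come from the conversation. The script only prints commands and executes nothing.

```python
import shlex


def migration_plan(vg, lv, size, src, staging="/mnt", service="gerrit"):
    """Return the shell commands, in order, for moving `src` onto a new LV.

    Nothing is executed here; the caller decides what to do with the list.
    """
    dev = f"/dev/{vg}/{lv}"
    return [
        ["lvcreate", "-L", size, "-n", lv, vg],
        ["mkfs.ext4", dev],  # assumption: the log does not name the filesystem
        ["mount", dev, staging],
        # First pass while the service is still running (bulk of the data).
        ["rsync", "-a", f"{src}/", f"{staging}/"],
        ["systemctl", "stop", service],
        # Final pass with the service stopped, so files are consistent.
        ["rsync", "-a", "--delete", f"{src}/", f"{staging}/"],
        ["umount", staging],
        # Removing the old copy from / and adding the fstab entry are
        # deliberately left as manual steps.
        ["mount", dev, src],
    ]


if __name__ == "__main__":
    plan = migration_plan("gerrit1001-vg", "gerrit", "50G", "/var/lib/gerrit2")
    for cmd in plan:
        print(shlex.join(cmd))
```

In the log, Gerrit came back up when Puppet was re-enabled and run, so the plan above stops at the final mount.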
[12:01:09] !log [urbanecm@mwmaint1002 ~]$ time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=enwiki # T318457 [12:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:14] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [12:01:37] !log Gerrit back since 11:45 UTC [12:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:41] RECOVERY - Disk space on gerrit1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gerrit1001&var-datasource=eqiad+prometheus/ops [12:02:06] !log [urbanecm@mwmaint1002 ~]$ time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=trwiki # T318457 [12:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:36] (03CR) 10MVernon: [C: 03+2] rewrite.py: changes for Phonos deployment [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [12:06:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [12:06:11] !log restart swift proxies to deploy phonos changes to rewrite.py T317417 [12:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:17] T317417: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 [12:06:22] I have sent emails to ops and wikitech-l [12:06:33] will write the incident report after lunch [12:07:17] no worries, hashar, thank you for your work, as I said on Slack [12:07:27] hashar: <3 have a good lunchbreak [12:07:28] (03CR) 10David Caro: ceph: osd: introduce support for single NIC setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[12:07:41] and let me know if I can help with the report [12:07:41] guess next time I will run `df` before upgrading! [12:08:05] time for me to plug my "Delete All The Things" policy again ;-) [12:08:07] and find a way to not disable all monitoring probes before doing the upgrade [12:09:43] (03CR) 10Hokwelum: [C: 03+1] "Ariel and I looked at this and it looks good, but we didn’t test!" [puppet] - 10https://gerrit.wikimedia.org/r/856653 (owner: 10Ebernhardson) [12:10:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P40101 and previous config saved to /var/cache/conftool/dbconfig/20221117-121026-ladsgroup.json [12:11:23] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005929 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:12:28] mmmm, puppet failure fleet wide, maybe? [12:12:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [12:12:35] ahm no, it recovered [12:12:59] (03PS1) 10Muehlenhoff: Set role contacts for webperf* roles to o11y [puppet] - 10https://gerrit.wikimedia.org/r/858294 [12:13:12] maybe it was a temporary falloff of gerrit [12:13:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [12:13:50] !log rolling restart of A:wikidough to pick up security updates [12:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:41] (03PS1) 10Muehlenhoff: Use default mail relay for miscweb* hosts [puppet] - 10https://gerrit.wikimedia.org/r/858297 [12:18:59] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad [12:21:04] (03CR) 10Volans: "Initial comments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [12:22:20] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [12:23:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [12:24:32] !log restarting slapd on serpens/seaborgium/ldap-corp* to pick up GNUTLS update [12:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318955)', diff saved to https://phabricator.wikimedia.org/P40103 and previous config saved to /var/cache/conftool/dbconfig/20221117-122532-ladsgroup.json [12:25:38] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [12:29:48] !log installing bluez security updates [12:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [12:31:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T323214)', diff saved to https://phabricator.wikimedia.org/P40104 and previous config saved to /var/cache/conftool/dbconfig/20221117-123128-ladsgroup.json [12:31:33] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [12:32:37] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@3bb99c2]: (no justification provided) [12:32:42] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@3bb99c2]: (no justification provided) (duration: 00m 05s) [12:38:58] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: move multirootca standard settings to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857667 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [12:40:33] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The 
following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P40105 and previous config saved to /var/cache/conftool/dbconfig/20221117-124634-ladsgroup.json [12:51:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: compute: cleanup unused code [puppet] - 10https://gerrit.wikimedia.org/r/858327 [12:51:30] (03PS1) 10Arturo Borrero Gonzalez: cloudvirts: make them use single NIC by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [12:55:05] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@4bdda20]: (no justification provided) [12:55:23] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@4bdda20]: (no justification provided) (duration: 00m 18s) [12:58:23] (03CR) 10Hokwelum: "Hello Daniel, Thanks once again! 
The dumps run don't use this data and pcc showed there were no content differences on the functionality t" [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [13:01:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P40106 and previous config saved to /var/cache/conftool/dbconfig/20221117-130141-ladsgroup.json [13:01:48] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/858330 [13:01:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [13:02:12] (03PS1) 10Jbond: apereo_cas: add new fact to detect cas version [puppet] - 10https://gerrit.wikimedia.org/r/858331 [13:02:14] (03PS1) 10Jbond: idp: Add missing/renamed keys [puppet] - 10https://gerrit.wikimedia.org/r/858332 (https://phabricator.wikimedia.org/T311235) [13:03:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38271/console" [puppet] - 10https://gerrit.wikimedia.org/r/858332 (https://phabricator.wikimedia.org/T311235) (owner: 10Jbond) [13:04:57] 10SRE, 10Pontoon, 10Patch-For-Review, 10User-fgiunchedi: Add PKI support to Pontoon - https://phabricator.wikimedia.org/T319163 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is now a thing! I've added bootstrap instructions at https://wikitech.wikimedia.org/wiki/Puppet/Pontoon#PKI and optimis... 
[13:05:32] (03PS8) 10Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) [13:05:57] (03CR) 10Muehlenhoff: idp: Add missing/renamed keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858332 (https://phabricator.wikimedia.org/T311235) (owner: 10Jbond) [13:07:13] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/858330 (owner: 10Muehlenhoff) [13:09:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [13:11:09] (03PS1) 10Muehlenhoff: Amend docs for rebasing to new upstream release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/858335 [13:12:51] (03PS1) 10Bartosz Dziewoński: Make "Add topic" button sticky [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858308 (https://phabricator.wikimedia.org/T316175) [13:13:29] (03PS1) 10Bartosz Dziewoński: CommentFormatter: Fix condition for lede button to consider new wrappers [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858309 (https://phabricator.wikimedia.org/T323171) [13:14:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [13:15:06] (03PS1) 10Bartosz Dziewoński: Remove override for Minerva hiding .tmbox, no longer needed [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858310 (https://phabricator.wikimedia.org/T257394) [13:15:21] (03PS1) 10Bartosz Dziewoński: CommentFormatter: Fix condition for lede button to consider table of contents [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858311 (https://phabricator.wikimedia.org/T323241) [13:15:35] (03PS1) 10Ssingh: lvs4009: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/858336 
(https://phabricator.wikimedia.org/T317247) [13:15:42] (03PS1) 10Bartosz Dziewoński: Fix GlobalUsage displaying one more row than requested [extensions/GlobalUsage] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858312 [13:16:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T323214)', diff saved to https://phabricator.wikimedia.org/P40107 and previous config saved to /var/cache/conftool/dbconfig/20221117-131647-ladsgroup.json [13:16:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [13:16:54] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [13:17:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [13:17:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T323214)', diff saved to https://phabricator.wikimedia.org/P40108 and previous config saved to /var/cache/conftool/dbconfig/20221117-131709-ladsgroup.json [13:17:31] (03PS1) 10Effie Mouzeli: maps: enable postgres replication slots in codfw [puppet] - 10https://gerrit.wikimedia.org/r/858337 (https://phabricator.wikimedia.org/T290149) [13:22:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet [13:25:43] (03PS11) 10Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [13:31:15] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) @Jclark-ctr - I can take on the initial server imaging, if that helps you out. I know that you've got SLAs in place and whatnot, but from our perspective I don't thin... 
[13:32:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet [13:33:15] (03CR) 10Muehlenhoff: [C: 03+2] Retire raid1-lvm-xfs-nova.cfg [puppet] - 10https://gerrit.wikimedia.org/r/855975 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:34:46] _joe_: claime: thank you again for stepping in on the Gerrit disk space issue this morning. I have been preparing it for a while and really feel dumb to have hit an issue as simple as a full partition :-\ [13:36:10] (03CR) 10Hnowlan: [C: 03+1] maps: enable postgres replication slots in codfw [puppet] - 10https://gerrit.wikimedia.org/r/858337 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [13:37:33] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:35] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38273/console" [puppet] - 10https://gerrit.wikimedia.org/r/857697 (owner: 10Hnowlan) [13:40:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet [13:41:55] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] profile::maps: remove chgrp_log [puppet] - 10https://gerrit.wikimedia.org/r/857697 (owner: 10Hnowlan) [13:42:53] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [13:43:03] (03PS1) 10Muehlenhoff: Retire ganeti.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/858339 [13:44:47] RECOVERY - High average POST latency 
for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [13:45:19] (03PS1) 10Slyngshede: DEB: Add missing requirements, and fix naming in changelog. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858340 [13:46:26] !log failover ganeti master in codfw to ganeti2021 [13:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:02] (03CR) 10Slyngshede: "Apparently naming in the changelog file matters a great deal." [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858340 (owner: 10Slyngshede) [13:47:43] (03CR) 10Herron: dispatch: upgrade to 20221110 and build with local config.js (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [13:47:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T323214)', diff saved to https://phabricator.wikimedia.org/P40109 and previous config saved to /var/cache/conftool/dbconfig/20221117-134753-ladsgroup.json [13:47:55] (03PS2) 10Muehlenhoff: Retire ganeti.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/858339 (https://phabricator.wikimedia.org/T156955) [13:47:59] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [13:51:13] PROBLEM - ganeti-wconfd running on ganeti2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:52:25] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:52:36] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet [13:52:37] (03CR) 10Jbond: [V: 03+1] idp: Add missing/renamed keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858332 (https://phabricator.wikimedia.org/T311235) (owner: 10Jbond) [13:53:05] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:47] (03PS1) 10Ladsgroup: Move api/index.html to docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858341 (https://phabricator.wikimedia.org/T273179) [13:55:51] (03PS1) 10Jbond: cfssl: make keys optional [puppet] - 10https://gerrit.wikimedia.org/r/858342 [13:56:06] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 6774 [13:58:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 6774 [13:59:16] (03CR) 10Ladsgroup: mediawiki: Get rid of extract2.php rewrites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857793 (owner: 10Ladsgroup) [14:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1400) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1400). [14:00:04] MatmaRex and cirno: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:11] I can deploy today! 
[14:00:26] MatmaRex: cirno: hi! [14:00:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet [14:00:39] hello [14:00:40] o/ [14:01:02] i'd prefer my DiscussionTools backports to go out all at once [14:01:06] MatmaRex: ack [14:01:30] and my GlobalUsage backport can't be tested until wmf.10 is back on Commons, but i verified it on the beta cluster [14:01:35] ack [14:01:46] (03CR) 10Urbanecm: [C: 03+2] Make "Add topic" button sticky [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858308 (https://phabricator.wikimedia.org/T316175) (owner: 10Bartosz Dziewoński) [14:01:48] (03CR) 10Urbanecm: [C: 03+2] CommentFormatter: Fix condition for lede button to consider new wrappers [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858309 (https://phabricator.wikimedia.org/T323171) (owner: 10Bartosz Dziewoński) [14:01:50] (03CR) 10Urbanecm: [C: 03+2] Remove override for Minerva hiding .tmbox, no longer needed [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858310 (https://phabricator.wikimedia.org/T257394) (owner: 10Bartosz Dziewoński) [14:01:52] (03CR) 10Urbanecm: [C: 03+2] CommentFormatter: Fix condition for lede button to consider table of contents [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858311 (https://phabricator.wikimedia.org/T323241) (owner: 10Bartosz Dziewoński) [14:01:54] (03CR) 10Urbanecm: [C: 03+2] Fix GlobalUsage displaying one more row than requested [extensions/GlobalUsage] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858312 (owner: 10Bartosz Dziewoński) [14:02:28] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:02:38] PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P40110 and 
previous config saved to /var/cache/conftool/dbconfig/20221117-140300-ladsgroup.json [14:03:01] (03PS2) 10Urbanecm: fiwiktionary: Add rollbacker group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856705 (https://phabricator.wikimedia.org/T323063) (owner: 10Stang) [14:03:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856705 (https://phabricator.wikimedia.org/T323063) (owner: 10Stang) [14:03:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/858331 (owner: 10Jbond) [14:03:13] 10SRE, 10Traffic: Wikipedia on flow with no http request, still responds with a Bad Request 400 - https://phabricator.wikimedia.org/T323263 (10Mohawkdavitty) TLSv1.3 just prevents RSA handshake decryption using the website cert/key, TLSv1.3 uses forward perfect secrecy connections that prevents this, but the... [14:04:17] (03Merged) 10jenkins-bot: fiwiktionary: Add rollbacker group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856705 (https://phabricator.wikimedia.org/T323063) (owner: 10Stang) [14:04:37] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/858332 (https://phabricator.wikimedia.org/T311235) (owner: 10Jbond) [14:04:57] ^ expected due to ganeti2020 reboot [14:05:26] RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms [14:05:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet [14:05:58] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:856705|fiwiktionary: Add rollbacker group (T323063)]] [14:06:02] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms [14:06:03] T323063: Enable user group rollbacker on fiwiktionary - https://phabricator.wikimedia.org/T323063 [14:06:25] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:856705|fiwiktionary: Add rollbacker group 
(T323063)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:06:39] cirno: your patch's at mwdebug1001, can you check? [14:07:16] looking [14:08:07] (03Merged) 10jenkins-bot: Make "Add topic" button sticky [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858308 (https://phabricator.wikimedia.org/T316175) (owner: 10Bartosz Dziewoński) [14:08:09] (03Merged) 10jenkins-bot: CommentFormatter: Fix condition for lede button to consider new wrappers [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858309 (https://phabricator.wikimedia.org/T323171) (owner: 10Bartosz Dziewoński) [14:08:10] urbanecm: checked via special:usergrouprights and LGTM [14:08:18] great, syncing! [14:08:38] (03Merged) 10jenkins-bot: Remove override for Minerva hiding .tmbox, no longer needed [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858310 (https://phabricator.wikimedia.org/T257394) (owner: 10Bartosz Dziewoński) [14:08:42] (03Merged) 10jenkins-bot: CommentFormatter: Fix condition for lede button to consider table of contents [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858311 (https://phabricator.wikimedia.org/T323241) (owner: 10Bartosz Dziewoński) [14:09:14] (03Merged) 10jenkins-bot: Fix GlobalUsage displaying one more row than requested [extensions/GlobalUsage] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858312 (owner: 10Bartosz Dziewoński) [14:09:23] just in time :) [14:12:34] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:856705|fiwiktionary: Add rollbacker group (T323063)]] (duration: 06m 35s) [14:12:39] T323063: Enable user group rollbacker on fiwiktionary - https://phabricator.wikimedia.org/T323063 [14:13:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] 
(wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858308 (https://phabricator.wikimedia.org/T316175) (owner: 10Bartosz Dziewoński) [14:13:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858309 (https://phabricator.wikimedia.org/T323171) (owner: 10Bartosz Dziewoński) [14:13:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858310 (https://phabricator.wikimedia.org/T257394) (owner: 10Bartosz Dziewoński) [14:13:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858311 (https://phabricator.wikimedia.org/T323241) (owner: 10Bartosz Dziewoński) [14:13:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GlobalUsage] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858312 (owner: 10Bartosz Dziewoński) [14:13:49] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:858308|Make "Add topic" button sticky (T316175)]], [[gerrit:858309|CommentFormatter: Fix condition for lede button to consider new wrappers (T323171)]], [[gerrit:858310|Remove override for Minerva hiding .tmbox, no longer needed (T257394)]], [[gerrit:858311|CommentFormatter: Fix condition for lede button to consider table of contents (T323241)]], [[gerrit:858312 [14:13:49] |Fix GlobalUsage displaying one more row than requested]] [14:13:56] (03CR) 10Ayounsi: Add OSPF automation template for EVPN switches (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [14:13:58] T316175: Make the mobile Add Topic button easier for people to access - https://phabricator.wikimedia.org/T316175 
[14:13:58] T257394: Tmbox template not displayed on WMF wikis on mobile (MinervaNeue skin) - https://phabricator.wikimedia.org/T257394 [14:13:58] T323171: "Learn more about this page" button doesn't appear as expected in mobile DiscussionTools - https://phabricator.wikimedia.org/T323171 [14:13:59] T323241: "Learn more about this page" button always appears when the page has a table of contents in mobile DiscussionTools - https://phabricator.wikimedia.org/T323241 [14:14:13] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:858308|Make "Add topic" button sticky (T316175)]], [[gerrit:858309|CommentFormatter: Fix condition for lede button to consider new wrappers (T323171)]], [[gerrit:858310|Remove override for Minerva hiding .tmbox, no longer needed (T257394)]], [[gerrit:858311|CommentFormatter: Fix condition for lede button to consider table of contents (T323241)]], [[gerr [14:14:14] it:858312|Fix GlobalUsage displaying one more row than requested]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:14:28] MatmaRex: all your backports are at mwdebug1001 now, can you check? [14:14:32] yeah [14:15:52] (03CR) 10JMeybohm: [C: 03+1] "does the right thing IMHO" [puppet] - 10https://gerrit.wikimedia.org/r/858342 (owner: 10Jbond) [14:16:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "Woot woot!" [puppet] - 10https://gerrit.wikimedia.org/r/858339 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:17:08] urbanecm: the changes look good, but i forgot to backport a small dependency in another repo :/ can i add another patch? 
[14:17:17] ( backport of https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/855724 ) [14:17:36] (needed by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/858310/ … weird that it still merged despite that, i thought Depends-On is aware of branches) [14:17:50] (03CR) 10Jbond: [C: 03+2] apereo_cas: add new fact to detect cas version [puppet] - 10https://gerrit.wikimedia.org/r/858331 (owner: 10Jbond) [14:18:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P40111 and previous config saved to /var/cache/conftool/dbconfig/20221117-141806-ladsgroup.json [14:18:10] MatmaRex: sure thing. [14:18:24] (03PS1) 10Bartosz Dziewoński: hacks: Stop hiding .fmbox and .tmbox [skins/MinervaNeue] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858313 (https://phabricator.wikimedia.org/T257394) [14:18:30] (03PS2) 10Urbanecm: hacks: Stop hiding .fmbox and .tmbox [skins/MinervaNeue] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858313 (https://phabricator.wikimedia.org/T257394) (owner: 10Bartosz Dziewoński) [14:18:32] heh [14:18:36] (03CR) 10Urbanecm: [C: 03+2] hacks: Stop hiding .fmbox and .tmbox [skins/MinervaNeue] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858313 (https://phabricator.wikimedia.org/T257394) (owner: 10Bartosz Dziewoński) [14:18:37] oh oops [14:18:47] !log urbanecm@deploy1002 Sync cancelled. [14:19:21] i wasn't planning to backport the patch that needs this in the first place, but it turned out another patch wouldn't cherry-pick cleanly without it. sorry about that [14:19:27] hashar: No problem that's what we're here for <3 [14:20:08] MatmaRex: no problem. [14:20:34] (03CR) 10Muehlenhoff: [C: 03+2] Retire ganeti.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/858339 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:21:29] (03CR) 10Ayounsi: [C: 03+1] "one small comment lgtm otherwise." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [14:21:36] MatmaRex: there are some reports of `Multiple writes to a write-once: limitreportdata discussiontools-limitreport-timeusage`, is that expected? [14:21:51] urbanecm: known issue, i filed a task the other day [14:21:55] okay [14:22:13] i don't know what causes it and no one explained yet https://phabricator.wikimedia.org/T323065 [14:22:16] but it seems harmless [14:22:46] i see [14:24:20] MatmaRex: looking at that and the deadlock/etc ones, I have a feeling the re-parsing the threads is being called multiple times inside one refreshlinks job [14:24:38] …huh [14:24:43] (duplicate parses) [14:24:57] (03Abandoned) 10Ssingh: [WIP] Arrays for lvs all_class_hosts [puppet] - 10https://gerrit.wikimedia.org/r/855682 (owner: 10BBlack) [14:25:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet [14:26:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38279/console" [puppet] - 10https://gerrit.wikimedia.org/r/858342 (owner: 10Jbond) [14:27:13] why would it start on 2022-11-10 though? 
it's a thursday but there was no train that week IIRC [14:27:48] a massive refreshlinks being queued [14:29:06] browser test failure :/ [14:29:46] (03CR) 10CI reject: [V: 04-1] hacks: Stop hiding .fmbox and .tmbox [skins/MinervaNeue] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858313 (https://phabricator.wikimedia.org/T257394) (owner: 10Bartosz Dziewoński) [14:29:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-wf1001.eqiad.wmnet [14:29:56] let's start again [14:30:12] doesn't look related [14:30:28] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl: make keys optional [puppet] - 10https://gerrit.wikimedia.org/r/858342 (owner: 10Jbond) [14:30:49] (03CR) 10Urbanecm: [C: 03+2] hacks: Stop hiding .fmbox and .tmbox [skins/MinervaNeue] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858313 (https://phabricator.wikimedia.org/T257394) (owner: 10Bartosz Dziewoński) [14:30:51] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1019.eqiad.wmnet with reason: Remove from cluster for eventual reimage [14:30:56] yeah [14:31:02] (03CR) 10Jbond: [V: 03+1] idp: Add missing/renamed keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858332 (https://phabricator.wikimedia.org/T311235) (owner: 10Jbond) [14:31:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1019.eqiad.wmnet with reason: Remove from cluster for eventual reimage [14:32:28] scap backport seems to not like the MinervaNeue patch, filed it as T323277 [14:32:28] T323277: scap backport: Multiple changes found for Ifb0316256bdec5008acc48544ddd3e2bf71b6d41 - https://phabricator.wikimedia.org/T323277 [14:32:39] but that's an error that can be worked around [14:33:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T323214)', diff saved to https://phabricator.wikimedia.org/P40112 and previous config saved to 
/var/cache/conftool/dbconfig/20221117-143313-ladsgroup.json [14:33:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [14:33:18] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [14:33:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [14:33:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T323214)', diff saved to https://phabricator.wikimedia.org/P40113 and previous config saved to /var/cache/conftool/dbconfig/20221117-143334-ladsgroup.json [14:34:00] !log depool cp2042 [14:34:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2015.codfw.wmnet [14:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [14:37:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1001.eqiad.wmnet [14:39:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [14:39:32] (03PS2) 10Urbanecm: GrowthExperiments: Enable unstarred mentorship filters at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857785 (https://phabricator.wikimedia.org/T318457) [14:39:38] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Enable unstarred mentorship filters at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857785 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [14:39:52] shipping this while waiting on CI [14:40:12] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Move api/index.html to docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858341 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [14:40:24] (03Merged) 10jenkins-bot: GrowthExperiments: Enable unstarred mentorship filters at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857785 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [14:40:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5002.eqsin.wmnet [14:41:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-wf1002.eqiad.wmnet [14:41:30] (03CR) 10Giuseppe Lavagetto: mediawiki: Get rid of extract2.php rewrites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857793 (owner: 10Ladsgroup) [14:41:41] ...actually, would need a backport too [14:41:54] (03Abandoned) 10Giuseppe Lavagetto: Add rake task to convert deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/856517 (owner: 10Giuseppe Lavagetto) [14:42:11] (03PS1) 10Urbanecm: Revert "GrowthExperiments: Enable unstarred mentorship filters at all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858314 (https://phabricator.wikimedia.org/T318457) [14:42:17] (03CR) 10Urbanecm: [C: 03+2] 
Revert "GrowthExperiments: Enable unstarred mentorship filters at all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858314 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [14:42:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "It works without human intervention on most charts" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855668 (owner: 10Giuseppe Lavagetto) [14:43:04] (03Merged) 10jenkins-bot: Revert "GrowthExperiments: Enable unstarred mentorship filters at all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858314 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [14:45:08] jouncebot: nowandnext [14:45:09] For the next 0 hour(s) and 14 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1400) [14:45:09] For the next 0 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1400) [14:45:09] In 2 hour(s) and 14 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1700) [14:45:21] urbanecm: ping me once you're done [14:45:28] sure thing [14:45:38] (03Merged) 10jenkins-bot: hacks: Stop hiding .fmbox and .tmbox [skins/MinervaNeue] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858313 (https://phabricator.wikimedia.org/T257394) (owner: 10Bartosz Dziewoński) [14:45:41] 10SRE-OnFire, 10Release-Engineering-Team, 10serviceops-collab, 10Sustainability (Incident Followup), 10Wikimedia-Incident: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [14:45:46] (03PS2) 10Effie Mouzeli: maps: enable postgres replication slots in codfw [puppet] - 10https://gerrit.wikimedia.org/r/858337 (https://phabricator.wikimedia.org/T290149) [14:46:02] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/857077/38244/" [puppet] - 10https://gerrit.wikimedia.org/r/858337 
(https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [14:46:33] (03CR) 10Effie Mouzeli: [C: 03+2] maps: enable postgres replication slots in codfw [puppet] - 10https://gerrit.wikimedia.org/r/858337 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [14:46:52] scap backport fails, resorting to manual deployment [14:46:58] (03Merged) 10jenkins-bot: Add rake task to perform basic conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/855668 (owner: 10Giuseppe Lavagetto) [14:47:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1002.eqiad.wmnet [14:47:51] MatmaRex: all six patches are at mwdebug1001 now [14:48:22] 10SRE-OnFire, 10Release-Engineering-Team, 10serviceops-collab, 10Sustainability (Incident Followup), 10Wikimedia-Incident: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) I have marked with #wikimedia-incident-actionable and #sre-onfire based on the incident re... [14:48:32] 10SRE-OnFire, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [14:48:35] urbanecm: thanks. 
looks good now [14:48:39] great, syncing [14:49:59] (03PS1) 10Vgutierrez: Revert "aptrepo: Add thirdparty/terraform" [puppet] - 10https://gerrit.wikimedia.org/r/858315 [14:50:11] (03PS2) 10Vgutierrez: Revert "aptrepo: Add thirdparty/terraform" [puppet] - 10https://gerrit.wikimedia.org/r/858315 [14:50:18] <_joe_> jouncebot: now [14:50:18] For the next 0 hour(s) and 9 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1400) [14:50:18] For the next 0 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1400) [14:50:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5002.eqsin.wmnet [14:50:35] !log urbanecm@deploy1002 Started scap: 4e419212: f659d88b: 65cd6881: 96e86cf: 5b94aca: 7a06c4b98: DiscussionTools, GlobalUsage, MinervaNeue backports (T316175, T323171, T257394, T323241) [14:50:44] T316175: Make the mobile Add Topic button easier for people to access - https://phabricator.wikimedia.org/T316175 [14:50:45] T257394: Tmbox template not displayed on WMF wikis on mobile (MinervaNeue skin) - https://phabricator.wikimedia.org/T257394 [14:50:45] T323171: "Learn more about this page" button doesn't appear as expected in mobile DiscussionTools - https://phabricator.wikimedia.org/T323171 [14:50:45] T323241: "Learn more about this page" button always appears when the page has a table of contents in mobile DiscussionTools - https://phabricator.wikimedia.org/T323241 [14:50:48] (03CR) 10Herron: [C: 03+1] "Good idea!" 
[puppet] - 10https://gerrit.wikimedia.org/r/857522 (https://phabricator.wikimedia.org/T301944) (owner: 10Filippo Giunchedi) [14:51:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:52:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/858315 (owner: 10Vgutierrez) [14:52:28] (03CR) 10Vgutierrez: [C: 03+2] Revert "aptrepo: Add thirdparty/terraform" [puppet] - 10https://gerrit.wikimedia.org/r/858315 (owner: 10Vgutierrez) [14:52:34] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:04] !log urbanecm@deploy1002 Finished scap: 4e419212: f659d88b: 65cd6881: 96e86cf: 5b94aca: 7a06c4b98: DiscussionTools, GlobalUsage, MinervaNeue backports (T316175, T323171, T257394, T323241) (duration: 04m 29s) [14:55:11] 10SRE-Access-Requests, 10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10Ottomata) [14:55:12] MatmaRex: all should be live now [14:55:20] (03CR) 10Filippo Giunchedi: [C: 03+1] "Not great but good enough and will get us going!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [14:55:27] thanks. sorry about making it more complicated [14:55:35] no worries, it happens :) [14:55:38] Amir1: over to you! 
[14:55:48] !log vgutierrez@apt1001:~$ sudo -i reprepro clearvanished [14:55:50] let's get the party started [14:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:02] (03CR) 10Ladsgroup: [C: 03+2] Move api/index.html to docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858341 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [14:56:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858341 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [14:56:41] (03CR) 10Muehlenhoff: DEB: Add missing requirements, and fix naming in changelog. (032 comments) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858340 (owner: 10Slyngshede) [14:56:53] (03Merged) 10jenkins-bot: Move api/index.html to docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858341 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [14:57:18] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:858341|Move api/index.html to docroot (T273179)]] [14:57:21] !log vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 update bullseye-wikimedia [14:57:23] T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179 [14:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:42] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:858341|Move api/index.html to docroot (T273179)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [15:02:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:02:30] (03PS4) 10Ladsgroup: mediawiki: Get rid of extract2.php rewrites [puppet] - 10https://gerrit.wikimedia.org/r/857793 [15:02:32] (03PS2) 10Slyngshede: DEB: Add missing requirements, and fix 
naming in control. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858340 [15:02:46] (03PS1) 10Elukey: Add the pause image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) [15:03:15] (03CR) 10Ladsgroup: mediawiki: Get rid of extract2.php rewrites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857793 (owner: 10Ladsgroup) [15:03:17] 10SRE-Access-Requests, 10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10BTullis) No objection from me. Do we need any additional approval from elsewhere in #sre or can we just go ahead and make the change? Maybe @odimitrijevic could... [15:03:24] (03CR) 10Slyngshede: DEB: Add missing requirements, and fix naming in control. (032 comments) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858340 (owner: 10Slyngshede) [15:03:26] (03Abandoned) 10Muehlenhoff: Sync to 6.6.2 of the CAS overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/856480 (owner: 10Muehlenhoff) [15:03:31] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Get rid of extract2.php rewrites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857793 (owner: 10Ladsgroup) [15:03:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T323214)', diff saved to https://phabricator.wikimedia.org/P40114 and previous config saved to /var/cache/conftool/dbconfig/20221117-150335-ladsgroup.json [15:03:41] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [15:03:46] (03PS2) 10Muehlenhoff: archiva/piwik: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/854574 (https://phabricator.wikimedia.org/T308013) [15:04:25] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:858341|Move api/index.html to docroot (T273179)]] (duration: 07m 07s) [15:04:30] T273179: Update the front-page of 
Wikimedia projects - https://phabricator.wikimedia.org/T273179 [15:04:31] (03CR) 10Ayounsi: "Looks nice! Some comments but overall lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [15:05:58] (03CR) 10Herron: dispatch: upgrade to 20221110 and build with local config.js (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [15:06:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:48] (03PS5) 10Ladsgroup: mediawiki: Get rid of extract2.php rewrites [puppet] - 10https://gerrit.wikimedia.org/r/857793 [15:06:53] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mediawiki: Get rid of extract2.php rewrites [puppet] - 10https://gerrit.wikimedia.org/r/857793 (owner: 10Ladsgroup) [15:07:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1019.eqiad.wmnet with OS bullseye [15:07:45] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1019.eqiad.wmnet with OS bullseye [15:08:48] (03PS3) 10Ladsgroup: Get rid of extract2.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857794 (https://phabricator.wikimedia.org/T273179) [15:10:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38282/console" [puppet] - 10https://gerrit.wikimedia.org/r/858332 (https://phabricator.wikimedia.org/T311235) (owner: 10Jbond) [15:11:14] 10SRE-OnFire, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 
(10hashar) [15:11:26] (03CR) 10Elukey: "Built it locally and tested with Docker (created the pause container and then shared some namespaces with a simple echo server)." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) (owner: 10Elukey) [15:12:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38284/console" [puppet] - 10https://gerrit.wikimedia.org/r/858332 (https://phabricator.wikimedia.org/T311235) (owner: 10Jbond) [15:13:36] (03CR) 10Jbond: [V: 03+1 C: 03+2] idp: Add missing/renamed keys [puppet] - 10https://gerrit.wikimedia.org/r/858332 (https://phabricator.wikimedia.org/T311235) (owner: 10Jbond) [15:17:26] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Add the pause image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) (owner: 10Elukey) [15:18:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P40115 and previous config saved to /var/cache/conftool/dbconfig/20221117-151842-ladsgroup.json [15:21:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1019.eqiad.wmnet with reason: host reimage [15:22:10] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:10] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858340 (owner: 10Slyngshede) [15:23:31] !log jnuche@deploy1002 Started scap: testing k8s deploys [15:24:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1019.eqiad.wmnet with reason: host reimage [15:25:28] (03CR) 10Volans: [C: 03+1] "LGTM if the template is changed accordingly :)" [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/857477 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [15:26:39] (03PS1) 10Ssingh: aptrepo: add comment about updating external repositories [puppet] - 10https://gerrit.wikimedia.org/r/858349 [15:26:58] (03PS1) 10Jbond: apero_cas: fix key name [puppet] - 10https://gerrit.wikimedia.org/r/858350 (https://phabricator.wikimedia.org/T311235) [15:28:02] (03CR) 10Jbond: [C: 03+2] apero_cas: fix key name [puppet] - 10https://gerrit.wikimedia.org/r/858350 (https://phabricator.wikimedia.org/T311235) (owner: 10Jbond) [15:29:01] (03CR) 10Ayounsi: Change get_underlay_ints() to use Netbox VRF field for filtering (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857477 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [15:33:37] (03PS2) 10Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [15:33:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P40116 and previous config saved to /var/cache/conftool/dbconfig/20221117-153348-ladsgroup.json [15:34:42] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:34:43] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:35:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/858335 (owner: 10Muehlenhoff) [15:35:11] 10SRE, 10Traffic: Wikipedia on flow with no http request, still responds with a Bad Request 400 - https://phabricator.wikimedia.org/T323263 (10Vgutierrez) this seems to be triggered by HAProxy, I just logged the H1 trace on a cloud test instance using: ` echo "trace h1 event +any; trace h1 level developer; tra... 
[15:36:32] (03CR) 10CI reject: [V: 04-1] cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [15:37:00] (03CR) 10Muehlenhoff: aptrepo: add comment about updating external repositories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858349 (owner: 10Ssingh) [15:37:41] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2008.codfw.wmnet [15:37:46] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2008.codfw.wmnet [15:37:54] !log hnowlan@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host maps2008.codfw.wmnet [15:38:26] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp.w.p to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/857563 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [15:38:41] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [15:39:08] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [15:39:11] (03PS2) 10Ssingh: aptrepo: add comment about updating external repositories [puppet] - 10https://gerrit.wikimedia.org/r/858349 [15:39:15] (03CR) 10Ssingh: aptrepo: add comment about updating external repositories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858349 (owner: 10Ssingh) [15:40:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/858349 (owner: 10Ssingh) [15:40:34] (03CR) 10Ssingh: [C: 03+2] aptrepo: add comment about updating external repositories [puppet] - 10https://gerrit.wikimedia.org/r/858349 (owner: 10Ssingh) [15:40:36] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Amend docs for rebasing to new upstream release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/858335 (owner: 10Muehlenhoff) [15:41:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
ganeti1019.eqiad.wmnet with OS bullseye [15:41:17] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1019.eqiad.wmnet with OS bullseye completed: - ganeti1019 (**PASS**) - Downtimed on... [15:41:19] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:41:40] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:41:57] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:41:57] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:41:57] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [15:41:57] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:41:57] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [15:41:57] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [15:41:58] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [15:41:58] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [15:42:18] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [15:42:18] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:42:19] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [15:43:32] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [15:44:14] (03CR) 10Slyngshede: [V: 03+1] DEB: Add missing requirements, and fix naming in control. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858340 (owner: 10Slyngshede) [15:44:20] (03CR) 10Slyngshede: [V: 03+2] DEB: Add missing requirements, and fix naming in control. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858340 (owner: 10Slyngshede) [15:44:22] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] DEB: Add missing requirements, and fix naming in control. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/858340 (owner: 10Slyngshede) [15:45:11] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [15:45:36] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:45:37] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:45:37] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [15:45:40] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [15:45:47] (03PS1) 10Effie Mouzeli: maps: hieradata tidy up [puppet] - 10https://gerrit.wikimedia.org/r/858352 [15:46:26] (03PS2) 10Elukey: Add the pause image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) [15:47:51] 10SRE-OnFire, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) [15:48:50] (03CR) 10Effie Mouzeli: "NOOP https://puppet-compiler.wmflabs.org/output/858352/38287/" [puppet] - 10https://gerrit.wikimedia.org/r/858352 (owner: 10Effie Mouzeli) [15:48:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T323214)', diff saved to https://phabricator.wikimedia.org/P40117 and previous config saved to /var/cache/conftool/dbconfig/20221117-154855-ladsgroup.json [15:48:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:49:01] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [15:49:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:49:53] (03CR) 10Effie Mouzeli: [C: 03+2] maps: hieradata tidy up [puppet] - 10https://gerrit.wikimedia.org/r/858352 (owner: 10Effie Mouzeli) [15:50:31] sukhe: I beat you to it, merge yours as well? [15:50:43] oh please do [15:50:46] thanks! [15:50:49] cheers [15:52:32] !log mforns@deploy1002 Started deploy [analytics/refinery@d7388a6]: Regular analytics weekly train [analytics/refinery@d7388a6] [15:55:15] jouncebot: nowandnext [15:55:15] No deployments scheduled for the next 1 hour(s) and 4 minute(s) [15:55:15] In 1 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1700) [15:55:42] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:55:43] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:55:43] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [15:55:43] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [15:55:43] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [15:55:43] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [15:55:43] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:55:44] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [15:56:10] (03CR) 10Ladsgroup: [C: 03+2] Get rid of extract2.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857794 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [15:56:51] (03Merged) 10jenkins-bot: Get rid of extract2.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857794 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [15:57:12] (03CR) 10Elukey: Add the pause image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858345 
(https://phabricator.wikimedia.org/T322920) (owner: 10Elukey) [15:57:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857794 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [15:57:48] !log mforns@deploy1002 Finished deploy [analytics/refinery@d7388a6]: Regular analytics weekly train [analytics/refinery@d7388a6] (duration: 05m 15s) [15:59:12] !log mforns@deploy1002 Started deploy [analytics/refinery@d7388a6] (thin): Regular analytics weekly train THIN [analytics/refinery@d7388a6] [15:59:14] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [15:59:16] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [15:59:20] !log mforns@deploy1002 Finished deploy [analytics/refinery@d7388a6] (thin): Regular analytics weekly train THIN [analytics/refinery@d7388a6] (duration: 00m 08s) [15:59:22] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:59:38] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:59:46] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [15:59:48] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [16:00:00] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [16:00:02] !log mforns@deploy1002 Started deploy [analytics/refinery@d7388a6] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d7388a6] [16:00:36] (03PS12) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [16:00:38] (03PS8) 10David Caro: wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 [16:00:40] (03PS5) 10David Caro: ceph.roll_restart_*daemons: allow ignoring 
current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 [16:00:56] (03PS1) 10Volans: setup.py: remove support from Python 3.7 and 3.8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/858355 [16:01:15] !log mforns@deploy1002 Finished deploy [analytics/refinery@d7388a6] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d7388a6] (duration: 01m 13s) [16:02:58] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:857794|Get rid of extract2.php (T273179)]] [16:03:03] T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179 [16:03:23] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:857794|Get rid of extract2.php (T273179)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [16:03:37] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [16:04:33] (03PS1) 10Ottomata: WIP flink image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 [16:04:57] 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, and 2 others: 502 Server Hangup Error on esams for "Upload a new version of this file" on Special:Upload on Commons - https://phabricator.wikimedia.org/T247454 (10jijiki) [16:05:43] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [16:07:37] (03PS3) 10Elukey: Add the pause image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858345 (https://phabricator.wikimedia.org/T322920) [16:08:22] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@d33ab6c]: implement incoming_links update as a batch job [16:08:50] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:857794|Get rid of extract2.php (T273179)]] (duration: 05m 51s) [16:08:55] T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179 [16:10:48] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@d33ab6c]: implement incoming_links update as a batch job (duration: 02m 26s) [16:11:50] 10SRE, 10Traffic, 10Upstream: Wikipedia on flow with no http request, still responds with a Bad Request 400 - https://phabricator.wikimedia.org/T323263 (10Vgutierrez) 05Open→03Stalled reported to upstream in https://github.com/haproxy/haproxy/issues/1934 [16:12:51] !log active CAS instance has been switched to CAS 6.6.2 (from 6.4.6.3) T311235 [16:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:56] T311235: Update CAS to 6.6 - https://phabricator.wikimedia.org/T311235 [16:13:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10ayounsi) I was wondering if there was any timeline for this, to unblock {T308339} Thanks! 
[16:13:16] (03CR) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [16:14:48] (03PS6) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) [16:20:47] (03PS2) 10Cathal Mooney: Change get_underlay_ints() to use Netbox VRF field for filtering [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857477 (https://phabricator.wikimedia.org/T312635) [16:21:14] (03CR) 10Cathal Mooney: Change get_underlay_ints() to use Netbox VRF field for filtering (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857477 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:21:34] (03Abandoned) 10Filippo Giunchedi: Add 'pybal_server_pooled' metric [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/845001 (https://phabricator.wikimedia.org/T321191) (owner: 10Filippo Giunchedi) [16:23:31] jouncebot: nowandnext [16:23:31] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [16:23:31] In 0 hour(s) and 36 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1700) [16:23:57] !log jnuche@deploy1002 Installing scap version "4.28.2" for 559 hosts [16:25:00] 10SRE, 10DynamicPageList (Wikimedia), 10serviceops-radar, 10Patch-For-Review, and 7 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10jijiki) [16:27:00] (03CR) 10Volans: [C: 03+2] setup.py: remove support from Python 3.7 and 3.8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/858355 (owner: 10Volans) [16:27:34] (03CR) 10Ayounsi: [C: 03+1] "+1 assuming the underlying template is being updated to account for those changes." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857477 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:28:47] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) (owner: 10Clément Goubert) [16:30:07] (03CR) 10Sergio Gimeno: [C: 03+1] GrowthExperiments: Run updateIsActiveFlagForMentees weekly [puppet] - 10https://gerrit.wikimedia.org/r/857776 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [16:30:56] (03Merged) 10jenkins-bot: setup.py: remove support from Python 3.7 and 3.8 [software/spicerack] - 10https://gerrit.wikimedia.org/r/858355 (owner: 10Volans) [16:30:58] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38288/console" [puppet] - 10https://gerrit.wikimedia.org/r/858346 (https://phabricator.wikimedia.org/T320403) (owner: 10Clément Goubert) [16:31:02] !log jnuche@deploy1002 Started scap: testing k8s deploys [16:31:56] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:31:57] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:32:37] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:32:37] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:32:38] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [16:32:39] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [16:32:39] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [16:32:39] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [16:32:39] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [16:32:39] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply 
[16:32:39] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [16:32:40] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [16:32:42] 10SRE-OnFire, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) [16:32:53] 10SRE-OnFire, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) [16:33:00] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [16:33:02] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [16:33:03] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [16:33:04] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [16:33:11] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [16:33:17] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [16:35:23] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:36:29] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [16:36:48] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [16:36:50] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [16:36:50] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [16:36:50] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [16:36:50] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [16:36:50] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [16:36:50] !log jnuche@deploy1002 
helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [16:36:50] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [16:36:51] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [16:36:59] PROBLEM - SSH on mw1331.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:37:04] 10SRE-OnFire, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [16:37:11] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [16:37:24] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [16:38:46] (03PS1) 10Volans: CHANGELOG: add changelogs for release v5.0.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/858359 [16:38:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:39:01] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v5.0.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/858359 (owner: 10Volans) [16:39:29] (03CR) 10David Caro: [C: 03+2] wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [16:39:34] (03CR) 10David Caro: [C: 03+2] wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 (owner: 10David Caro) [16:40:22] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [16:40:22] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply 
[16:40:42] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10netops, 10Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10ayounsi) The analysis of the core dump by JTAC showed that we were victim of this bug https://prsearch.juniper.net/problemreport/PR1080132 Even... [16:40:52] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [16:40:52] (03PS6) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 [16:40:55] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [16:40:58] (03CR) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [16:41:13] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [16:42:55] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:43:03] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v5.0.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/858359 (owner: 10Volans) [16:43:05] (03CR) 10CI reject: [V: 04-1] wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 (owner: 10David Caro) [16:43:21] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [16:43:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:44:04] 10SRE-OnFire, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) [16:44:47] (03PS4) 10Cathal Mooney: 
Add OSPF automation template for EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) [16:45:45] (03PS1) 10Volans: Upstream release v5.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/858361 [16:46:21] !log jnuche@deploy1002 Finished scap: testing k8s deploys (duration: 15m 19s) [16:46:51] (03CR) 10CI reject: [V: 04-1] Add OSPF automation template for EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:47:02] (03CR) 10Volans: [C: 03+2] Upstream release v5.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/858361 (owner: 10Volans) [16:47:46] (03PS1) 10Jbond: apero_cas: (WIP) add addtional paramas for OIDC [puppet] - 10https://gerrit.wikimedia.org/r/858362 (https://phabricator.wikimedia.org/T311999) [16:48:13] !log jnuche@deploy1002 Installing scap version "4.28.2" for 1 hosts [16:48:44] (03PS1) 10Filippo Giunchedi: team-dcops: add alerts for mgmt down [alerts] - 10https://gerrit.wikimedia.org/r/858363 (https://phabricator.wikimedia.org/T310266) [16:49:09] (03CR) 10Cathal Mooney: [C: 03+2] Change get_underlay_ints() to use Netbox VRF field for filtering [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857477 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:51:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/857725 (owner: 10Volans) [16:51:35] (03Merged) 10jenkins-bot: Upstream release v5.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/858361 (owner: 10Volans) [16:51:44] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: disable HostHeaderCheck [cookbooks] - 10https://gerrit.wikimedia.org/r/857725 (owner: 10Volans) [16:52:17] (03CR) 10Filippo Giunchedi: [C: 03+2] team-dcops: add alerts for mgmt down [alerts] - 10https://gerrit.wikimedia.org/r/858363 (https://phabricator.wikimedia.org/T310266) 
(owner: 10Filippo Giunchedi) [16:53:25] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Change get_underlay_ints() to use Netbox VRF field for filtering [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857477 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:54:18] 10SRE-OnFire, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab, and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10Jelto) > SRE: > > [ ] Bring primary and replica in sync configuration-wise (SRE) > [ ] summarize disk stuff (Partman recipe etc.... [16:54:53] (03PS5) 10Cathal Mooney: Add OSPF automation template for EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) [16:55:12] !log uploaded spicerack_5.0.1 to apt.wikimedia.org bullseye-wikimedia [16:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:37] (03CR) 10Clément Goubert: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/858360 (owner: 10Clément Goubert) [16:56:39] (03Merged) 10jenkins-bot: sre.hosts.provision: disable HostHeaderCheck [cookbooks] - 10https://gerrit.wikimedia.org/r/857725 (owner: 10Volans) [16:57:30] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10thcipriani) (meta note: tagging with the weird-for-this-task tag: #gitlab-boomerang because that's our curr... 
[16:58:05] (03CR) 10Filippo Giunchedi: "Forgot to add: you need to bump the changelog too" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [16:59:11] (03PS3) 10Cathal Mooney: Unify routing-intstance config across JunOS devices [homer/public] - 10https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) [16:59:12] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-web helmfile deployment - https://phabricator.wikimedia.org/T321900 (10Clement_Goubert) 05In progress→03Resolved [16:59:23] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [16:59:33] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-jobrunner helmfile deployment - https://phabricator.wikimedia.org/T321897 (10Clement_Goubert) 05In progress→03Resolved [16:59:45] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [16:59:53] (03CR) 10Cathal Mooney: Unify routing-intstance config across JunOS devices (037 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [17:00:05] jbond and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1700) [17:00:05] Urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:05] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [17:00:15] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-api-ext helmfile deployment - https://phabricator.wikimedia.org/T321896 (10Clement_Goubert) 05In progress→03Resolved [17:00:21] o/ [17:00:25] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Create mw-api-int helmfile deployment - https://phabricator.wikimedia.org/T321895 (10Clement_Goubert) 05In progress→03Resolved [17:00:32] catching up from overnight scrollback: thank you claime + _joe_ for quick lvm work to restore gerrit <3 [17:00:33] urbanecm: hey! with you in a few minutes, sorry [17:00:38] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [17:00:49] rzl: no worries, I'll wait. [17:00:54] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Stop spamming SAL with helmfile on scap deployments - https://phabricator.wikimedia.org/T323296 (10Clement_Goubert) [17:01:43] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Stop spamming SAL with helmfile on scap deployments - https://phabricator.wikimedia.org/T323296 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium [17:01:55] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [17:02:03] (03PS2) 10Clément Goubert: mw-*: Remove sal logging hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/858360 (https://phabricator.wikimedia.org/T323296) [17:02:23] thcipriani: yw :) [17:03:28] (03PS3) 10Clément Goubert: mw-*: Remove sal logging hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/858360 (https://phabricator.wikimedia.org/T323296) [17:06:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: 
CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:42] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:07:05] (03PS3) 10Arturo Borrero Gonzalez: cloudvirts: introduce modern NIC setup and use it by default [puppet] - 10https://gerrit.wikimedia.org/r/858328 (https://phabricator.wikimedia.org/T319184) [17:10:01] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1001.eqiad.wmnet [17:10:02] (03CR) 10JMeybohm: "This breaks puppet in my pontoon stack:" [puppet] - 10https://gerrit.wikimedia.org/r/857023 (owner: 10Majavah) [17:10:56] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1001.eqiad.wmnet [17:11:07] (03CR) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [17:11:11] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1001.eqiad.wmnet [17:12:08] (03CR) 10Cathal Mooney: Add OSPF automation template for EVPN switches (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [17:13:22] (03CR) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) (owner: 
10Cathal Mooney) [17:16:16] urbanecm: sorry about that! looking now -- PCC seems to have an error for that change but I'm not sure if it's due to your code, which seems fine, going to rerun out of an abundance of caution [17:16:34] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10jijiki) 05Open→03Resolved a:03jijiki Bluntly closing [17:16:35] Okay [17:17:08] (03PS2) 10Herron: dispatch: upgrade to 20221110 and build with local config.js [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229) [17:17:21] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38289/console" [puppet] - 10https://gerrit.wikimedia.org/r/857776 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [17:17:40] <_joe_> rzl: rebase the change [17:17:48] urbanecm: my fault for overetaining in a meeting rzl for longer than expected! 
[17:17:59] _joe_: oh duh, thanks [17:18:06] No worries jynus :) [17:18:23] (03PS2) 10RLazarus: GrowthExperiments: Run updateIsActiveFlagForMentees weekly [puppet] - 10https://gerrit.wikimedia.org/r/857776 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [17:18:57] <_joe_> rzl: I mean if it was failing, there was an interval of 3 days where role::mediawiki::maintenance failed to compile [17:19:11] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38290/console" [puppet] - 10https://gerrit.wikimedia.org/r/857776 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [17:19:34] okay, *that* failure is just because I can't spell [17:19:44] <_joe_> ahah ok [17:19:49] but, "CRITICAL: Unexpected error running run_host: Unable to find fact file for: mwmain1002.eqiad.wmnet under directory /var/lib/catalog-differ/puppet" probably shouldn't get posted to the change as PCC SUCCESS [17:19:54] (n.b. mwmain) [17:20:04] <_joe_> I was definitely overthinking it then :P [17:20:05] once more for real [17:20:17] no, the original failure you were totally right I think [17:20:29] and rebasing was correct, I just screwed up rerunning it after I rebased [17:20:37] <_joe_> ack, lol [17:21:48] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38291/console" [puppet] - 10https://gerrit.wikimedia.org/r/857776 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [17:21:54] "wow" I thought to myself "it runs a whole lot faster after all the changes j.bond made, he didn't even mention that" [17:22:14] but no it just goes quicker if you type in a bogus hostname and there's nothing to compile [17:22:37] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [17:23:09] okay that's better! going ahead, urbanecm will you want me to test-start the job once it's at mwmaint1002? 
[17:23:12] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [17:23:36] rzl: i ran the script manually earlier today, so it's fine to wait for the next run :) [17:23:42] 👍 [17:23:47] (03CR) 10RLazarus: [V: 03+1 C: 03+2] GrowthExperiments: Run updateIsActiveFlagForMentees weekly [puppet] - 10https://gerrit.wikimedia.org/r/857776 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [17:26:52] rzl: just curious, is pcc supposed to provide different/better results the way you ran it? I'm asking as i used `check experimental` on the change (which returned https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/1480/console). [17:27:52] urbanecm: yeah, if you click through from there to "console output" there's a line that says "Run finished; see your results at https://puppet-compiler.wmflabs.org/output/857776/1480/" [17:28:33] (which provides more detailed output about what changed) [17:28:53] but in this case, clicking on the hostname there gets you a header that says "compiler failure" which is what I was looking at [17:29:20] yeah, and it looks very similar to the PCC you've posted (the last working one i mean). 
I'm asking if there's any difference between those two [17:29:29] I actually got the same error on the rerun after all, but I think it's bogus; a bunch of stuff about this recently changed so I'll follow up about it to see what's going on there [17:29:43] I don't think there is a difference after all, no, sorry for not being clearer [17:30:10] okay, thanks [17:34:58] (03PS13) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [17:35:00] (03PS9) 10David Caro: wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 [17:35:02] urbanecm: iirc, both just launch a jenkins job [17:35:02] (03PS7) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 [17:35:58] i see [17:36:24] jouncebot: nowandnext [17:36:24] For the next 0 hour(s) and 23 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1700) [17:36:25] In 0 hour(s) and 23 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1800) [17:37:48] RECOVERY - SSH on mw1331.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:37:53] rzl, urbanecm: thinking of deploying a backport for a train blocker, let me know when i won't be stepping on your toes? [17:38:05] not on mine :) [17:38:09] brennen: go for it [17:38:13] thx! 
[17:39:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858226 (https://phabricator.wikimedia.org/T323254) (owner: 10Krinkle) [17:42:52] PROBLEM - puppet last run on sretest1001 is CRITICAL: CRITICAL: Puppet has been disabled for 789350 seconds, message: alex testing - akosiaris, last run 9 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:45:50] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host sretest1001.eqiad.wmnet [17:45:51] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1001.eqiad.wmnet [17:46:01] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1001.eqiad.wmnet [17:52:10] 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10RobH) @btullis: I've gone ahead and requested quotation to get replacement batteries. In the future, be aware we have a [[ ht... 
[17:53:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1003:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:53:25] (03Merged) 10jenkins-bot: InitializeArticleMaybeRedirect hook: Improve docs & restrict [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858226 (https://phabricator.wikimedia.org/T323254) (owner: 10Krinkle) [17:53:53] !log brennen@deploy1002 Started scap: Backport for [[gerrit:858226|InitializeArticleMaybeRedirect hook: Improve docs & restrict (T323254)]] [17:53:58] T323254: MediaWiki->initializeArticle on FlaggedRevs wikis triggers deprecated "Unexpected clearActionName after getActionName" - https://phabricator.wikimedia.org/T323254 [17:54:17] !log brennen@deploy1002 brennen and krinkle: Backport for [[gerrit:858226|InitializeArticleMaybeRedirect hook: Improve docs & restrict (T323254)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [17:58:20] (03CR) 10FNegri: ceph.roll_restart_*daemons: allow ignoring current health issues (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [17:58:33] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1001.eqiad.wmnet [17:59:48] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:858226|InitializeArticleMaybeRedirect hook: Improve docs & restrict (T323254)]] (duration: 05m 55s) [17:59:53] T323254: MediaWiki->initializeArticle on FlaggedRevs wikis triggers deprecated "Unexpected clearActionName after getActionName" - https://phabricator.wikimedia.org/T323254 [18:00:05] bd808: Time to snap out of that daydream and 
deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1800). [18:02:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:40] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 464 and 3042 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:04:43] (03CR) 10FNegri: [C: 04-1] "I think there is one naming inconsistency left" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [18:05:01] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [18:05:49] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2008.codfw.wmnet [18:07:03] (03CR) 10FNegri: [C: 04-1] ceph.roll_restart_*daemons: allow ignoring current health issues (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [18:07:45] (03PS1) 10JHathaway: aux-k8s: monitor eqiad BGP sessions [puppet] - 10https://gerrit.wikimedia.org/r/858395 (https://phabricator.wikimedia.org/T321120) [18:08:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:47] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38292/console" [puppet] - 10https://gerrit.wikimedia.org/r/858395 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [18:14:07] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetdb2003.codfw.wmnet [18:16:34] !log brennen@deploy1002 Started deploy 
[phabricator/deployment@f68dc24]: deploy mysql.port value to local config [18:17:33] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f68dc24]: deploy mysql.port value to local config (duration: 00m 58s) [18:22:40] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:26:01] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [18:27:23] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Stop spamming SAL with helmfile on scap deployments - https://phabricator.wikimedia.org/T323296 (10JMeybohm) helmfile_log_sal has support for that already: ` # Allow to explicitely suppress logging to SAL SUPPRESS_SAL=${SUPPRESS_SAL:-false} ` [18:27:38] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [18:28:41] (03CR) 10David Caro: [C: 03+2] wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [18:31:49] (03Merged) 10jenkins-bot: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [18:32:01] (03Merged) 10jenkins-bot: wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 (owner: 10David Caro) [18:33:12] PROBLEM - SSH on mw1329.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:36:47] !log ebernhardson@deploy1002 Started deploy [wdqs/wdqs@fb7d161]: 0.3.118 [18:43:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1003:9193 is burning free allocators at a very high 
rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [18:44:24] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:44:51] !log upgraded spicerack to v5.0.1 on the cumin hosts [18:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:02] (03PS8) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 [18:45:04] (03CR) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues (034 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [18:45:17] (03PS1) 10Ebernhardson: Increase CirrusSearch-Search pool counter by 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858396 [18:46:11] (03PS2) 10Ebernhardson: Increase CirrusSearch-Search pool counter by 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858396 [18:46:22] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:47:18] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10Dzahn) >>! In T323262#8403129, @Jelto wrote: > @Dzahn what are you thoughts on reimaging `gerrit2002` with... 
[18:48:00] !log ebernhardson@deploy1002 Finished deploy [wdqs/wdqs@fb7d161]: 0.3.118 (duration: 11m 12s) [18:51:36] (03PS1) 10Dzahn: hieradata: switch active Phabricator server to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/858397 (https://phabricator.wikimedia.org/T280597) [18:52:46] had a bit of weirdness with last backport sync, running a sync world to see if that's reproducible before the train window. [18:52:51] !log brennen@deploy1002 Started scap: no-op deploy to attempt re-pull on parse1015.eqiad.wmnet [18:54:14] (03PS1) 10Jbond: redfish: update reboot detection loging [software/spicerack] - 10https://gerrit.wikimedia.org/r/858398 [18:57:13] !log brennen@deploy1002 Finished scap: no-op deploy to attempt re-pull on parse1015.eqiad.wmnet (duration: 04m 21s) [18:57:53] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/858398 (owner: 10Jbond) [18:59:38] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2042.codfw.wmnet [19:00:04] brennen and jeena: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T1900). 
[19:00:21] (03CR) 10Jbond: [C: 03+2] redfish: update reboot detection loging [software/spicerack] - 10https://gerrit.wikimedia.org/r/858398 (owner: 10Jbond) [19:00:36] o/ [19:01:14] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts puppetdb2003.codfw.wmnet [19:02:51] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: Call provision cookbook after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 [19:05:01] (03Merged) 10jenkins-bot: redfish: update reboot detection loging [software/spicerack] - 10https://gerrit.wikimedia.org/r/858398 (owner: 10Jbond) [19:05:40] (03CR) 10Volans: [C: 04-1] "Need to fix the virtualization bit" [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 (owner: 10Jbond) [19:06:24] !log train 1.40.0-wmf.10 (T320515) - no current blockers; rolling first to group1, 10 minutes or so to bake in, then will attempt all wikis. [19:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:30] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [19:06:52] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858402 (https://phabricator.wikimedia.org/T320515) [19:06:54] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: Call provision cookbook after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 [19:06:56] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858402 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [19:07:09] (03PS1) 10Volans: CHANGELOG: add changelogs for release v5.0.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/858403 [19:07:18] (03CR) 10Volans: [V: 03+2 C: 03+2] CHANGELOG: add changelogs for release v5.0.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/858403 (owner: 10Volans) [19:08:55] (03Merged) 10jenkins-bot: group1 
wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858402 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [19:09:37] (03PS1) 10Volans: Upstream release v5.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/858404 [19:09:45] (03CR) 10Volans: [V: 03+2 C: 03+2] Upstream release v5.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/858404 (owner: 10Volans) [19:13:02] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.10 refs T320515 [19:13:08] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [19:13:43] !log uploaded spicerack_5.0.2 to apt.wikimedia.org bullseye-wikimedia [19:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:35] !log installed spicerack v5.0.2 on the cumin hosts [19:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:43] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.10 refs T320515 (duration: 03m 40s) [19:25:23] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: Call provision cookbook after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 [19:28:22] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:30:16] (03PS1) 10Herron: prometheus: disable caching of prometheus-site.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) [19:34:04] RECOVERY - SSH on mw1329.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:34:09] 10SRE, 10SRE-Access-Requests, 
10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10jcrespo) > Do we need any additional approval from elsewhere in SRE or can we just go ahead and make the change Regarding approvals, if the change is j... [19:35:17] (03PS4) 10Jbond: sre.hardware.upgrade-firmware: Call provision cookbook after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 [19:36:18] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10Dzahn) not fixed or now fixed:) either way, thank you for the new compiler! will do if I ever notice it again. [19:38:44] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: jenkins-bot puppet-compiler-test may report SUCCESS though compiling failed - https://phabricator.wikimedia.org/T214629 (10jbond) :) now fixed, updated [19:38:47] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Grant ssh access to analytics-admins to dcausse and gmodena - https://phabricator.wikimedia.org/T323280 (10jcrespo) According to Namely, Will and Guillome should approve for each + either Otto or Olja from your side (let me know if that is up to date). 
[19:39:20] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:40:56] (03PS2) 10JHathaway: aux-k8s: monitor eqiad BGP sessions [puppet] - 10https://gerrit.wikimedia.org/r/858395 (https://phabricator.wikimedia.org/T321120) [19:41:06] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:43:30] (03PS5) 10Jbond: sre.hardware.upgrade-firmware: Call provision cookbook after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 [19:44:48] (03CR) 10Herron: [V: 03+2 C: 03+2] dispatch: upgrade to 20221110 and build with local config.js (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857781 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [19:44:55] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38294/console" [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [19:48:51] (03CR) 10Vgutierrez: [V: 03+1] prometheus: disable caching of prometheus-site.wm.o (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [19:50:08] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:50:43] (03PS2) 10Herron: prometheus: disable caching of prometheus-site.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) [19:52:16] (03CR) 10Herron: prometheus: disable caching of prometheus-site.wm.o (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) (owner: 
10Herron) [19:53:52] (03PS1) 10Vgutierrez: cache: Remove wikiba.se caching rules [puppet] - 10https://gerrit.wikimedia.org/r/858408 [19:54:32] PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2052.codfw.wmnet with OS bullseye [19:54:54] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10RKemper) Reimaging now [19:54:59] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2052.codfw.wmnet with OS bullseye [19:56:04] (03PS2) 10Vgutierrez: cache: Remove wikiba.se caching rules [puppet] - 10https://gerrit.wikimedia.org/r/858408 [19:56:28] (03PS3) 10JHathaway: aux-k8s: monitor eqiad BGP sessions [puppet] - 10https://gerrit.wikimedia.org/r/858395 (https://phabricator.wikimedia.org/T321120) [19:57:26] RECOVERY - Host sretest1002 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [19:58:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10RobH) T323301 tracks the ordering of 23 new raid controller batteries. While only 8 have failed so far, we have a to... 
[19:58:02] (03PS1) 10Dzahn: phabricator: switch from phab1001 to phab1004 [dns] - 10https://gerrit.wikimedia.org/r/858409 [19:58:28] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38295/console" [puppet] - 10https://gerrit.wikimedia.org/r/858395 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [19:58:54] (03CR) 10JHathaway: [V: 03+1 C: 03+2] aux-k8s: monitor eqiad BGP sessions [puppet] - 10https://gerrit.wikimedia.org/r/858395 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [20:00:07] (03PS3) 10Ebernhardson: Increase CirrusSearch-Search pool counter by 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858396 [20:02:10] (03CR) 10Ssingh: [C: 03+1] cache: Remove wikiba.se caching rules [puppet] - 10https://gerrit.wikimedia.org/r/858408 (owner: 10Vgutierrez) [20:02:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:38] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-36), 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10MusikAnimal) 05Stalled→03Resolved There's nothing to... 
[20:09:29] (03PS2) 10Dzahn: phabricator: switch from phab1001 to phab1004, discovery and SPF [dns] - 10https://gerrit.wikimedia.org/r/858409 (https://phabricator.wikimedia.org/T280597) [20:10:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:40] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2052.codfw.wmnet with reason: host reimage [20:11:43] (03CR) 10Dzahn: [C: 03+1] "yea, this has been removed. it's controlled by WMDE and points to lambda.wikimedia.de" [puppet] - 10https://gerrit.wikimedia.org/r/858408 (owner: 10Vgutierrez) [20:12:09] (03PS6) 10Jbond: sre.hardware.upgrade-firmware: Call provision cookbook after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 [20:13:14] (03CR) 10Dzahn: [C: 03+1] "stuff like https://phabricator.wikimedia.org/T232246 can also be untagged from SRE" [puppet] - 10https://gerrit.wikimedia.org/r/858408 (owner: 10Vgutierrez) [20:13:29] 10SRE, 10Wikidata, 10wdwb-tech, 10wikiba.se website, 10HTTPS: Set HSTS on wikiba.se (force HTTPS) - https://phabricator.wikimedia.org/T232246 (10Dzahn) This is a task for WMDE but not for WMF SRE anymore. wikiba.se is controlled by WMDE, not WMF now. [20:13:39] 10SRE, 10Wikidata, 10wdwb-tech, 10wikiba.se website, 10HTTPS: Set HSTS on wikiba.se (force HTTPS) - https://phabricator.wikimedia.org/T232246 (10Dzahn) also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/858408/ [20:13:59] (03PS1) 10Urbanecm: dumps: Keep only 13 latest growthmentorship dumps [puppet] - 10https://gerrit.wikimedia.org/r/858410 [20:14:28] 10SRE, 10Wikidata, 10wdwb-tech, 10wikiba.se website, 10HTTPS: Set HSTS on wikiba.se (force HTTPS) - https://phabricator.wikimedia.org/T232246 (10Dzahn) @Addshore Is it really "External Realm" anymore?
[20:15:04] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2052.codfw.wmnet with reason: host reimage [20:15:06] ^ removing SRE tag from that since .. it makes no sense we have that ticket if the domain is not under our control since years now [20:17:06] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: Call provision cookbook after upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/858401 (owner: 10Jbond) [20:19:51] (03PS1) 10Dzahn: update SPF record for phabricator.wikimedia.org, phab2001->phab2002 [dns] - 10https://gerrit.wikimedia.org/r/858412 (https://phabricator.wikimedia.org/T280597) [20:20:32] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:24:20] (03PS1) 10Jbond: sre/hardware/upgrade-firmware: ensure we disable hostcheck [cookbooks] - 10https://gerrit.wikimedia.org/r/858413 [20:25:54] (03CR) 10Volans: [C: 03+1] "LGTM, much simpler than the alternatives!!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/858413 (owner: 10Jbond) [20:26:10] (03PS3) 10Krinkle: Enable logging for 'rdbms' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842933 (https://phabricator.wikimedia.org/T320873) [20:27:10] (03CR) 10Vgutierrez: "my suggestion actually makes PCC unhappy:" [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [20:27:43] (03PS2) 10Urbanecm: dumps: Keep only 13 latest growthmentorship dumps [puppet] - 10https://gerrit.wikimedia.org/r/858410 [20:28:35] (03PS3) 10Urbanecm: dumps: Keep only 13 latest growthmentorship dumps [puppet] - 10https://gerrit.wikimedia.org/r/858410 [20:29:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10RobH) a:05BTullis→03Jclark-ctr John, When the shipment of replacement batteries arrive please coordinate with @b... [20:29:35] (03CR) 10Herron: prometheus: disable caching of prometheus-site.wm.o (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [20:31:08] herron: sorry about that :_) [20:31:17] vgutierrez: haha no worries at all [20:31:41] (03PS3) 10Herron: prometheus: disable caching of prometheus-site.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) [20:33:49] herron: actually.. if we wanna keep the syntax as expected... we should have the same kind of definition in both structures [20:34:21] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: ensure we disable hostcheck [cookbooks] - 10https://gerrit.wikimedia.org/r/858413 [20:35:14] herron: so profile::trafficserver::backend::mapping_rules has one rule per site, hence cache::req_handling should have one per site as well [20:36:05] herron: does it make sense? [20:36:38] technically PS3 should work.. 
but it's a hack around our varnish<->ats mapping logic [20:36:51] (03CR) 10Volans: [C: 03+1] "LGTM, modulo nits in the wording" [cookbooks] - 10https://gerrit.wikimedia.org/r/858413 (owner: 10Jbond) [20:37:14] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: ensure we disable hostcheck [cookbooks] - 10https://gerrit.wikimedia.org/r/858413 [20:37:59] (03PS1) 10Urbanecm: GrowthExperiments: Enable unstarred mentorship filters at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858414 (https://phabricator.wikimedia.org/T318457) [20:38:04] one rule per site, as in create a rule for each fqdn instead of using the regex? [20:38:52] Yes [20:39:10] (03PS4) 10Jbond: sre.hardware.upgrade-firmware: ensure we disable hostcheck [cookbooks] - 10https://gerrit.wikimedia.org/r/858413 [20:39:45] kk I'll update [20:39:56] I'll add one beer per site for you [20:40:31] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2052.codfw.wmnet with OS bullseye [20:40:37] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2052.codfw.wmnet with OS bullseye completed: - elastic2052 (**WARN**) -... [20:42:35] (03PS4) 10Herron: prometheus: disable caching of prometheus-site.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) [20:43:10] vgutierrez: ha what a deal! 
[20:44:07] one beer per site = a full six pack [20:46:11] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetdb2003.codfw.wmnet [20:46:23] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts puppetdb2003.codfw.wmnet [20:46:29] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts puppetdb2003.codfw.wmnet [20:48:22] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: eqiad: (2) VMs requested for aux-k8s-ctrl - https://phabricator.wikimedia.org/T321137 (10jhathaway) [20:49:11] (03PS4) 10Samtar: Increase CirrusSearch-Search pool counter by 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858396 (owner: 10Ebernhardson) [20:49:18] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: eqiad: (2) VMs requested for aux-k8s-ctrl - https://phabricator.wikimedia.org/T321137 (10jhathaway) 05Open→03Resolved Controllers are up and operational [20:49:24] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (2) VMs requested for aux-k8s-worker - https://phabricator.wikimedia.org/T321138 (10jhathaway) 05Open→03Resolved Workers are up and operational [20:52:44] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts puppetdb2003.codfw.wmnet [20:58:07] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38300/console" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [20:58:50] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/858413 (owner: 10Jbond) [21:00:05] brennen and TheresNoTime: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221117T2100). 
Please do the needful. [21:00:05] ebernhardson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:10] I can deploy [21:01:32] \o [21:01:38] ebernhardson: o/ [21:01:40] mine is super easy, it's just changing a number in pool counter settings [21:02:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858396 (owner: 10Ebernhardson) [21:02:11] (03CR) 10Dzahn: [C: 03+2] "host 2620:0:860:103:10:192:32:54" [dns] - 10https://gerrit.wikimedia.org/r/858412 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:02:14] super easy is good :D [21:02:47] (03Merged) 10jenkins-bot: Increase CirrusSearch-Search pool counter by 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858396 (owner: 10Ebernhardson) [21:02:53] !log replacing phab2001 (decom'ed) with phab2002 in Phabricator SPF TXT record in DNS [21:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:16] !log samtar@deploy1002 Started scap: Backport for [[gerrit:858396|Increase CirrusSearch-Search pool counter by 10%]] [21:03:17] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: ensure we disable hostcheck [cookbooks] - 10https://gerrit.wikimedia.org/r/858413 (owner: 10Jbond) [21:03:37] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: ensure we disable hostcheck (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/858413 (owner: 10Jbond) [21:03:41] !log samtar@deploy1002 samtar and ebernhardson: Backport for [[gerrit:858396|Increase CirrusSearch-Search pool counter by 10%]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:03:45] ebernhardson: can you test this on mwdebug? 
[21:04:07] TheresNoTime: not really, it sets a limit for the number of queries that can be issued in parallel cluster-wide [21:04:19] thought as much :) will sync [21:04:46] (03CR) 10BCornwall: [V: 03+1 C: 03+2] prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [21:07:27] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: ensure we disable hostcheck [cookbooks] - 10https://gerrit.wikimedia.org/r/858413 (owner: 10Jbond) [21:07:52] 10SRE, 10Traffic, 10Patch-For-Review: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10BCornwall) 05Open→03Resolved The new patch which was just deployed addresses all these concerns. I'll close the ticket but please feel free to... [21:08:35] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:858396|Increase CirrusSearch-Search pool counter by 10%]] (duration: 05m 19s) [21:08:46] ebernhardson: all live :) [21:09:14] TheresNoTime: awesome, thanks! [21:09:35] no worries [21:09:54] * TheresNoTime will be around for a bit if there are any other patches [21:10:43] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38302/console" [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [21:11:20] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [21:11:46] (03CR) 10Herron: "thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [21:11:57] (03CR) 10Herron: [C: 03+2] prometheus: disable caching of prometheus-site.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/858406 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [21:12:06] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/858397/38301/" [puppet] - 10https://gerrit.wikimedia.org/r/858397 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:12:52] (03CR) 10Dzahn: [V: 03+1] "see https://puppet-compiler.wmflabs.org/output/858397/38301/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/858397 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:13:21] (03PS1) 10BCornwall: prometheus: Remove old ats config export job [puppet] - 10https://gerrit.wikimedia.org/r/858418 (https://phabricator.wikimedia.org/T292815) [21:17:31] PROBLEM - Check systemd state on cp5010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-trafficserver-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:38] (03PS1) 10Dzahn: mariadb: remove phab1001 from production-m3 grants [puppet] - 10https://gerrit.wikimedia.org/r/858419 [21:18:50] (03PS1) 10Dzahn: phabricator: remove phab1001 as src_host from migration class [puppet] - 10https://gerrit.wikimedia.org/r/858420 [21:19:02] !log closing UTC late backport window [21:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:45] (03PS1) 10Dzahn: site: remove phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/858421 [21:22:11] PROBLEM - Check systemd state on cp1081 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-trafficserver-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:49] ^on it [21:22:53] PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The 
following units failed: prometheus-trafficserver-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:24:41] RECOVERY - Check systemd state on cp1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:24:49] (03PS1) 10Herron: Revert "dispatch: upgrade to 20221110 and build with local config.js" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858316 [21:25:14] (03CR) 10Jbond: redfish: add update commands using the patch method (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [21:25:39] RECOVERY - Check systemd state on cp5010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:39] RECOVERY - Check systemd state on cp1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:52] (03CR) 10Herron: [V: 03+2 C: 03+2] Revert "dispatch: upgrade to 20221110 and build with local config.js" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858316 (owner: 10Herron) [21:28:00] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38303/console" [puppet] - 10https://gerrit.wikimedia.org/r/858418 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [21:31:07] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [21:33:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:37:00] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2173'] [21:38:12] (03PS7) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [21:38:23] (03CR) 10Dzahn: "thank you. 
did not expect this would influence other components" [puppet] - 10https://gerrit.wikimedia.org/r/858315 (owner: 10Vgutierrez) [21:39:25] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T321339 (10Papaul) 05Open→03Resolved a:03Papaul This was already fixed. [21:40:39] (03PS8) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [21:41:09] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [21:41:30] (03CR) 10Jbond: redfish: add update commands using the patch method (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [21:42:37] !log andrew@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:42:43] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [21:43:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2173'] [21:44:13] TheresNoTime: done with deploy? [21:44:17] Would like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842933 as late arrival :) [21:44:36] also happy to roll out myself, haven't used the new scap command yet so that could be a first. 
[21:44:45] Krinkle: feel free to :) [21:44:49] (03CR) 10Krinkle: [C: 03+2] Enable logging for 'rdbms' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842933 (https://phabricator.wikimedia.org/T320873) (owner: 10Krinkle) [21:44:57] (03CR) 10CI reject: [V: 04-1] redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [21:44:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:45:35] (03Merged) 10jenkins-bot: Enable logging for 'rdbms' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842933 (https://phabricator.wikimedia.org/T320873) (owner: 10Krinkle) [21:48:23] 21:48:11 'https://gerrit.wikimedia.org/r/842933' is not a valid change number or URL [21:48:37] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (10jhathaway) > Oh wow! I remember us talking bolt a few months ago. How possible is it > to feed it arbitrary hieradata + Puppet plan + template and have it do > its thing?... [21:48:41] I guess it takes the long form r/c/operations/mediawiki-config/+/842933/ only [21:48:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by krinkle@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842933 (https://phabricator.wikimedia.org/T320873) (owner: 10Krinkle) [21:49:02] Krinkle: Feel free to file a ticket to improve. 
[21:49:07] !log krinkle@deploy1002 Started scap: Backport for [[gerrit:842933|Enable logging for 'rdbms' channel (T320873)]] [21:49:14] T320873: Consolidate rdbms logging channels into one - https://phabricator.wikimedia.org/T320873 [21:49:31] !log krinkle@deploy1002 krinkle and krinkle: Backport for [[gerrit:842933|Enable logging for 'rdbms' channel (T320873)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:49:46] s/Feel free/please do/ [21:50:01] 10SRE, 10Traffic-Icebox: Create dashboard showing aggregate data transfer rates per DC/cluster - https://phabricator.wikimedia.org/T284304 (10BCornwall) 05In progress→03Resolved Done! Thanks for the eagle eyes :) I'll go ahead and close this. If there's anything more I'm missing please do re-open! [21:52:48] dancy: ack, done :) [21:54:44] (03PS3) 10Krinkle: es_exporter: Include channel=rdbms in query_log_mediawiki_mysql [puppet] - 10https://gerrit.wikimedia.org/r/842935 (https://phabricator.wikimedia.org/T320873) [21:58:01] !log krinkle@deploy1002 Finished scap: Backport for [[gerrit:842933|Enable logging for 'rdbms' channel (T320873)]] (duration: 08m 54s) [21:58:06] T320873: Consolidate rdbms logging channels into one - https://phabricator.wikimedia.org/T320873 [21:59:16] well, that went smoothly. [21:59:54] I like that it strictly provisions /srv/mediawiki the way it would be from clean state, so that basically removes the need to check git-status plus it does so anyway just in case as part of it, exactly as I normally do by hand. [22:00:18] Nod [22:01:43] The two bits of slight friction for me were: 1) It wasn't very obvious at first what happens on which server, but I think that's just me knowing too much e.g. does the l10n build happen on deploy or mwdebug, but shouldn't matter anyway. Once I saw the sync steps I'm confident nothing has happened anywhere else.
2) While it does all the steps I normally do, it does them all at once, which means given I'm by default in a GNU screen, [22:01:43] I now can't see what happened. This means I have to scrollback which is non-trivial in a screen. [22:01:51] (03CR) 10Dzahn: [C: 03+2] dumps/distribution: fix values that don't fit into data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [22:02:43] For #2, is another way of saying that "It prints too much stuff" ? [22:02:43] I followed the advice from long ago that it's best to always be in a screen when doing commands in prod you don't want interrupted half-way. [22:02:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:03:05] (03CR) 10Cwhite: [C: 03+2] "Existing metrics are unaffected. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/842935 (https://phabricator.wikimedia.org/T320873) (owner: 10Krinkle) [22:03:08] dancy: I'm not sure, I didn't see obvious space to reduce output. [22:03:10] Krinkle: Ctrl + A + ESC - and then cursor [22:03:40] mutante: ack, I have it written down in a reminder file. But it's always a hassle, and doesn't make for good UX if it's required by default just to follow a process as intended. [22:04:04] it's "copy mode" for some reason. Yea, I only remembered this one time because I needed it within the last 24 hours :p [22:04:08] that is, you're meant to read it, so having it go past it by default is not great. [22:04:24] I agree it's a hassle [22:04:29] maybe some of it can be shortened, or more interactive steps to digest it, but that's also more steps for the user to say 'y' to. [22:04:41] or possibly just pressing tab, like in less pagers. [22:04:47] or enter etc. [22:05:05] eg. here is the git-status, does it look good? [22:05:28] or maybe we're saying you don't need to look at it, we're confident it can't be dirty.
[22:05:32] right [22:05:34] I trust the program [22:05:44] yea, same thing happens sometimes when you run a cookbook and are supposed to ACK a diff in DNS [22:05:47] in which case, maybe it can be omitted in favour of scap.verbose.debug in Logstash to dig up there if/when you want to after the fact if something went wrong [22:05:52] If you told it to deploy x, that's what it will do.. if it finds that another commit arrived as well, it will prompt you [22:06:14] but if the only thing pulled down was what you asked, it doesn't ask you again. [22:06:46] yes, so I don't have a good answer, but I generally like not printing output that you're not meant (and for most ppl, can't/won't overcome friction to) read [22:07:20] basically just about the output before the first sync-testservers, after that it's slow enough to keep up [22:07:51] (03CR) 10Dzahn: [C: 03+2] "confirmed no change on clouddumps1002" [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [22:08:13] (03PS8) 10Dzahn: dumps/distribution: move hardcoded host names to parameters [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) [22:08:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:55] (03CR) 10Dzahn: [C: 03+1] "CI should like it after https://gerrit.wikimedia.org/r/c/operations/puppet/+/855096 was merged" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:10:23] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (10jhathaway) > In T321874#8401334, @Joe wrote: I’ll be very straightforward and say that while I think ansible has some merits (and some drawbacks) compared to puppet, there...
[22:10:56] Krinkle: I'll take that into consideration. I'd like to see less output too. [22:14:19] (03PS1) 10Brennen Bearnes: MediaWiki: Temp silence FR-induced clearActionName warnings [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858317 (https://phabricator.wikimedia.org/T323254) [22:15:13] 10SRE: clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Dzahn) [22:16:16] 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations: clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Dzahn) [22:16:48] dancy: captured on-task with colours enabled etc for easy reference [22:17:17] yeah, i'd agree less output would probably be a better user experience [22:17:24] Krinkle: shall i go ahead and deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/858317 ? [22:18:35] brennen: that's fine yeah. I don't know how high the levels will go on group2, not much higher given limited to FR wikis, but dewiki is a fairly large wiki in group2 but not e.g. enwiki [22:18:45] (and dewiki has FR) [22:19:02] might as well silence it now [22:19:10] ::nod::, i'll do this and roll forward. thanks for all the help with this one. [22:19:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858317 (https://phabricator.wikimedia.org/T323254) (owner: 10Brennen Bearnes) [22:20:09] i <3 scap backport. [22:24:29] (03CR) 10Dzahn: [C: 03+1] "Hi Hannah, after the previous change has been merged, this also looks good now. CI likes it and the (brandnew version of) the puppet compi" [puppet] - 10https://gerrit.wikimedia.org/r/852259 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:27:13] (03CR) 10Dzahn: [C: 04-1] "meanwhile phab2001 is no more. 
rebasing" [puppet] - 10https://gerrit.wikimedia.org/r/824412 (owner: 10Jbond) [22:31:03] (03PS2) 10Dzahn: O:phabricator: move common settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/824412 (owner: 10Jbond) [22:33:36] (03CR) 10Dzahn: [C: 04-1] "@John, after revisiting this, since phab2001 is now gone and after rebasing it.. it would just move stuff from phab1001 to common, but pha" [puppet] - 10https://gerrit.wikimedia.org/r/824412 (owner: 10Jbond) [22:34:07] (03Merged) 10jenkins-bot: MediaWiki: Temp silence FR-induced clearActionName warnings [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/858317 (https://phabricator.wikimedia.org/T323254) (owner: 10Brennen Bearnes) [22:34:31] !log brennen@deploy1002 Started scap: Backport for [[gerrit:858317|MediaWiki: Temp silence FR-induced clearActionName warnings (T323254)]] [22:34:37] T323254: MediaWiki->initializeArticle on FlaggedRevs wikis triggers deprecated "Unexpected clearActionName after getActionName" - https://phabricator.wikimedia.org/T323254 [22:34:55] !log brennen@deploy1002 brennen and brennen: Backport for [[gerrit:858317|MediaWiki: Temp silence FR-induced clearActionName warnings (T323254)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [22:35:11] (03PS3) 10Dzahn: O:phabricator: move common settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/824412 (owner: 10Jbond) [22:36:59] (03PS2) 10Ottomata: WIP flink image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 [22:37:02] (03CR) 10Dzahn: "recycled to just drop the phab1001.yaml and just move "main::use_lvs". We will switch away from phab1001 this coming Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/824412 (owner: 10Jbond) [22:37:15] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:37:37] (03PS3) 10Ottomata: WIP flink image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [22:37:43] (03PS4) 10Dzahn: O:phabricator: move common settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond) [22:39:15] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:39:38] (03CR) 10Dzahn: "have to double check if the use_lvs parameter has other effects for the web UI and not just VCS" [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond) [22:41:48] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:858317|MediaWiki: Temp silence FR-induced clearActionName warnings (T323254)]] (duration: 07m 16s) [22:41:52] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [22:41:53] T323254: MediaWiki->initializeArticle on FlaggedRevs wikis triggers deprecated "Unexpected clearActionName after getActionName" - https://phabricator.wikimedia.org/T323254 [22:42:51] (03PS1) 10Dzahn: phabricator: switch logmail mails from phab1001 to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/858432 (https://phabricator.wikimedia.org/T280597) [22:42:52] rolling train to all wikis. 
[22:45:07] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858434 (https://phabricator.wikimedia.org/T320515) [22:45:09] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858434 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [22:45:49] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858434 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [22:46:19] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:46:45] (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:48:38] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [22:50:08] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.10 refs T320515 [22:50:13] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [22:53:32] 10SRE, 10ops-codfw, 10DBA: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (10Papaul) I have a case open with Dell ` Service Request: 1115331653 [22:56:45] (JobUnavailable) resolved: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:59:06] (03PS7) 10Cathal Mooney: Add function to expose required device VRFs to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/857593 (https://phabricator.wikimedia.org/T312635) [23:01:59] (03CR) 10Dzahn: [V: 03+1] "I tested manually running one of 
these (/usr/local/bin/community_metrics.sh) and it talks to the m3-slave at the right port and it has the" [puppet] - 10https://gerrit.wikimedia.org/r/858432 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:02:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:48] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [23:07:17] (03CR) 10Dzahn: [V: 03+1] "I also checked /usr/local/bin/project_changes.sh and /usr/local/bin/mfa_check.sh . all works for me. @Andre just fyi, mails will come " [puppet] - 10https://gerrit.wikimedia.org/r/858432 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:07:21] (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: switch logmail mails from phab1001 to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/858432 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:08:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:30] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "puppet removed all the systemd timers/jobs etc from phab1001 and added them on phab1004 now. let me know if any issues, Andre. 
Avoiding du" [puppet] - 10https://gerrit.wikimedia.org/r/858432 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:10:52] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "this was now possible on an "inactive" host because we fixed the DB connection to m3-slave" [puppet] - 10https://gerrit.wikimedia.org/r/858432 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:12:45] (JobUnavailable) firing: Reduced availability for job jmx_wcqs_blazegraph in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:23:09] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:46:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:48:56] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T323339 (10phaultfinder) [23:49:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring