[00:06:41] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P31799 and previous config saved to /var/cache/conftool/dbconfig/20220724-000641-ladsgroup.json
[00:21:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P31800 and previous config saved to /var/cache/conftool/dbconfig/20220724-002147-ladsgroup.json
[00:36:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T312863)', diff saved to https://phabricator.wikimedia.org/P31801 and previous config saved to /var/cache/conftool/dbconfig/20220724-003652-ladsgroup.json
[00:36:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[00:36:57] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[00:37:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[00:37:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[00:37:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[00:37:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T312863)', diff saved to https://phabricator.wikimedia.org/P31802 and previous config saved to /var/cache/conftool/dbconfig/20220724-003718-ladsgroup.json
[00:43:03] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312863)', diff saved to https://phabricator.wikimedia.org/P31803 and previous config saved to /var/cache/conftool/dbconfig/20220724-025820-ladsgroup.json
[02:58:27] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[03:13:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P31804 and previous config saved to /var/cache/conftool/dbconfig/20220724-031326-ladsgroup.json
[03:28:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P31805 and previous config saved to /var/cache/conftool/dbconfig/20220724-032831-ladsgroup.json
[03:30:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312863)', diff saved to https://phabricator.wikimedia.org/P31806 and previous config saved to /var/cache/conftool/dbconfig/20220724-033027-ladsgroup.json
[03:30:32] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[03:43:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312863)', diff saved to https://phabricator.wikimedia.org/P31807 and previous config saved to /var/cache/conftool/dbconfig/20220724-034336-ladsgroup.json
[03:43:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[03:43:43] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[03:43:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[03:43:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T312863)', diff saved to https://phabricator.wikimedia.org/P31808 and previous config saved to /var/cache/conftool/dbconfig/20220724-034356-ladsgroup.json
[03:45:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31809 and previous config saved to /var/cache/conftool/dbconfig/20220724-034532-ladsgroup.json
[04:00:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31810 and previous config saved to /var/cache/conftool/dbconfig/20220724-040037-ladsgroup.json
[04:15:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312863)', diff saved to https://phabricator.wikimedia.org/P31811 and previous config saved to /var/cache/conftool/dbconfig/20220724-041542-ladsgroup.json
[04:15:48] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220724T0700)
[07:06:01] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The following units failed: phabricator_clean_tmp_files.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:24:51] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:45] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 1696 MB (1% inode=84%): /tmp 1696 MB (1% inode=84%): /var/tmp 1696 MB (1% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[08:30:43] that archiva issue ^ is https://phabricator.wikimedia.org/T313386
[08:31:19] SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (hashar) It is erroring out now: PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 1696 MB (1% inode=84%) /t...
[08:47:53] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:17:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312863)', diff saved to https://phabricator.wikimedia.org/P31812 and previous config saved to /var/cache/conftool/dbconfig/20220724-091706-ladsgroup.json
[09:17:14] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[09:32:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P31813 and previous config saved to /var/cache/conftool/dbconfig/20220724-093211-ladsgroup.json
[09:47:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P31814 and previous config saved to /var/cache/conftool/dbconfig/20220724-094716-ladsgroup.json
[09:49:21] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:02:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312863)', diff saved to https://phabricator.wikimedia.org/P31815 and previous config saved to /var/cache/conftool/dbconfig/20220724-100221-ladsgroup.json
[10:02:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[10:02:27] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[10:02:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[11:30:55] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:21:51] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:36:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:32] SRE, Traffic-Icebox, HTTPS, Performance-Team (Radar): Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Diskdance)
[12:46:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:55] SRE, Traffic-Icebox, HTTPS, Performance-Team (Radar): Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Diskdance) By the way, I want to emphasize that QUIC encrypts initial packets of a connection. Even though its key is known, as a result of it, QUIC...
[13:29:33] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:30:59] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:36:35] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:39:09] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:42:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[14:43:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[14:57:11] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:32:12] SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) I'm looking into this issue now, since I'm on leave next week and I would rather not leave it any longer. I will take the advice...
[15:38:49] SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) I checked that the primary (ganeti1008) and secondary (ganeti1025) nodes both have plenty of spare disk space: Then I added a 2...
[16:17:37] (PS1) Abijeet Patro: ReviewTranslationActionApi: Move to namespace and add strict types [extensions/Translate] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/816272 (https://phabricator.wikimedia.org/T312008)
[16:23:19] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:36:06] (PS2) Abijeet Patro: ReviewTranslationActionApi: Move to namespace and add strict types [extensions/Translate] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/816272 (https://phabricator.wikimedia.org/T312008)
[17:01:39] (PS1) Stang: ptwikinews: Install WikiLove extension [mediawiki-config] - https://gerrit.wikimedia.org/r/816316 (https://phabricator.wikimedia.org/T313173)
[17:05:37] (Abandoned) Stang: id_internalwikimedia: Enable extension UploadWizard [mediawiki-config] - https://gerrit.wikimedia.org/r/788812 (https://phabricator.wikimedia.org/T304291) (owner: Stang)
[17:06:07] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[17:08:35] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[19:03:17] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:37:10] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM archiva1002.wikimedia.org
[20:37:32] SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (ops-monitoring-bot) VM archiva1002.wikimedia.org rebooted by btullis@cumin1001 with reason: Adding disk
[20:44:49] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:49:04] SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) I had to rename the network interface from ens5 to ens14 in `/etc/network/interfaces` as described here: https://wikitech.wikime...
[20:54:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM archiva1002.wikimedia.org
[20:55:23] PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:59:46] SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) I've mounted `/dev/vdb1` to `/mnt` temporarily and started an rsync operation with: ` sudo rsync -av /var/lib/archiva/ /mnt ` On...
[21:03:03] RECOVERY - Check systemd state on archiva1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:14:55] SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) All steps above have now been followed, except the final removal of the backup in `/var/lib/archiva-bak` The service starts and...
[21:17:42] SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) The git-fat link service appears to work without errors: ` Jul 24 21:15:01 archiva1002 systemd[1]: Started Archiva tool to creat...
[21:18:42] SRE, Discovery, wmde-team-b-tech, Data Engineering Planning (Sprint 01): archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) Open→Resolved p:Triage→High a:BTullis
[21:21:07] SRE, Discovery, wmde-team-b-tech, Data Engineering Planning (Sprint 01): archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) The disk space is now looking much more healthy. ` btullis@archiva1002:~$ df -h -t ext4 Filesystem Siz...
[21:25:27] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[22:47:45] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook