[00:06:41] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P31799 and previous config saved to /var/cache/conftool/dbconfig/20220724-000641-ladsgroup.json
[00:21:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P31800 and previous config saved to /var/cache/conftool/dbconfig/20220724-002147-ladsgroup.json
[00:36:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T312863)', diff saved to https://phabricator.wikimedia.org/P31801 and previous config saved to /var/cache/conftool/dbconfig/20220724-003652-ladsgroup.json
[00:36:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[00:36:57] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[00:37:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[00:37:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[00:37:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[00:37:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T312863)', diff saved to https://phabricator.wikimedia.org/P31802 and previous config saved to /var/cache/conftool/dbconfig/20220724-003718-ladsgroup.json
[00:43:03] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] <jinxer-wm>	 (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312863)', diff saved to https://phabricator.wikimedia.org/P31803 and previous config saved to /var/cache/conftool/dbconfig/20220724-025820-ladsgroup.json
[02:58:27] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[03:13:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P31804 and previous config saved to /var/cache/conftool/dbconfig/20220724-031326-ladsgroup.json
[03:28:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P31805 and previous config saved to /var/cache/conftool/dbconfig/20220724-032831-ladsgroup.json
[03:30:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312863)', diff saved to https://phabricator.wikimedia.org/P31806 and previous config saved to /var/cache/conftool/dbconfig/20220724-033027-ladsgroup.json
[03:30:32] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[03:43:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312863)', diff saved to https://phabricator.wikimedia.org/P31807 and previous config saved to /var/cache/conftool/dbconfig/20220724-034336-ladsgroup.json
[03:43:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[03:43:43] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[03:43:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[03:43:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T312863)', diff saved to https://phabricator.wikimedia.org/P31808 and previous config saved to /var/cache/conftool/dbconfig/20220724-034356-ladsgroup.json
[03:45:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31809 and previous config saved to /var/cache/conftool/dbconfig/20220724-034532-ladsgroup.json
[04:00:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31810 and previous config saved to /var/cache/conftool/dbconfig/20220724-040037-ladsgroup.json
[04:15:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312863)', diff saved to https://phabricator.wikimedia.org/P31811 and previous config saved to /var/cache/conftool/dbconfig/20220724-041542-ladsgroup.json
[04:15:48] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220724T0700)
[07:06:01] <icinga-wm>	 PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The following units failed: phabricator_clean_tmp_files.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:24:51] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:45] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 1696 MB (1% inode=84%): /tmp 1696 MB (1% inode=84%): /var/tmp 1696 MB (1% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[08:30:43] <hasharAway>	 that archiva issue ^  is https://phabricator.wikimedia.org/T313386
[08:31:19] <wikibugs>	 SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (hashar) It is erroring out now:  PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 1696 MB (1% inode=84%) /t...
[08:47:53] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:17:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312863)', diff saved to https://phabricator.wikimedia.org/P31812 and previous config saved to /var/cache/conftool/dbconfig/20220724-091706-ladsgroup.json
[09:17:14] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[09:32:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P31813 and previous config saved to /var/cache/conftool/dbconfig/20220724-093211-ladsgroup.json
[09:47:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P31814 and previous config saved to /var/cache/conftool/dbconfig/20220724-094716-ladsgroup.json
[09:49:21] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:02:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312863)', diff saved to https://phabricator.wikimedia.org/P31815 and previous config saved to /var/cache/conftool/dbconfig/20220724-100221-ladsgroup.json
[10:02:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[10:02:27] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[10:02:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[11:30:55] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:21:51] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:36:47] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:32] <wikibugs>	 SRE, Traffic-Icebox, HTTPS, Performance-Team (Radar): Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Diskdance)
[12:46:59] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:55] <wikibugs>	 SRE, Traffic-Icebox, HTTPS, Performance-Team (Radar): Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Diskdance) By the way, I want to emphasize that QUIC encrypts initial packets of a connection. Even though its key is known, as a result of it, QUIC...
[13:29:33] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:30:59] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:36:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:39:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:42:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[14:43:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[14:57:11] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:32:12] <wikibugs>	 SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) I'm looking into this issue now, since I'm on leave next week and I would rather not leave it any longer. I will take the advice...
[15:38:49] <wikibugs>	 SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) I checked that the primary (ganeti1008) and secondary (ganeti1025) nodes both have plenty of spare disk space:  Then I added a 2...
[16:17:37] <wikibugs>	 (PS1) Abijeet Patro: ReviewTranslationActionApi: Move to namespace and add strict types [extensions/Translate] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/816272 (https://phabricator.wikimedia.org/T312008)
[16:23:19] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:36:06] <wikibugs>	 (PS2) Abijeet Patro: ReviewTranslationActionApi: Move to namespace and add strict types [extensions/Translate] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/816272 (https://phabricator.wikimedia.org/T312008)
[17:01:39] <wikibugs>	 (PS1) Stang: ptwikinews: Install WikiLove extension [mediawiki-config] - https://gerrit.wikimedia.org/r/816316 (https://phabricator.wikimedia.org/T313173)
[17:05:37] <wikibugs>	 (Abandoned) Stang: id_internalwikimedia: Enable extension UploadWizard [mediawiki-config] - https://gerrit.wikimedia.org/r/788812 (https://phabricator.wikimedia.org/T304291) (owner: Stang)
[17:06:07] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[17:08:35] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[19:03:17] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:37:10] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM archiva1002.wikimedia.org
[20:37:32] <wikibugs>	 SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (ops-monitoring-bot) VM archiva1002.wikimedia.org rebooted by btullis@cumin1001 with reason: Adding disk
[20:44:49] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:49:04] <wikibugs>	 SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) I had to rename the network interface from ens5 to ens14 in `/etc/network/interfaces` as described here: https://wikitech.wikime...
[20:54:13] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM archiva1002.wikimedia.org
[20:55:23] <icinga-wm>	 PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:59:46] <wikibugs>	 SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) I've mounted `/dev/vdb1` to `/mnt` temporarily and started an rsync operation with: ` sudo rsync -av /var/lib/archiva/ /mnt ` On...
[21:03:03] <icinga-wm>	 RECOVERY - Check systemd state on archiva1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:14:55] <wikibugs>	 SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) All steps above have now been followed, except the final removal of the backup in `/var/lib/archiva-bak`  The service starts and...
[21:17:42] <wikibugs>	 SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) The git-fat link service appears to work without errors: ` Jul 24 21:15:01 archiva1002 systemd[1]: Started Archiva tool to creat...
[21:18:42] <wikibugs>	 SRE, Discovery, wmde-team-b-tech, Data Engineering Planning (Sprint 01): archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) Open→Resolved p:Triage→High a:BTullis
[21:21:07] <wikibugs>	 SRE, Discovery, wmde-team-b-tech, Data Engineering Planning (Sprint 01): archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) The disk space is now looking much more healthy. ` btullis@archiva1002:~$ df -h -t ext4 Filesystem      Siz...
[21:25:27] <icinga-wm>	 RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[22:47:45] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook