[00:17:28] !log fab@deploy2002 Started deploy [airflow-dags/research@5edcd7b]: (no justification provided)
[00:17:34] !log fab@deploy2002 Finished deploy [airflow-dags/research@5edcd7b]: (no justification provided) (duration: 00m 05s)
[01:15:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:59:39] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 174 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:05:27] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 40 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:40:02] (PS1) ArielGlenn: Make dumpsdata1005 the nfs primary for xmldumps and dumpsdata1003 a spare [puppet] - https://gerrit.wikimedia.org/r/900801 (https://phabricator.wikimedia.org/T330573)
[06:40:32] I'll be making that happen later in the day, in some hours, not right now, for anyone following along.
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230319T0700)
[07:56:41] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:05:59] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:15:14] (CR) Vipz: [C: +1] "This was the variant of localized 'Wikipedia' agreed on by community consensus." [mediawiki-config] - https://gerrit.wikimedia.org/r/900675 (https://phabricator.wikimedia.org/T332468) (owner: Acamicamacaraca)
[09:17:37] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:19] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:35] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:15] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:31] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:45] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:45] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:57:21] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:25] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:07] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:04] (Abandoned) Majavah: P:wmcs::nfs: maintain_dbusers: fix monitoring on inactive hosts [puppet] - https://gerrit.wikimedia.org/r/899476 (owner: Majavah)
[11:47:25] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:59] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:07] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:45] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:48:49] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:58:27] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:39] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:27:19] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:31] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:09] PROBLEM - puppet last run on wdqs1012 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:58:05] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:17:13] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:19] !log work starting now to swap dumpsdata1005 in for primary nfs server, replacing dumpsdata1003 which will become dumps spare host
[14:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:47] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:26] (CR) ArielGlenn: [C: +2] Make dumpsdata1005 the nfs primary for xmldumps and dumpsdata1003 a spare [puppet] - https://gerrit.wikimedia.org/r/900801 (https://phabricator.wikimedia.org/T330573) (owner: ArielGlenn)
[14:47:55] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:27] if you see whines from snapshot1009, please ignore, I'm working on it (it's a testbed host in any case)
[14:57:33] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:07] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:28:07] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:39] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[15:47:17] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:35] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[15:58:27] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:17:09] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:28:41] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:44:17] !log dumpsdata1005 conversion to primary dumps nfs server done
[16:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:51] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:29] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:35] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:27:01] PROBLEM - Disk space on mwlog1002 is CRITICAL: DISK CRITICAL - free space: /srv 443692 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwlog1002&var-datasource=eqiad+prometheus/ops
[17:28:09] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:21] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:58:57] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:18:13] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:27:51] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:28:16] (PS3) Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[18:33:00] (PS4) Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[18:47:03] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:53:23] ops-eqiad, DC-Ops: Eqiad: Spint Week main tacking task - https://phabricator.wikimedia.org/T332516 (Papaul)
[18:53:39] ops-eqiad, DC-Ops: Eqiad: Sprint Week main tacking task - https://phabricator.wikimedia.org/T332516 (Papaul)
[18:54:03] ops-eqiad, DC-Ops: Eqiad: Sprint Week main tracking task - https://phabricator.wikimedia.org/T332516 (Papaul)
[18:58:37] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:37] (PS5) Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[19:00:46] ops-eqiad, DC-Ops: Netbox reports cleanup - https://phabricator.wikimedia.org/T332518 (Papaul)
[19:17:47] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:19:47] (PS6) Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[19:27:44] (PS7) Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[19:29:23] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:48:28] SRE: Training checklist runbook review (Sprint Week 2023-03) - https://phabricator.wikimedia.org/T332391 (LSobanski)
[19:48:37] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:55:51] (PS8) Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[19:58:15] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:09:06] (PS1) Samtar: InitialiseSettings: Set wgAbuseFilterLocallyDisabledGlobalActions [mediawiki-config] - https://gerrit.wikimedia.org/r/900828 (https://phabricator.wikimedia.org/T332521)
[20:16:06] ops-eqiad, DC-Ops: Eqiad: Sprint Week main tracking task - https://phabricator.wikimedia.org/T332516 (Papaul)
[20:17:33] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:19:42] (PS9) Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[20:29:05] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:38:50] ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Papaul)
[20:40:30] ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Papaul)
[20:43:39] (PS10) Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[20:48:21] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:58:01] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:59:32] (CR) Daimona Eaytoy: [C: +1] InitialiseSettings: Set wgAbuseFilterLocallyDisabledGlobalActions [mediawiki-config] - https://gerrit.wikimedia.org/r/900828 (https://phabricator.wikimedia.org/T332521) (owner: Samtar)
[21:17:25] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:29:01] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:38:19] PROBLEM - Query Service HTTP Port on wdqs1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[21:48:17] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:56:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:57:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:57:53] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:58:39] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.313 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:58:57] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49710 bytes in 1.201 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:17:15] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:28:49] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:48:13] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:57:53] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:57] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:22:39] RECOVERY - Disk space on mwlog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwlog1002&var-datasource=eqiad+prometheus/ops
[23:28:37] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:47:57] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:55:37] (CR) Samtar: [C: -2] "Task stalled, T332006#8708420" [mediawiki-config] - https://gerrit.wikimedia.org/r/898749 (https://phabricator.wikimedia.org/T332006) (owner: Samtar)
[23:59:29] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state