[00:00:54] RECOVERY - Check systemd state on phab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:58] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:18] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:38] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:01] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[01:08:02] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[01:15:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:43] (PS1) Dzahn: devtools: set mariadb datadir path for phorge-1001 instance [puppet] - https://gerrit.wikimedia.org/r/890131 (https://phabricator.wikimedia.org/T328595)
[02:11:59] (CR) Dzahn: [C: +2] devtools: set mariadb datadir path for phorge-1001 instance [puppet] - https://gerrit.wikimedia.org/r/890131 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[02:17:17] (PS1) Dzahn: phorge: install php-zip and php-gd packages [puppet] - https://gerrit.wikimedia.org/r/890132 (https://phabricator.wikimedia.org/T328595)
[02:20:08] (CR) Dzahn: [C: +2] phorge: install php-zip and php-gd packages [puppet] - https://gerrit.wikimedia.org/r/890132 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[02:22:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:52] (PS1) Dzahn: phorge: install php-apcu and python3-pygments [puppet] - https://gerrit.wikimedia.org/r/890133 (https://phabricator.wikimedia.org/T328595)
[02:29:58] (CR) Krinkle: [C: +1] "emotional support and evidence of working mouse. Would clear some warning noise :)" [puppet] - https://gerrit.wikimedia.org/r/889892 (https://phabricator.wikimedia.org/T312823) (owner: BCornwall)
[02:32:23] (CR) Dzahn: [C: +2] phorge: install php-apcu and python3-pygments [puppet] - https://gerrit.wikimedia.org/r/890133 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[02:44:20] (PS1) Dzahn: phorge: add parameter and value for the repo path [puppet] - https://gerrit.wikimedia.org/r/890134 (https://phabricator.wikimedia.org/T328595)
[02:46:54] (CR) Dzahn: [C: +2] phorge: add parameter and value for the repo path [puppet] - https://gerrit.wikimedia.org/r/890134 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[02:48:57] (PS1) Dzahn: devtools: fix typo in hiera key name for phorge [puppet] - https://gerrit.wikimedia.org/r/890135
[02:49:29] (CR) Dzahn: [C: +2] devtools: fix typo in hiera key name for phorge [puppet] - https://gerrit.wikimedia.org/r/890135 (owner: Dzahn)
[04:15:54] (PS1) Sushrith Bogi: Reduce height of the article toolbar [mediawiki-config] - https://gerrit.wikimedia.org/r/890140 (https://phabricator.wikimedia.org/T316950)
[05:08:46] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[05:12:47] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[05:51:18] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.017e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[05:57:16] PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:00:52] RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:22:50] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[08:00:04] PROBLEM - Disk space on kubestagetcd1006 is CRITICAL: DISK CRITICAL - free space: / 709 MB (3% inode=95%): /tmp 709 MB (3% inode=95%): /var/tmp 709 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kubestagetcd1006&var-datasource=eqiad+prometheus/ops
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230218T0800)
[08:21:11] !log delete /var/log/syslog.1 on kubestagetcd1006 to free space
[08:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:31] !log delete /var/log/{messages,user.log}.1 on kubestagetcd1006 to free space
[08:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:22] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:32] !log kill leftover processes of user `mepps` (offboarded) from stat100[4,5] to unblock puppet
[08:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:40] RECOVERY - Disk space on kubestagetcd1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=kubestagetcd1006&var-datasource=eqiad+prometheus/ops
[09:09:00] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[09:13:04] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[10:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:25:35] (PS8) Fomafix: Add redirects from 'sgs' to 'bat-smg' [puppet] - https://gerrit.wikimedia.org/r/481540 (https://phabricator.wikimedia.org/T204830)
[12:27:21] (PS4) Fomafix: Add 'rup' as alias for 'roa-rup' [puppet] - https://gerrit.wikimedia.org/r/527917 (https://phabricator.wikimedia.org/T17988)
[12:28:44] (PS4) Fomafix: Add 'vro' as alias for 'fiu-vro' [puppet] - https://gerrit.wikimedia.org/r/527915 (https://phabricator.wikimedia.org/T31186)
[12:29:31] (PS5) Fomafix: Add 'egl' as alias for 'eml' [puppet] - https://gerrit.wikimedia.org/r/527933 (https://phabricator.wikimedia.org/T36217)
[12:34:53] (PS5) Fomafix: Add 'nrf' as alias for 'nrm' [puppet] - https://gerrit.wikimedia.org/r/527909 (https://phabricator.wikimedia.org/T25216)
[12:36:36] (PS9) Fomafix: Add redirects from 'sgs' to 'bat-smg' [puppet] - https://gerrit.wikimedia.org/r/481540 (https://phabricator.wikimedia.org/T204830)
[12:57:28] SRE, Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (TheDJ) Open→Resolved a:TheDJ
[12:58:20] (CR) Aklapper: "Sushrith: Did you test this locally in your MediaWiki setup and can you confirm that it does fix the problem?" [mediawiki-config] - https://gerrit.wikimedia.org/r/890140 (https://phabricator.wikimedia.org/T316950) (owner: Sushrith Bogi)
[13:13:47] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[13:17:48] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[13:58:02] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[14:01:28] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[14:20:30] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:57] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:06:57] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:10:32] SRE, DNS, Traffic, Patch-For-Review, Software-Licensing: Add LICENSE to operations/dns scripts - https://phabricator.wikimedia.org/T291323 (Legoktm) >>! In T291323#8626159, @BCornwall wrote: >..., @Legoktm, ... can each of you approve of relicensing the content of your work in the operations/...
[15:11:15] (CR) Legoktm: [C: +1] utils: Add SPDX Apache-2.0 license to utils [dns] - https://gerrit.wikimedia.org/r/890016 (https://phabricator.wikimedia.org/T291323) (owner: BCornwall)
[17:14:01] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[17:15:58] PROBLEM - Host thumbor1005 is DOWN: PING CRITICAL - Packet loss = 100%
[17:18:03] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[18:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:16:50] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:13:36] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:47] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[21:22:48] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[22:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable