[00:01:33] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:28] (03PS1) 10Jeena Huneidi: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/990332 [00:38:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/990178 [00:38:52] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/990178 (owner: 10TrainBranchBot) [00:45:23] (03PS2) 10Platonides: lists: Allow images from upload.wikimedia.org in CSP [puppet] - 10https://gerrit.wikimedia.org/r/987317 (https://phabricator.wikimedia.org/T353755) (owner: 10Legoktm) [00:47:48] (03CR) 10Platonides: lists: Allow images from upload.wikimedia.org in CSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/987317 (https://phabricator.wikimedia.org/T353755) (owner: 10Legoktm) [01:13:57] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:14:16] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [01:15:21] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Swift [01:15:39] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 1.272 second response time https://wikitech.wikimedia.org/wiki/Swift [01:16:02] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/990178 (owner: 10TrainBranchBot) [02:05:41] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [02:08:37] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26899 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [02:26:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:27:43] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:39:13] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:09:13] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:57] PROBLEM - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [05:13:05] RECOVERY - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [05:48:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:53:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:53:45] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:58:45] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:39:06] (03PS3) 10Anzx: thwiki: update tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) [06:39:44] (03PS4) 10Anzx: thwiki: update tagline and optimise other logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) [07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:13] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [08:51:45] 10SRE-tools, 10Infrastructure-Foundations: Upgrade BGPAlerter to 1.33 - https://phabricator.wikimedia.org/T354998 (10ayounsi) [09:11:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:41:44] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [11:09:13] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:31:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:56:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:16:00] (03PS16) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [14:39:14] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:13] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:50:55] (03PS4) 10Slyngshede: Ganeti memory preassure alerting. [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) [17:51:22] (03CR) 10Slyngshede: Ganeti memory preassure alerting. (035 comments) [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [17:52:54] (03CR) 10CI reject: [V: 04-1] Ganeti memory preassure alerting. [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [17:58:35] (03PS5) 10Slyngshede: Ganeti memory preassure alerting. [alerts] - 10https://gerrit.wikimedia.org/r/989097 (https://phabricator.wikimedia.org/T350694) [17:59:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:59:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:02:09] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:06:43] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:07:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.551 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:07:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:00:18] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:11:53] PROBLEM - WDQS SPARQL on wdqs1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:43:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:44:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:44:25] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:47:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.620 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:47:23] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:48:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:00:18] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable