[00:19:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424330 (10phaultfinder) [00:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:35:16] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 66410296 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:36:16] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 112 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:38:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1107048 [00:38:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1107048 (owner: 10TrainBranchBot) [00:39:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424336 (10phaultfinder) [00:54:10] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1107048 (owner: 10TrainBranchBot) [01:08:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1107049 [01:08:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1107049 (owner: 10TrainBranchBot) [01:26:27] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1107049 (owner: 10TrainBranchBot) [01:28:26] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:39:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424343 (10phaultfinder) [01:44:28] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/88835e017b3a101a74dd2ba088c0dceb775e0695b6d11d20a5134655fd40e198/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:04:28] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:12:46] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:17:48] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 123.71 ms [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:49] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424365 (10phaultfinder) [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:28:26] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:15:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 848.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:20:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 860.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:28:26] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424451 (10phaultfinder) [12:10:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:14:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424460 (10phaultfinder) [12:15:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424494 (10phaultfinder) [13:28:26] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:37] 10SRE-swift-storage: Cannot move File:فندق قصبة بوزنيقة.jpg to File:Kasbah Hotel in Bouznika.jpg on Commons - https://phabricator.wikimedia.org/T382750#10424500 (10SD_hehua) represented [14:29:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424504 (10phaultfinder) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424526 (10phaultfinder) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:01:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:08:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424532 (10phaultfinder) [16:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:32:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424537 (10phaultfinder) [17:20:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424543 (10phaultfinder) [17:25:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424544 (10phaultfinder) [17:28:26] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:55:17] 10SRE-swift-storage, 10Thumbor: Thumbnail generation errors - https://phabricator.wikimedia.org/T382763#10424553 (10AntiCompositeNumber) Testing locally: ` thumbor-1 | 2024-12-28 18:53:07 thumbor:DEBUG [Proxy] load: '.jpg' thumbor-1 | 2024-12-28 18:53:07 thumbor:DEBUG [Proxy] Looking for a jpg engine thumbor... [18:55:24] 10SRE-swift-storage, 10Thumbor: Thumbnail generation errors - https://phabricator.wikimedia.org/T382763#10424555 (10AntiCompositeNumber) →14Duplicate dup:03T381594 [19:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424571 (10phaultfinder) [19:29:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424572 (10phaultfinder) [19:54:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424573 (10phaultfinder) [20:01:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:20:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424574 (10phaultfinder) [21:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424582 (10phaultfinder) [21:28:26] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424589 (10phaultfinder) [22:04:09] (03CR) 10Don-vip: [C:03+1] imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber) [22:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10424607 (10phaultfinder) [22:46:40] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING WARNING - Packet loss = 77%, RTA = 30.55 ms [22:53:04] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%