[00:01:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:05:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:06:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:20:16] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 77789520 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:21:16] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 69248 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:23:36] PROBLEM - Host cp3079 is DOWN: PING CRITICAL - Packet loss = 100% [00:23:44] RECOVERY - Host cp3079 is UP: PING OK - Packet loss = 0%, RTA = 104.20 ms [00:39:59] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220869 [00:39:59] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220869 (owner: TrainBranchBot) [00:50:16] RESOLVED: ErrorBudgetBurn: 
xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:51:59] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220869 (owner: TrainBranchBot) [01:00:39] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:06:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:10:14] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220870 [01:10:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220870 (owner: TrainBranchBot) [01:32:52] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220870 (owner: TrainBranchBot) [01:50:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:51:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:56:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:56:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [01:57:34] !incidents [01:57:34] 7235 (UNACKED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet eqsin) [01:57:34] 7234 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [01:57:34] 7233 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [01:57:41] !ack 7235 [01:57:42] 7235 (ACKED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet eqsin) [01:57:46] well that's an interesting one [01:59:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:01:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [02:04:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:05:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:21:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:25:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:28:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:33:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:35:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:05:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:08:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:13:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:39:50] PROBLEM - Host ps1-b6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [03:40:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:40:40] PROBLEM - Host lsw1-b6-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:48:22] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [04:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:41:36] PROBLEM - Host cp3079 is DOWN: PING CRITICAL - Packet loss = 100% [04:41:44] RECOVERY - Host cp3079 is UP: PING OK - Packet loss = 0%, RTA = 78.23 ms [04:42:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:48:20] PROBLEM - Host asw1-b4-magru is DOWN: PING CRITICAL - Packet loss = 100% [04:48:20] PROBLEM - Host asw1-b3-magru is DOWN: PING CRITICAL - Packet loss = 100% [04:48:48] RECOVERY - Host asw1-b4-magru is UP: PING OK - Packet loss = 0%, RTA = 161.66 ms [04:48:48] RECOVERY - Host asw1-b3-magru is UP: PING OK - Packet loss = 0%, RTA = 141.68 ms [05:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:17:51] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:23:16] FIRING: 
ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:28:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:33:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:21:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:13:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:14:32] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) [07:14:32] PROBLEM - Host asw1-b3-magru is DOWN: CRITICAL - Time to live exceeded (195.200.68.130) [07:14:36] PROBLEM - Host install7002 is DOWN: CRITICAL - Time to live exceeded (195.200.68.100) [07:15:22] RECOVERY - Host install7002 is UP: PING OK - Packet loss = 0%, RTA = 137.10 ms [07:15:24] RECOVERY - Host asw1-b3-magru is UP: PING OK - Packet loss = 0%, RTA = 142.03 ms [07:15:28] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, 
RTA = 137.42 ms [07:28:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:38:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:41:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:48:37] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251225T0800) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251225T0800) [08:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:56:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:59:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:09:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:15:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:55:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:21:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:35:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:40:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [12:06:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:11:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:16:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:20:15] 
FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:22:11] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1221039 (owner: L10n-bot) [12:25:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:50:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:55:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:00:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:01:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - 
kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:05:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:10:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:10:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:28:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:29:00] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:30:25] (CR) Aklapper: [V:+2 C:+2] "Thanks. Indeed running `grep -r "return 'en'" .` in upstream returns zero results, while `grep -r "return 'en_" .` gets more interesting." 
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217872 (https://phabricator.wikimedia.org/T412651) (owner: Pppery) [13:33:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:44:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:44:53] (CR) Aklapper: [V:+2 C:+2] "Thanks! Makes sense, and behavior/output is no difference before and after." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217882 (https://phabricator.wikimedia.org/T412650) (owner: Pppery) [13:49:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:55:56] (CR) Aklapper: [V:+2 C:+2] "lgtm" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217883 (https://phabricator.wikimedia.org/T412652) (owner: Pppery) [14:17:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:22:15] RESOLVED: [2x] 
MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:24:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:25:16] (CR) Aklapper: [V:+2 C:+2] "Thanks! This change seems to remove that `Ignoring string "{"authors":["foo"]}"; not present in translation source file.` output noise whe" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217878 (https://phabricator.wikimedia.org/T412649) (owner: Pppery) [14:29:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:27] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2079.mgmt:22 - https://phabricator.wikimedia.org/T413475 (phaultfinder) NEW [15:39:28] ops-codfw, DC-Ops: Unresponsive management for ml-serve2006.mgmt:22 - https://phabricator.wikimedia.org/T413476 (phaultfinder) NEW [15:39:29] ops-codfw, DC-Ops: Unresponsive management for aqs2007.mgmt:22 - 
https://phabricator.wikimedia.org/T413474 (phaultfinder) NEW [15:40:24] ops-codfw, DC-Ops: Unresponsive management for aqs2006.mgmt:22 - https://phabricator.wikimedia.org/T413479 (phaultfinder) NEW [15:40:25] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2104.mgmt:22 - https://phabricator.wikimedia.org/T413477 (phaultfinder) NEW [15:40:26] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2279.mgmt:22 - https://phabricator.wikimedia.org/T413478 (phaultfinder) NEW [15:40:28] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2106.mgmt:22 - https://phabricator.wikimedia.org/T413481 (phaultfinder) NEW [15:40:29] ops-codfw, DC-Ops: Unresponsive management for db2161.mgmt:22 - https://phabricator.wikimedia.org/T413483 (phaultfinder) NEW [15:40:30] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2101.mgmt:22 - https://phabricator.wikimedia.org/T413480 (phaultfinder) NEW [15:40:31] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2280.mgmt:22 - https://phabricator.wikimedia.org/T413482 (phaultfinder) NEW [15:40:35] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2100.mgmt:22 - https://phabricator.wikimedia.org/T413484 (phaultfinder) NEW [15:40:39] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2102.mgmt:22 - https://phabricator.wikimedia.org/T413485 (phaultfinder) NEW [15:40:43] ops-codfw, DC-Ops: Unresponsive management for db2162.mgmt:22 - https://phabricator.wikimedia.org/T413487 (phaultfinder) NEW [15:40:47] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2029.mgmt:22 - https://phabricator.wikimedia.org/T413488 (phaultfinder) NEW [15:40:51] ops-codfw, DC-Ops: Unresponsive management for restbase2024.mgmt:22 - https://phabricator.wikimedia.org/T413486 (phaultfinder) NEW [15:40:55] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2094.mgmt:22 - 
https://phabricator.wikimedia.org/T413489 (phaultfinder) NEW [15:41:23] ops-codfw, DC-Ops: Unresponsive management for aqs2008.mgmt:22 - https://phabricator.wikimedia.org/T413490 (phaultfinder) NEW [15:41:25] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2007.mgmt:22 - https://phabricator.wikimedia.org/T413491 (phaultfinder) NEW [15:41:25] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2010.mgmt:22 - https://phabricator.wikimedia.org/T413492 (phaultfinder) NEW [15:41:26] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2281.mgmt:22 - https://phabricator.wikimedia.org/T413493 (phaultfinder) NEW [15:41:27] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2093.mgmt:22 - https://phabricator.wikimedia.org/T413494 (phaultfinder) NEW [15:41:28] ops-codfw, DC-Ops: Unresponsive management for aqs2005.mgmt:22 - https://phabricator.wikimedia.org/T413495 (phaultfinder) NEW [15:41:30] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2105.mgmt:22 - https://phabricator.wikimedia.org/T413498 (phaultfinder) NEW [15:41:33] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2103.mgmt:22 - https://phabricator.wikimedia.org/T413497 (phaultfinder) NEW [15:41:37] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2283.mgmt:22 - https://phabricator.wikimedia.org/T413496 (phaultfinder) NEW [15:41:41] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2009.mgmt:22 - https://phabricator.wikimedia.org/T413501 (phaultfinder) NEW [15:41:45] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2008.mgmt:22 - https://phabricator.wikimedia.org/T413499 (phaultfinder) NEW [15:41:49] ops-codfw, DC-Ops: Unresponsive management for rdb2008.mgmt:22 - https://phabricator.wikimedia.org/T413500 (phaultfinder) NEW [15:42:30] ops-codfw, DC-Ops: Unresponsive management for 
wikikube-worker2099.mgmt:22 - https://phabricator.wikimedia.org/T413503 (phaultfinder) NEW [15:42:30] ops-codfw, DC-Ops: Unresponsive management for ms-be2082.mgmt:22 - https://phabricator.wikimedia.org/T413502 (phaultfinder) NEW [15:42:31] ops-codfw, DC-Ops: Unresponsive management for wcqs2001.mgmt:22 - https://phabricator.wikimedia.org/T413505 (phaultfinder) NEW [15:42:32] ops-codfw, DC-Ops: Unresponsive management for wikikube-worker2282.mgmt:22 - https://phabricator.wikimedia.org/T413504 (phaultfinder) NEW [15:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [15:51:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:56:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:57:19] (PS2) Pppery: Replace backtick operator with shell_exec [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1218364 [15:58:02] (CR) Aklapper: [V:+2] Replace backtick operator with shell_exec [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1218364 (owner: Pppery) [15:58:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:03:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:09:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:10:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:19:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:20:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:21:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:26:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:32:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:37:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:46:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:51:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:56:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:22:18] SRE-swift-storage, Commons: Commons file not found - https://phabricator.wikimedia.org/T413507#11484718 (A_smart_kitten)
[17:38:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:42:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:43:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:44:41] !incidents [17:44:41] 7236 (ACKED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [17:44:41] 7235 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet eqsin) [17:47:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:48:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:51:14] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 
6/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:51:16] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:53:14] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:53:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:55:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:00:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:10:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:11:08] !incidents [18:11:08] 7237 (ACKED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [18:11:08] 7236 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [18:11:08] 7235 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet eqsin) [18:25:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:05:24] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [19:05:24] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [19:35:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:35:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:36:07] !incidents [19:36:07] 7238 (UNACKED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [19:36:08] 7237 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [19:36:08] 7236 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [19:36:08] 7235 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet eqsin) [19:36:12] !ack 7238 [19:36:12] 7238 (ACKED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [19:41:46] FIRING: [2x] ErrorBudgetBurn: 
xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:41:56] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:42:03] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:46:45] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [20:03:29] !incidents [20:03:30] 7238 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [20:03:30] 7237 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [20:03:30] 7236 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet esams) [20:03:30] 7235 (RESOLVED) ATSBackendErrorsHigh cache_text sre (rest-gateway-ro.discovery.wmnet eqsin) [20:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate 
config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:16:22] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:20:48] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:21:00] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:26:26] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:26:54] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:29:02] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:29:56] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:32:44] PROBLEM - HTTPS-wikitech-static on 
wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:33:18] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:35:02] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:35:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:38:58] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:39:00] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:40:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from rest-gateway-ro.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=rest-gateway-ro.discovery.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:43:08] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:43:08] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:51:06] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:51:06] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:53:40] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:53:40] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:55:54] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [20:55:54] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires 
in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:04:30] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:04:32] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:11:10] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:11:10] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:13:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:16:44] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:16:44] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:18:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:18:40] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:22:04] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:24:02] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [21:24:06] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate status.wikimedia.org valid until 2025-12-28 19:04:50 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Wikitech-static [22:47:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::5e5e:ab00:c3d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:52:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::5e5e:ab00:c3d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:53:50] PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100% 
[22:53:50] PROBLEM - Host doh7004 is DOWN: PING CRITICAL - Packet loss = 100% [22:54:28] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 137.03 ms [22:54:28] RECOVERY - Host doh7004 is UP: PING OK - Packet loss = 0%, RTA = 137.11 ms [22:58:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:28:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:40:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown