[00:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:20:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:21:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:40:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1221104 [00:40:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1221104 (owner: 10TrainBranchBot) [00:44:44] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 75%, RTA = 5579.10 ms [00:45:10] RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 0%, RTA = 942.22 ms [00:46:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:52:25] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1221104 (owner: 10TrainBranchBot) [01:00:40] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:10:02] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:10:18] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1221105 [01:10:18] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1221105 (owner: 10TrainBranchBot) [01:11:02] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:24:51] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 24m 10s) [01:27:54] PROBLEM - Host ncredir7003 is DOWN: CRITICAL - Time to live exceeded (10.140.2.3) [01:28:08] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 137.02 ms [01:32:28] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1221105 (owner: 10TrainBranchBot) [01:48:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:03:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:04:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:24:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:25:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:26:43] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:31:12] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2917.26 ms [02:31:36] RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [02:35:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:23:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:34:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [03:49:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:51:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:01:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:06:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:26:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:27:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:52:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:53:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:03:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:33:32] PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 8443.39 ms [05:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:22] RECOVERY - Host wikikube-worker1053 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [05:53:14] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 100% [05:54:10] RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [06:04:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:14:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:21:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:26:43] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:36:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:23:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:35:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:48:38] FIRING: GnmiTargetDown: lsw1-b6-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:10:26] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire