[19:06:11] Scap was having issues reaching mw2448.codfw.wmnet, timing out. My own attempt to ssh to it, as well as Grafana host overview board, confirm that the host is effectively down.
[19:06:22] I don't see any SAL entries or phab tickets about it in recent history.
[19:06:27] https://sal.toolforge.org/production?p=0&q=mw2448&d=
[19:06:33] something in February about an OS reimage
[19:06:57] PROBLEM - Host mw2448 is DOWN: PING CRITICAL - Packet loss = 100%
[19:07:02] that was yesterday
[19:07:26] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mw2448&var-datasource=thanos&var-cluster=appserver&from=now-2d&to=now
[19:13:26] it'd be really nice if we were to the point that a host being unexpectedly down for over a day automatically opened a ticket :)
[19:13:33] I'll file one right now though
[19:14:46] I'll just powercycle the thing
[19:14:54] used to be common :P
[19:15:49] (that appservers sometimes crash like this)
[19:19:18] server is booted back up now
[19:19:44] thanks mutante, anything in SEL?
[19:20:18] hmmm. "An OEM diagnostic event occurred."
[19:20:43] "A problem was detected related to the previous server boot."
[19:21:05] cdanis: multiple "diagnostic events" on April 9th.. but what does that even mean
[19:21:16] not how it looks when we have bad RAM
[19:21:38] ah ok, "CPU 2 machine check error detected." there we go
[19:21:50] I will paste it all on ticket
[19:25:40] host is pooled in etcd, so I ran a "scap pull" to get this to latest deployment
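
Editor's note on the 19:13:26 wish (auto-opening a ticket for a host that has been unexpectedly down for over a day): the selection logic could be sketched roughly as below. This is only an illustration, not anything that exists in the tooling discussed in the log. The function name `hosts_needing_ticket`, the `down_since` mapping, the `already_ticketed` set, and the one-day threshold constant are all assumptions made up for the example, and actually opening a ticket is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: the log only says "over a day".
DOWN_THRESHOLD = timedelta(days=1)


def hosts_needing_ticket(down_since, now, already_ticketed=frozenset()):
    """Return the sorted hostnames that have been down longer than
    DOWN_THRESHOLD and do not already have a ticket.

    down_since: dict mapping hostname -> datetime the host was first seen down
                (hypothetical input; how it is collected is out of scope here).
    now: the current time, passed in explicitly so the check is easy to test.
    already_ticketed: hostnames that already have an open ticket.
    """
    return sorted(
        host
        for host, since in down_since.items()
        if now - since > DOWN_THRESHOLD and host not in already_ticketed
    )


if __name__ == "__main__":
    # Illustrative timestamps only; they are not taken from the log.
    now = datetime(2000, 1, 3, 12, 0, tzinfo=timezone.utc)
    down_since = {
        "mw2448": now - timedelta(hours=30),           # down for over a day
        "hypothetical1001": now - timedelta(hours=2),  # made-up host, recently down
    }
    for host in hosts_needing_ticket(down_since, now):
        print(f"{host}: down for more than {DOWN_THRESHOLD}, would open a ticket")
```

Passing `now` and the set of already-ticketed hosts in as arguments keeps the check free of side effects, so it could be run repeatedly without filing duplicates.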