[07:03:11] Hi, I forgot to mention this last week: it's a public holiday in Germany today (Reunification Day)
[07:16:58] pfischer: enjoy! :)
[09:23:36] Errand + early lunch
[10:54:35] lunch
[14:10:09] errand
[15:02:33] inflatador: triage meeting: https://meet.google.com/eki-rafx-cxi
[17:31:17] gehel: slo brainstorm https://meet.google.com/chj-uuwn-cev
[18:14:57] dinner
[18:35:55] ryankemper: I'm there
[18:36:13] gehel: ah i must have joined the wrong one, sec
[19:06:09] ryankemper / ebernhardson: I'm having drinks with friends tomorrow evening, so I won't be there for our usual pairing session
[19:06:26] ack
[19:12:00] ebernhardson: I'm looking at `(CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1066-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient`
[19:12:21] We have this documentation on https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - is that what this is, or would that fire an alert about the old pool as opposed to the young?
[19:14:43] In any case I'll drain `1066` now and see if the alert stops complaining after a restart
[19:25:52] ryankemper: hmm, i suppose the runbook is now a bit incorrect, it's not really old GC hell anymore, it's plain old memory pressure
[19:26:29] ryankemper: mostly i'd restart the instance. i was curious if we could have it auto-restart as long as the uptime of the service is more than N days. vo.lans said it wouldn't be super straightforward to accomplish though
[19:27:55] "enabling cook.book runs "remotely" as an auto-remediation method is something in the TODO list. What can totally be done right now is the other way around, having either a daemon or systemd timer that runs a script (using spicerack) or a cook.book that checks either alertmanager or directly some metric on prometheus and then does what"
[19:28:32] if they are frequent though we probably need to give the instances more memory
[19:28:35] maybe +1GB
[19:30:07] the JVM heap - young pool graph there does look like it's running out of memory, old pool steadily increasing while young pool gets less and less space to work with
[19:30:42] it's not clear where it would flatline though, if more memory would be sufficient
[19:38:08] Just realized I banned it with transient but should probably switch to permanent now
[19:39:41] * ebernhardson wonders where all the memory goes... it's not captured in the various graphs we report there. Maybe could take a snapshot of localhost:9200/_nodes/_local/stats/jvm each time an instance gets into memory trouble to compare later, but most of that is probably already in prometheus
[19:48:40] poking at values suffixed _bytes from node stats... nothing particularly suspicious. buffer pool used bytes seems a bit high, but i think that's off-heap memory
[19:49:15] (by a bit high, i mean 1066 had the second-highest value in the cluster, per max(elasticsearch_jvm_buffer_pool_used_bytes{exported_cluster="production-search-psi-eqiad"}) by (name))
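
A minimal sketch of the auto-remediation idea discussed above (systemd timer + script checking Prometheus, restart only if the service has been up more than N days, plus the 19:39:41 idea of snapshotting localhost:9200/_nodes/_local/stats/jvm for later comparison). The Prometheus URL, systemd unit name, snapshot directory and uptime threshold are placeholders, not values from the chat, and a real WMF deployment would more likely drive the restart through a spicerack cookbook than a direct systemctl call.

#!/usr/bin/env python3
"""Check-and-restart sketch for young-pool memory pressure (assumptions noted inline)."""
import json
import os
import subprocess
import time
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example:9090"   # placeholder, not the real WMF endpoint
ALERT_NAME = "CirrusSearchJVMGCYoungPoolInsufficient"
ES_STATS_URL = "http://localhost:9200/_nodes/_local/stats/jvm"
SERVICE_NAME = "elasticsearch.service"               # placeholder unit name
MIN_UPTIME_DAYS = 3                                  # the "more than N days" guard from the chat
SNAPSHOT_DIR = "/var/tmp/es-jvm-snapshots"           # placeholder


def get_json(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def alert_is_firing() -> bool:
    """Check Prometheus' built-in ALERTS metric for the firing alert.

    In practice you would also match this host's instance label so one
    node's alert does not trigger restarts everywhere.
    """
    query = f'ALERTS{{alertname="{ALERT_NAME}", alertstate="firing"}}'
    url = f"{PROMETHEUS_URL}/api/v1/query?query={urllib.parse.quote(query)}"
    result = get_json(url)
    return bool(result.get("data", {}).get("result"))


def jvm_uptime_days(stats: dict) -> float:
    """Read jvm.uptime_in_millis from the single local node in the stats reply."""
    node = next(iter(stats["nodes"].values()))
    return node["jvm"]["uptime_in_millis"] / 1000 / 86400


def snapshot(stats: dict) -> None:
    """Keep a timestamped copy of the JVM stats to compare instances later."""
    os.makedirs(SNAPSHOT_DIR, exist_ok=True)
    path = f"{SNAPSHOT_DIR}/jvm-{int(time.time())}.json"
    with open(path, "w") as f:
        json.dump(stats, f, indent=2)


def main() -> None:
    if not alert_is_firing():
        return
    stats = get_json(ES_STATS_URL)
    snapshot(stats)
    if jvm_uptime_days(stats) < MIN_UPTIME_DAYS:
        # Recently restarted and already under pressure again: another restart
        # likely won't help, so leave it for a human (e.g. the +1GB heap bump).
        return
    subprocess.run(["systemctl", "restart", SERVICE_NAME], check=True)


if __name__ == "__main__":
    main()

The uptime guard encodes the 19:26:29 condition: an instance that hits memory pressure soon after a restart is a signal that the heap is simply too small, which is the "+1GB" case rather than an auto-restart case.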