[07:03:11] Hi, I forgot to mention this last week: it's a public holiday in Germany today (Reunification Day)
[07:16:58] pfischer: enjoy! :)
[09:23:36] Errand + early lunch
[10:54:35] lunch
[14:10:09] errand
[15:02:33] inflatador: triage meeting: https://meet.google.com/eki-rafx-cxi
[17:31:17] gehel: slo brainstorm https://meet.google.com/chj-uuwn-cev
[18:14:57] dinner
[18:35:55] ryankemper: I'm there
[18:36:13] gehel: ah i must have joined the wrong one, sec
[19:06:09] ryankemper / ebernhardson: I'm having drinks with friends tomorrow evening, so I won't be there for our usual pairing session
[19:06:26] ack
[19:12:00] ebernhardson: I'm looking at `(CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1066-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient`
[19:12:21] We have this documentation on https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - is that what this is, or would that fire an alert about the old pool as opposed to the young?
[19:14:43] In any case I'll drain `1066` now and see if the alert stops complaining after a restart
[19:25:52] ryankemper: hmm, i suppose the runbook is now a bit incorrect, it's not really old GC hell anymore, it's plain old memory pressure
[19:26:29] ryankemper: mostly i'd restart the instance. i was curious if we could have it auto-restart as long as the uptime of the service is more than N days. vo.lans said it wouldn't be super straightforward to accomplish though
[19:27:55] "enabling cook.book runs "remotely" as an auto-remediation method is something in the TODO list. What can totally be done right now is the other way around, having either a daemon or systemd timer that runs a script (using spicerack) or a cook.book that checks either alertmanager or directly some metric on prometheus and then does what"
[19:28:32] if they are frequent though we probably need to give the instances more memory
[19:28:35] maybe +1GB
[19:30:07] the JVM heap - young pool graph there does look like it's running out of memory, old pool steadily increasing while young pool gets less and less space to work with
[19:30:42] it's not clear where it would flatline though, if more memory would be sufficient
[19:38:08] Just realized I banned it with transient but should probably switch to permanent now
[19:39:41] * ebernhardson wonders where all the memory goes... it's not captured in the various graphs we report there. Maybe could take a snapshot of localhost:9200/_nodes/_local/stats/jvm each time an instance gets into memory trouble to compare later, but most of that is probably already in prometheus
[19:48:40] poking at values suffixed _bytes from node stats... nothing particularly suspicious. buffer pool used bytes seems a bit high, but i think that's off-heap memory
[19:49:15] (by a bit high, i mean 1066 had the second-highest value in the cluster, per max(elasticsearch_jvm_buffer_pool_used_bytes{exported_cluster="production-search-psi-eqiad"}) by (name))
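
A minimal sketch of the auto-remediation idea discussed above (systemd timer + script checking Prometheus, restart only if the service has been up more than N days, plus the 19:39:41 idea of snapshotting localhost:9200/_nodes/_local/stats/jvm for later comparison). The Prometheus URL, systemd unit name, snapshot directory and uptime threshold are placeholders, not values from the chat, and a real WMF deployment would more likely drive the restart through a spicerack cookbook than a direct systemctl call.

#!/usr/bin/env python3
"""Check-and-restart sketch for young-pool memory pressure (assumptions noted inline)."""
import json
import os
import subprocess
import time
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example:9090"   # placeholder, not the real WMF endpoint
ALERT_NAME = "CirrusSearchJVMGCYoungPoolInsufficient"
ES_STATS_URL = "http://localhost:9200/_nodes/_local/stats/jvm"
SERVICE_NAME = "elasticsearch.service"               # placeholder unit name
MIN_UPTIME_DAYS = 3                                  # the "more than N days" guard from the chat
SNAPSHOT_DIR = "/var/tmp/es-jvm-snapshots"           # placeholder


def get_json(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def alert_is_firing() -> bool:
    """Check Prometheus' built-in ALERTS metric for the firing alert.

    In practice you would also match this host's instance label so one
    node's alert does not trigger restarts everywhere.
    """
    query = f'ALERTS{{alertname="{ALERT_NAME}", alertstate="firing"}}'
    url = f"{PROMETHEUS_URL}/api/v1/query?query={urllib.parse.quote(query)}"
    result = get_json(url)
    return bool(result.get("data", {}).get("result"))


def jvm_uptime_days(stats: dict) -> float:
    """Read jvm.uptime_in_millis from the single local node in the stats reply."""
    node = next(iter(stats["nodes"].values()))
    return node["jvm"]["uptime_in_millis"] / 1000 / 86400


def snapshot(stats: dict) -> None:
    """Keep a timestamped copy of the JVM stats to compare instances later."""
    os.makedirs(SNAPSHOT_DIR, exist_ok=True)
    path = f"{SNAPSHOT_DIR}/jvm-{int(time.time())}.json"
    with open(path, "w") as f:
        json.dump(stats, f, indent=2)


def main() -> None:
    if not alert_is_firing():
        return
    stats = get_json(ES_STATS_URL)
    snapshot(stats)
    if jvm_uptime_days(stats) < MIN_UPTIME_DAYS:
        # Recently restarted and already under pressure again: another restart
        # likely won't help, so leave it for a human (e.g. the +1GB heap bump).
        return
    subprocess.run(["systemctl", "restart", SERVICE_NAME], check=True)


if __name__ == "__main__":
    main()

The uptime guard encodes the 19:26:29 condition: an instance that hits memory pressure soon after a restart is a signal that the heap is simply too small, which is the "+1GB" case rather than an auto-restart case.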