[08:28:57] marostegui: let me know if I should do it. Also we can run it in eqiad replicas when they are depooled
[08:29:33] yeah, we can plan for it once the DC switch is done
[08:29:41] eqiad will only be depooled for a week
[15:09:14] i noticed https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=clouddb1019&service=MariaDB+memory via alerts.wikimedia.org. i was thinking maybe we ought to lower the innodb_buffer_pool values a little in clouddb1019.yaml. it's definitely always cutting it close to the 95% warning threshold, even when queries aren't really brutal. is that okay in this context?
[15:10:12] i realize the mariadb docs suggest a 70% buffer pool, but i see we (probably?) set it higher usually, i'm assuming to avoid disk paging as much as possible. but here these feel like false positives at the moment (assuming we don't have memory leaks in mariadb somewhere)
[15:10:53] grafana: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1021&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m
[15:24:08] oops, meant https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1019&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m (clouddb1019)
[15:46:11] (er, innodb_buffer_pool_size, too, for the var name)
[15:51:54] there is no perfect answer, the 70% is general advice, but it will depend on the resources and the type of traffic
[15:52:34] what you have to maximize is the amount of data cached while allowing memory for per-connection execution (which for clouddbs will be different from production, as the patterns are different)
[15:53:12] we generally use 512GB machines, so it doesn't necessarily scale linearly
[15:53:41] maybe for cloud db the buffer pool should be smaller, as it has more user activity and longer-running, high-memory-consuming queries?
[15:54:34] in reality the way to decide is trial and error - the 95% is a useful thing to see if there is overcommitment on memory, but what really should be used to decide is the amount of swapping happening
[16:15:18] Amir1: re: candidate migration - think that we can move one host from one section to another, or from misc to mw, etc. I personally don't like giving more work to DCops if it can be avoided
[16:16:16] yeah, we can keep that in mind too. swap hosts
[16:19:32] think it is not just physically moving them, it is retagging, changing network switch, ip, etc
[16:19:52] it has been done in the past, but only when there was no other option
[16:21:36] thanks jynus. so i think in terms of monitoring, then, it would be chronic memory saturation and disk metrics to watch. i do see the s4 and s6 wikis on clouddb1019 are probably high-density targets (commonswiki, frwiki, jawiki, ruwiki, labswiki) - e.g., the image table is at 73m rows, so even simple table scans are big - presumably not memory intensive, but of course other stuff might be and could need to page easily.
[16:23:42] dr0ptp4kt my suggestion is if you see a host in warning, restart it, and if it recurs it may be a configuration issue or a memory leak
[16:31:18] Amir1 any thoughts on this for innodb_buffer_pool_size for the wiki replicas, in addition to what jynus said? i was thinking i could submit a patch, but i'd necessarily need your help bouncing the thing... also happy to put in for ops membership and do a bounce myself, obviously under your supervision. i realize that's a good candidate for a sudo entry probably later on.
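(Aside on the trial-and-error sizing jynus describes above: a minimal sketch of the kind of check that implies, run from the MariaDB prompt on the host in question. This is illustrative only, not a command from the log, and the occupancy reading is an assumption about how to interpret the counters rather than project guidance.)

    -- Illustrative sketch: how full is the configured buffer pool?
    SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';    -- configured size, in bytes
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';     -- pages_total, pages_free, pages_data
    -- Rough occupancy: (pages_total - pages_free) / pages_total.
    -- If the pool is far from full while the host sits near the 95% memory alert,
    -- per-connection buffers and temp tables are the likelier pressure, and the
    -- signal to act on is sustained swapping (the host-overview memory/swap panels),
    -- not the 95% threshold itself.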
[16:31:40] (not suggesting a bounce on a friday, to be clear!)
[16:33:35] I think there should be some space left aside in memory for temp tables. I'm not sure, but that was my impression of what the reason for the 70% was
[16:34:08] I don't think reducing the buffer cache could cause issues in clouddb, it's a clouddb and by nature slow
[16:34:47] generally speaking, better to put this in a ticket
[16:35:51] 😸 - got it, cool, will do
[17:15:12] dr0ptp4kt: This is a weird situation as we do not own the hosts/service. Ideally we want to have as much cached tables/data as we can. Changing the buffer pool, although unlikely unless it is a big change, might have some performance implications for those running their tools there. I'd suggest you talk to WMCS about it. Meanwhile, if they consider that warning something that needs immediate action, they can always depool the host and issue a MariaDB restart for now (but it will come up again, it is just a matter of time). I simply don't want to be touching things now that the responsibilities are so unclear
[17:22:06] Thanks marostegui, thanks all, appreciate it!
[17:22:11] * dr0ptp4kt Have a good weekend!
[17:22:23] Have a good weekend!
[17:22:35] you too!
[19:55:08] dr0ptp4kt: fyi, your meta user page has flastname@wikimedia.org as your email
[20:01:58] RhinosF1: ha, yes, trying to make customer success people emailing me about their tool du jour work for it... although it is a losing battle. do you think i should update it to the real thing?
[20:02:35] dr0ptp4kt: I would just use the real thing
[21:24:09] RhinosF1: done, thanks - have a good weekend.
[21:27:08] :)
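(Follow-up sketch for the 16:31 patch/bounce question, illustrative only: assuming the host runs MariaDB 10.2 or later, where innodb_buffer_pool_size is a dynamic variable, a depooled replica could be resized online to test a value before anything is persisted in clouddb1019.yaml. The 96 GiB figure below is a placeholder, not a recommendation from this discussion.)

    -- Illustrative sketch: online resize on a depooled replica (MariaDB 10.2+).
    -- The value is rounded to a multiple of the buffer pool chunk size.
    SET GLOBAL innodb_buffer_pool_size = 96 * 1024 * 1024 * 1024;   -- placeholder: 96 GiB
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_resize_status';     -- watch the resize complete
    -- SET GLOBAL is not persistent; the value in clouddb1019.yaml still needs a
    -- normal puppet patch (and restart) to stick.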