[07:50:10] <jinxer-wm>	 (GanetiMemoryPressure) firing: Ganeti: High memory usage on ganeti5004:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[09:50:10] <jinxer-wm>	 (GanetiMemoryPressure) firing: (2) Ganeti: High memory usage on ganeti5004:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[09:58:07] <XioNoX>	 moritzm: ^
[09:58:29] <slyngs>	 No, that's me
[09:58:39] <slyngs>	 Me who can't do math
[09:59:03] <slyngs>	 It's currently alerting because of "too low memory" usage, I got the alert wrong
[09:59:20] <slyngs>	 The updated version is deployed and should pick up soon
[09:59:52] <XioNoX>	 slyngs: the updated version changes the alert string to "I'm bored" ?
[10:00:10] <jinxer-wm>	 (GanetiMemoryPressure) firing: (4) Ganeti: High memory usage (96.64%) on ganeti3008:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[10:00:51] <slyngs>	 No, it changes it to that ^
[10:01:25] <XioNoX>	 ah, looks more of a problem indeed
[10:04:42] <topranks>	 been like that for a while it seems 
[10:04:45] <topranks>	 https://usercontent.irccloud-cdn.com/file/VrKEEZhH/image.png
[10:06:25] <volans>	 it's just buffered
[10:06:36] <topranks>	 I was literally just gonna ask what that means in this context 
[10:07:01] <volans>	 that the alert and dashboard are misleading
[10:07:02] <topranks>	 I'm guessing something to do with garbage collection?
[10:07:10] <volans>	 the host is using 16GB of RAM out of 128GB available
[10:07:13] <slyngs>	 volans:  :-)
[10:07:22] <topranks>	 ok
[10:07:26] <volans>	 $ free total        used        free      shared  buff/cache   available
[10:07:29] <volans>	 Mem:          128361       16717        4301           4      107342      110613
[10:07:36] <topranks>	 so the alert should not count the cache or buffered numbers 
[10:07:43] <volans>	 sorry for the bad alignment in the pasting
[10:07:43] <topranks>	 cool yeah makes sense 
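For reference, the arithmetic behind that `free` paste can be sketched in Python. The figures are the ones volans pasted (the unit, presumably MiB, is an assumption), and the function names are hypothetical, not anything taken from the actual alert rule.

```python
# Values from the `free` paste above (unit assumed to be MiB).
TOTAL = 128361
FREE = 4301
AVAILABLE = 110613


def naive_used_pct(total: float, free: float) -> float:
    """Count everything that is not 'free' as used, so page cache and buffers inflate the figure."""
    return (total - free) / total * 100


def available_based_used_pct(total: float, available: float) -> float:
    """Count only memory the kernel does not report as available for new workloads."""
    return (total - available) / total * 100


if __name__ == "__main__":
    print(f"naive:           {naive_used_pct(TOTAL, FREE):.2f}%")
    print(f"available-based: {available_based_used_pct(TOTAL, AVAILABLE):.2f}%")
```

With these inputs the naive figure lands in the mid-90s while the available-based one stays in the low teens, which is roughly the gap between the alert and the 16GB volans quotes.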
[10:11:06] <slyngs>	 I'll update the alerting, once I figure out how Prometheus expects you to do that calculation
[10:11:14] <topranks>	 volans: I will try to find it in my heart to forgive :)
[10:11:19] <topranks>	 slyngs I updated my dashboard there 
[10:11:22] <topranks>	 https://usercontent.irccloud-cdn.com/file/zLSOteyY/image.png
[10:11:32] <volans>	 lol :D
[10:11:56] <topranks>	 percentage calc is this:
[10:12:00] <topranks>	 ((node_memory_MemTotal_bytes{instance="$instance"} - ((node_memory_MemFree_bytes{instance="$instance"} + node_memory_Cached_bytes{instance="$instance"} + node_memory_Buffers_bytes{instance="$instance"})))  / node_memory_MemTotal_bytes{instance="$instance"}) * 100
[10:12:01] <slyngs>	 Right now I have a host with 142% memory usage, which is equally wrong
[10:12:21] <slyngs>	 Oh, I forgot cached
[10:13:01] <volans>	 what's exporting the data? (from which source)
[10:13:06] <topranks>	 used bytes = total_bytes - (free_bytes + cached_bytes + buffered_bytes)
[10:13:09] <volans>	 do you have "available"?
[10:13:30] <topranks>	 why would they make it that easy??
[10:14:04] <volans>	 really used is 'used' - ('buffers' + 'cached')
[10:14:09] <volans>	 really free is 'free' + ('buffers' + 'cached')
[10:14:25] <slyngs>	 Well.... It's cloud-scale, it's supposed to be weird
[10:14:29] <volans>	 but depends where you get the data from
[10:14:42] <volans>	 meminfo?
[10:14:44] <volans>	 free?
[10:17:35] <topranks>	 from the prometheus node exporter you get these 
[10:17:36] <topranks>	 https://phabricator.wikimedia.org/P54794
[10:17:58] <volans>	 ok seems /proc/meminfo at first sight
[10:18:11] <volans>	 you can probably do node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes
[10:18:18] <topranks>	 yeah it's likely just that exported 
[10:18:22] <volans>	 see https://man7.org/linux/man-pages/man5/proc.5.html
[10:18:37] <volans>	 (search /proc/meminfo)
[10:19:12] <volans>	 but also, I think we already have those metrics graphed
[10:19:26] <volans>	 so let me take a step back... what are you trying to do that is not already done?
[10:19:41] <topranks>	 The issue with memfree is that it will give you "free" memory, but it includes buffered & file cached in "used"
[10:19:55] <topranks>	 volans: I'm trying to do nothing really 
[10:20:08] <topranks>	 I was just sharing the calculation that I think is needed for the alert with Simon 
[10:20:26] <volans>	 what's the current alert doing?
[10:21:05] <volans>	 current as in icinga-based
[10:21:07] <topranks>	 from what I can tell it's *not* calculating the right value for used memory, it's including the buffer (and possibly the cache)
[10:22:08] <slyngs>	 volans: the node_memory_MemAvailable_bytes might be better actually..
[10:24:38] <topranks>	 yeah that seems to work 
[10:24:51] <topranks>	 (node_memory_MemTotal_bytes{instance="$instance"} - node_memory_MemAvailable_bytes{instance="$instance"}) / node_memory_MemTotal_bytes{instance="$instance"}
[10:25:57] <volans>	 or just available/total < X
[10:32:35] <slyngs>	 See, that's why volans is the smart one. Don't bother doing complex stuff, reverse the calculation :-)
[10:33:26] <topranks>	 ah no that doesn't work for my brain 
[10:33:53] <topranks>	 somehow an alert for "5% free memory" doesn't ring the alarm like "95% memory used"!
[10:35:11] <slyngs>	 Oh, yeah, no I didn't change it, I just think it's clever :-)
[10:36:05] <topranks>	 well tbh I was over-complicating everything :)
[10:45:19] <volans>	 rotfl
[11:20:10] <jinxer-wm>	 (GanetiMemoryPressure) firing: (3) Ganeti: High memory usage (96.7%) on ganeti3008:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure  - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[11:22:30] <slyngs>	 ^Just waiting for alert manager to pick up the new query/rule
[11:25:10] <jinxer-wm>	 (GanetiMemoryPressure) resolved: Ganeti: High memory usage (99.46%) on ganeti6002:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=drmrs - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[11:39:34] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:44:08] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[13:47:19] <wikibugs>	 netops, Infrastructure-Foundations, cloud-services-team: Remove cloud-support1-c-eqiad VLAN - https://phabricator.wikimedia.org/T355115 (taavi) Open→Resolved a:taavi
[13:47:58] <wikibugs>	 netops, Infrastructure-Foundations, cloud-services-team: Remove cloud-support1-c-eqiad VLAN - https://phabricator.wikimedia.org/T355115 (taavi)
[14:19:27] <volans>	 topranks, slyngs: just to be picky about this morning's chat: depending on how Prometheus optimizes the query, using available/total might still be better, to avoid querying total twice. And if you don't want to invert the logic you can just do 1 - available/total ;)
[14:19:52] <topranks>	 volans: oh yeah 100% 
[14:20:23] <topranks>	 I would reverse it with the 1- alright, but that's a matter of taste 
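A small sketch of the equivalence discussed above, in Python: `(total - available) / total` and `1 - available / total` are the same quantity, and a "used above 95%" threshold fires in the same cases as "available below 5%". The sample values and the 0.95 threshold are illustrative assumptions, not the deployed rule, and the function names are hypothetical.

```python
import math


def used_fraction_total_twice(total: float, available: float) -> float:
    """Form from the morning chat: references 'total' twice."""
    return (total - available) / total


def used_fraction_total_once(total: float, available: float) -> float:
    """volans' suggestion: 1 - available/total, referencing 'total' once."""
    return 1 - available / total


def fires_on_used(total: float, available: float, threshold: float = 0.95) -> bool:
    """Alert phrased as 'more than 95% used' (threshold is an assumed example value)."""
    return used_fraction_total_once(total, available) > threshold


def fires_on_available(total: float, available: float, threshold: float = 0.95) -> bool:
    """Same alert phrased the inverted way: 'less than 5% available'."""
    return available / total < 1 - threshold


if __name__ == "__main__":
    # (total, available) pairs; the first is from the `free` paste, the rest are made up.
    samples = [(128361, 110613), (1000, 40), (1000, 60)]
    for total, available in samples:
        a = used_fraction_total_twice(total, available)
        b = used_fraction_total_once(total, available)
        print(total, available, math.isclose(a, b), fires_on_used(total, available))
```

Which phrasing ends up in the rule is, as topranks says, a matter of taste; the two only differ in how the threshold reads to a human.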
[14:28:50] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (ABran-WMF) @BTullis I've merged this patch and stumbled upon:  ` Jan 17 14:24:55 dborch1001 orchestrator[981119]: ReadTopologyInstance(dbstore1008.eqiad.wmnet:33...
[14:29:43] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (ABran-WMF) p:High→Medium
[15:03:43] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (BTullis) >>! In T355157#9465806, @ABran-WMF wrote: > @BTullis I've merged this patch and stumbled upon: >  > ` > Jan 17 14:24:55 dborch1001 orchestrator[981119]:...
[15:04:10] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (Marostegui) All sections should have orchestrator grants
[15:13:51] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (BTullis) >>! In T355157#9465960, @Marostegui wrote: > All sections should have orchestrator grants  Apologies for being vague. Yes, all sections on both of these...
[15:20:33] <wikibugs>	 SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (Marostegui) That should be fine, I can see both hosts in orchestrator fine. Not sure if something was done.   Also please add root grants from cumin1002 cc @ABra...
[15:26:17] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[16:47:40] <hnowlan>	 hello! I am trying to reimage a host and it's failing to pxe at all - tcpdumping on the install server shows absolutely no traffic from the new host whatsoever. Any suggestions for where I should look next? 
[17:04:29] <volans>	 hnowlan: check nic firmware version compatibility with the OS you want to install with dcops
[17:04:42] <volans>	 (not sure if it's mentioned in wikitech too)
[17:05:26] <hnowlan>	 volans: ah, thanks
[17:33:15] <wikibugs>	 netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275 (ayounsi) https://docs.netbox.dev/en/stable/configuration/miscellaneous/#enforce_global_unique `ENFORCE_GLOBAL_UNIQUE` - The default value for this parameter was changed from Fals...