[07:50:10] (GanetiMemoryPressure) firing: Ganeti: High memory usage on ganeti5004:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[09:50:10] (GanetiMemoryPressure) firing: (2) Ganeti: High memory usage on ganeti5004:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[09:58:07] moritzm: ^
[09:58:29] No, that's me
[09:58:39] Me who can't do math
[09:59:03] It's currently alerting because of "too low memory" usage, I got the alert wrong
[09:59:20] The updated version is deployed and should pick up soon
[09:59:52] slyngs: the updated version changes the alert string to "I'm bored"?
[10:00:10] (GanetiMemoryPressure) firing: (4) Ganeti: High memory usage (96.64%) on ganeti3008:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[10:00:51] No, it changes it to that ^
[10:01:25] ah, looks more of a problem indeed
[10:04:42] been like that for a while it seems
[10:04:45] https://usercontent.irccloud-cdn.com/file/VrKEEZhH/image.png
[10:06:25] it's just buffered
[10:06:36] I was literally just gonna ask what that means in this context
[10:07:01] that the alert and dashboard are misleading
[10:07:02] I'm guessing something to do with garbage collection?
[10:07:10] the host is using 16GB of RAM out of 128 available
[10:07:13] volans: :-)
[10:07:22] ok
[10:07:26] $ free total used free shared buff/cache available
[10:07:29] Mem: 128361 16717 4301 4 107342 110613
[10:07:36] so the alert should not count the cache or buffered numbers
[10:07:43] sorry for the bad alignment in the pasting
[10:07:43] cool yeah makes sense
[10:11:06] I'll update the alerting, once I figure out how Prometheus expects you to do that calculation
[10:11:14] volans: I will try to find it in my heart to forgive :)
[10:11:19] slyngs I updated my dashboard there
[10:11:22] https://usercontent.irccloud-cdn.com/file/zLSOteyY/image.png
[10:11:32] lol :D
[10:11:56] percentage calc is this:
[10:12:00] ((node_memory_MemTotal_bytes{instance="$instance"} - ((node_memory_MemFree_bytes{instance="$instance"} + node_memory_Cached_bytes{instance="$instance"} + node_memory_Buffers_bytes{instance="$instance"}))) / node_memory_MemTotal_bytes{instance="$instance"}) * 100
[10:12:01] Right now I have a host with 142% memory usage, which is equally wrong
[10:12:21] Oh, I forgot cached
[10:13:01] what's exporting the data? (from which source)
[10:13:06] used bytes = total_bytes - (free_bytes + cached_bytes + buffered_bytes)
[10:13:09] do you have "available"?
[10:13:30] why would they make it that easy??
[10:14:04] really used is 'used' - ('buffers' + 'cached')
[10:14:09] really free is 'free' + ('buffers' + 'cached')
[10:14:25] Well.... It's cloud-scale, it's supposed to be weird
[10:14:29] but depends where you get the data from
[10:14:42] meminfo?
[10:14:44] free?
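A rough worked example of the arithmetic discussed above, using the numbers from the `free` paste. It is only a sketch: the units are assumed to be MiB, and the `buff/cache` column is used as a stand-in for cached + buffers, which may not match the node exporter's `Cached` and `Buffers` metrics exactly. The helper name `pct` is made up for illustration.

```python
# Values copied from the `free` paste in the log (units assumed to be MiB).
total, used, free, shared, buff_cache, available = (
    128361, 16717, 4301, 4, 107342, 110613
)


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Naive view: everything that is not "free" counts as used,
# so page cache and buffers inflate the figure.
naive_used_pct = pct(total - free, total)

# View from the chat: used = total - (free + cached + buffered).
# buff_cache stands in for cached + buffered here (an assumption).
real_used_pct = pct(total - (free + buff_cache), total)

print(f"naive: {naive_used_pct}%  excluding cache/buffers: {real_used_pct}%")
```

With these numbers the naive figure comes out at about 96.65%, in the same range as the percentage in the firing alert, while excluding cache and buffers gives about 13.02%, roughly the 16GB out of 128 mentioned in the chat.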
[10:17:35] from the prometheus node exporter you get these
[10:17:36] https://phabricator.wikimedia.org/P54794
[10:17:58] ok seems /proc/meminfo at first sight
[10:18:11] you can probably do node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes
[10:18:18] yeah it's likely just that exported
[10:18:22] see https://man7.org/linux/man-pages/man5/proc.5.html
[10:18:37] (search /proc/meminfo)
[10:19:12] but also, I think we already have those metrics graphed
[10:19:26] so let me take a step back... what are you trying to do that is not already done?
[10:19:41] The issue with memfree is that it will give you "free" memory, but it includes buffered & file cached in "used"
[10:19:55] volans: I'm trying to do nothing really
[10:20:08] I was just sharing the calculation that I think is needed for the alert with Simon
[10:20:26] what's the current alert doing?
[10:21:05] current as in icinga-based
[10:21:07] from what I can tell it's *not* calculating the right value for used memory, it's including the buffer (and possibly the cache)
[10:22:08] volans: the node_memory_MemAvailable_bytes might be better actually..
[10:24:38] yeah that seems to work
[10:24:51] (node_memory_MemTotal_bytes{instance="$instance"} - node_memory_MemAvailable_bytes{instance="$instance"}) / node_memory_MemTotal_bytes{instance="$instance"}
[10:25:57] or just available/total < X
[10:32:35] See, that's why volans is the smart one. Don't bother doing complex stuff, reverse the calculation :-)
[10:33:26] ah no that doesn't work for my brain
[10:33:53] somehow an alert for "5% free memory" doesn't ring the alarm like "95% memory used"!
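The two formulations debated above (used fraction above a threshold versus available/total below the complementary threshold) can be compared with a small sketch. The 0.95 threshold and the function names are hypothetical, chosen only to mirror the "95% memory used" wording; the real alert rule is not shown in the log. The total and available values come from the `free` paste earlier, assumed to be MiB.

```python
# From the `free` paste earlier in the log (units assumed to be MiB).
total, available = 128361, 110613

# Used fraction as in the (total - available) / total query.
used_fraction = (total - available) / total

# Same quantity written as 1 - available/total.
inverted = 1 - available / total

THRESHOLD = 0.95  # hypothetical alert threshold, not taken from the real rule


def fires_on_used(total, available, threshold=THRESHOLD):
    """Alert phrased as 'memory used above threshold'."""
    return (total - available) / total > threshold


def fires_on_available(total, available, threshold=THRESHOLD):
    """Alert phrased as 'available memory below 1 - threshold'."""
    return available / total < 1 - threshold


print(f"used: {used_fraction:.2%}")
print(fires_on_used(total, available), fires_on_available(total, available))
```

For this host both forms agree that nothing should fire, since only about 13.83% of memory is in use once available memory is taken into account. Which form to alert on is a matter of taste, as the chat concludes.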
[10:35:11] Oh, yeah, no I didn't change it, I just think it's clever :-)
[10:36:05] well tbh I was over-complicating everything :)
[10:45:19] rotfl
[11:20:10] (GanetiMemoryPressure) firing: (3) Ganeti: High memory usage (96.7%) on ganeti3008:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[11:22:30] ^Just waiting for alert manager to pick up the new query/rule
[11:25:10] (GanetiMemoryPressure) resolved: Ganeti: High memory usage (99.46%) on ganeti6002:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=drmrs - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[11:39:34] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:44:08] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[13:47:19] netops, Infrastructure-Foundations, cloud-services-team: Remove cloud-support1-c-eqiad VLAN - https://phabricator.wikimedia.org/T355115 (taavi) Open→Resolved a:taavi
[13:47:58] netops, Infrastructure-Foundations, cloud-services-team: Remove cloud-support1-c-eqiad VLAN - https://phabricator.wikimedia.org/T355115 (taavi)
[14:19:27] topranks, slyngs: just to be picky about this morning's chat: depending on how Prometheus optimizes the query, using available/total might still be better, to avoid querying total twice. And if you don't want to invert the logic you can just do 1 - available/total ;)
[14:19:52] volans: oh yeah 100%
[14:20:23] I would reverse it with the 1- alright, but that's a matter of taste
[14:28:50] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (ABran-WMF) @BTullis I've merged this patch and stumbled upon: ` Jan 17 14:24:55 dborch1001 orchestrator[981119]: ReadTopologyInstance(dbstore1008.eqiad.wmnet:33...
[14:29:43] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (ABran-WMF) p:High→Medium
[15:03:43] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (BTullis) >>! In T355157#9465806, @ABran-WMF wrote: > @BTullis I've merged this patch and stumbled upon: > > ` > Jan 17 14:24:55 dborch1001 orchestrator[981119]:...
[15:04:10] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (Marostegui) All sections should have orchestrator grants
[15:13:51] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (BTullis) >>! In T355157#9465960, @Marostegui wrote: > All sections should have orchestrator grants Apologies for being vague. Yes, all sections on both of these...
[15:20:33] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate MariaDB PKI - https://phabricator.wikimedia.org/T355157 (Marostegui) That should be fine, I can see both hosts in orchestrator fine. Not sure if something was done. Also please add root grants from cumin1002 cc @ABra...
[15:26:17] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[16:47:40] hello! I am trying to reimage a host and it's failing to pxe at all - tcpdumping on the install server shows absolutely no traffic from the new host whatsoever. Any suggestions for where I should look next?
[17:04:29] hnowlan: check nic firmware version compatibility with the OS you want to install with dcops
[17:04:42] (not sure if it's mentioned in wikitech too)
[17:05:26] volans: ah, thanks
[17:33:15] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.7.x - https://phabricator.wikimedia.org/T336275 (ayounsi) https://docs.netbox.dev/en/stable/configuration/miscellaneous/#enforce_global_unique `ENFORCE_GLOBAL_UNIQUE` - The default value for this parameter was changed from Fals...