[00:10:35] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[02:50:03] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[04:10:35] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[06:50:03] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[06:51:00] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10737173 (ayounsi) Thanks! Based on this comment : https://github.com/openconfig/gnmic/issues/498#issuecomment-2263694440 I g...
[07:15:09] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (90.44%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[07:26:55] ^ that will recover soon, once https://phabricator.wikimedia.org/T391243 is done I'll rebalance ganeti/row B
[08:00:09] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (90.55%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[08:10:35] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[09:06:18] SRE-tools, collaboration-services, Infrastructure-Foundations, serviceops, and 2 others: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10737609 (ABran-WMF) This first iteration is still fairly manual but will give us a stepping stone to build upon....
[09:43:37] topranks: looks like the answer is 16
[09:43:43] https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&var-site=eqiad&from=now-3h&to=now&refresh=30s&viewPanel=110
[09:44:01] I thought the answer was 42 ;P
[09:44:30] hahaha, I should have went to that directly
[09:44:42] that's awesome..... what does this represent exactly?
[09:45:02] topranks: https://phabricator.wikimedia.org/T388641#10737173
[09:45:05] we've reduced the number of go routines by a factor of 100 ?
[09:45:19] yeah, by bumping the number of workers
[09:45:34] looks like the CPU load is still fine on the host
[09:45:39] sure yep I seen that, I'm just trying to work out what exactly the number of go routines represents
[09:45:44] that also reduced the memory usage
[09:45:45] or what the optimal amount is....
[09:45:55] yeah, that's the question
[09:45:57] at the end of the day if it works it works I'm happy with that
[09:46:23] at least we've a better understanding of what's happening, and some metrics (other than just 'failures') to tell us we're hitting limits
[09:46:46] last time just increasing the workers but not really understanding why it helped did not sit well with me
[09:47:00] given we still seemed to have idle gnmic threads running
[09:47:39] that comment gives a beginning of answer, but it's not enough for me to understand it properly https://github.com/openconfig/gnmic/issues/498#issuecomment-2265748750
[09:47:54] I mean that one https://github.com/openconfig/gnmic/issues/498#issuecomment-2263694440
[09:48:02] yeah I seen you linked that, I remember reading it before but it was above my head
[09:48:21] as in I didn't know if "more workers, less go routines" was better than "less workers, more go routines"
[09:48:35] ah yeah
[09:48:48] but I guess it's clear that we want less go routines here
[09:48:49] looks like everybody is just guessing a random number they're fine with :)
[09:48:57] yeah
[09:48:57] at the sites it's working well we've a few hundred, in eqiad 30k
[09:49:02] heh yeah
[09:49:13] at least we've a target now and can keep an eye on the go routines
[09:49:20] if it starts rising we can add workers?
[09:50:03] yeah, moritzm told me he could give us 2 extra vCPU for free to use on netflow1002 if we need them :)
[09:50:27] don't let him short change you - get 4!
[09:50:31] :)
[09:50:50] so the day we hit a CPU limit by raising the workers too much, we can have the extra capacity there
[09:50:57] yeah but that bit we can scale at least, the problem was it's hard to make a case if overall cpu is ok
[09:51:04] yep
[09:51:47] and we could consider bare metal, or kubernetes or something also even. but at least we understand our requirement is cpu cycles
[09:52:00] yeah
[09:52:23] there is a definite non-linearity there
[09:52:51] we increased the number of workers by 2x. but we decreased the go routines by 100x.
[09:53:09] yeah there is a threshold
[09:53:10] kind of looks like if there aren't enough workers something goes wrong and it spawns more and more go routines?
[09:53:24] increasing from 8 to 12 didn't change much, but 12 to 16 fixed it all
[09:53:33] heh
[09:53:34] yeah exactly, that's what I think
[09:53:39] I wonder if we should go up in powers of two
[09:54:36] but seems ok for now yeah, let's hope the other graphs start showing that as the day goes on..
[09:54:48] what would be useful maybe, is a metric on the number of workers being used
[09:54:49] do you have the link to fillipos graph? not finding it for some reason
[09:55:28] would it have showed that we were using the 8 workers 100% of the time?
[09:55:41] and now like 15 out of the 16 on average
[09:57:10] would what have showed that?
[09:57:41] the number of gnmic threads running at the os level has not changed I see
[09:58:00] the number of workers being used
[09:58:06] oh ok that's interesting
[09:58:15] and there are still 3-4 at 0% at any time
[09:58:34] interestingly the busy ones seem to be peaking at like 50-60% now, whereas before they hit 80/90/100 even
[09:58:44] this is just anecdotal looking at htop
[09:58:58] topranks: https://w.wiki/Dkgc
[09:59:12] but yeah seems like "num_workers" in the output doesn't relate to process threads
[09:59:52] so my previous theory that the fact we had idle threads meant increasing workers wouldn't help was wrong
[10:00:14] not that I was sure that was the case but was one thought
[10:00:20] it's weird to have idle threads though
[10:00:27] plus the 8 process threads were not because num_workers=8
[10:00:53] it's mostly ok I think, netflow1001 has 4 CPU cores so if it's happy with 4 threads that's probably ok
[10:01:23] seems clear to my untrained eyes there is scheduling/stuff happening at the go level the os-view doesn't tell us
[10:02:13] XioNoX: ok wow that graph tells the story right there
[10:11:05] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10737894 (ayounsi) Bumping to 12 didn't have the expected result, but to 16 seems to have solved the immediate issue. {F59060...
[10:41:34] * elukey lunch
[10:42:14] CAS-SSO, Bitu, Infrastructure-Foundations, Phabricator, Patch-For-Review: Phabricator should use IDP for developer account logins - https://phabricator.wikimedia.org/T377061#10738020 (SLyngshede-WMF) @Aklapper I've merged the Phabricator Dev service configuration for IDP. Can you drop me an...
[10:50:03] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[11:37:40] I need some help troubleshooting a private puppet hiera issue. I'd like to use the hiera key `profile::ceph::s3::client::apus_keys` from private puppet in `profile::gitlab`. But puppet can't find the key (neither from labs/private nor from real private).
[11:37:41] `sudo puppet lookup --node gitlab1003.wikimedia.org --compile --explain profile::ceph::s3::client::apus_keys` returns `Function lookup() did not find a value for the name 'profile::ceph::s3::client::apus_keys'`
[11:46:36] jelto: I think that's because the key is defined in a profile-specific file (hieradata/common/profile/ceph/apus_keys.yaml) and hence not loaded when you compile your profile
[11:47:04] I guess you could use the "%{alias(...)}" syntax to alias the value in a file loaded by your compilation
[11:47:38] see also the lookup hierarchy in https://wikitech.wikimedia.org/wiki/Puppet/Hiera
[11:52:18] ah that makes sense, thank you! So I could try something like
[11:52:18] `profile::gitlab::s3_credentials: "%{alias('profile::ceph::s3::client::apus_key')}"` ?
[11:57:02] yes I think so in hieradata/common/profile/gitlab.yaml
[12:07:28] I'll try that thank you!
[12:09:26] jelto: you should be able to check it with PCC sending a patch for the labs/private repo and one for puppet and running pcc passing the GERRIT_PRIVATE_CHANGE_NUMBER
[12:09:43] to compile with your unmerged public private repo patch
[12:13:25] The dummy value is already in labs/private. I'll try to verify that with PCC, thank you. But I'm fighting another puppet issue regarding unresolved types currently
[12:13:39] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[12:13:41] ack :)
[13:09:31] volans: is there some special trick when aliasing a more complex data structure? My alias fails with "parameter 'object_storage_credentials' expects a Ceph::S3::Account ... value, got String"
[13:09:31] See the -1 in the newest change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136359
[13:14:58] mmmh jelto from the docs "The alias function lets you reuse Hash, Array, Boolean, Integer or String values." so your type is a hash and should work in theory, let me check
[13:15:05] [1] https://www.puppet.com/docs/puppet/7/hiera_merging#interpolation_functions-alias-function
[13:15:50] I would have put the new key in the private repo btw
[13:17:24] I'm not sure if the fact that we have a custom type breaks something, maybe jhathaway knows more
[13:18:48] I see it used with a list of custom types for example
[13:22:27] what do you mean by putting the new key in private repo? In https://gerrit.wikimedia.org/r/c/labs/private/+/1132643 I added profile::ceph::s3::client::apus_keys
[13:26:08] I would have probably put the alias one too in the private repo, but up to you, not sure if we have a standard tbh
[13:53:51] Mail, Infrastructure-Foundations: Trouble reaching Microsoft email domains - https://phabricator.wikimedia.org/T390307#10738970 (nisrael) Thank you Jesse I appreciate the help. I just got access to SNDS last week from Brendan!
[14:53:39] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[15:03:29] I've sent the homer-deploy patch to release 0.9.0 if you want to double check
[15:09:12] volans: can you deploy https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1136150 at the same time ?
[15:11:41] sure
[15:11:46] <3
[15:11:56] can be merged?
[15:12:00] Just done
[15:12:21] <3
[15:23:12] topranks, XioNoX: ok to release homer? are you running anything?
[15:23:36] volans: nope not me
[15:23:49] volans: go for it!
[15:23:55] great, proceeding
[15:26:59] ok 0.9.0 deployed, I'm running the usual diff * to be on the safe side, but the best test would be to commit something that is the same on multiple hosts
[15:45:20] as a side benefit we should not get anymore 100 times:
[15:45:20] WARNING:homer.capirca:Netbox capirca.GetHosts script is > 3 days old.
[15:45:23] in the homer emails :D
[15:47:09] lmk if you have any issue with any of the next commits
[16:05:00] jelto: did you get the alias issue sorted out, can't tell from the patches?
[16:07:38] msw2-codfw.mgmt.codfw.wmnet keeps failing fwiw
[16:10:06] XioNoX: I think that your patch generates some diffs, not sure if those need to be committed or not
[16:10:26] or we had already pending diffs
[16:12:04] cr*-ulsfo.wikimedia.org is a good candidate to test the 'all' new commit feature (two line diff)
[16:12:25] jhathaway: unfortunately not, I'm still debugging the issue of missing hiera keys and types. I'll move the hiera key around a bit to match https://wikitech.wikimedia.org/wiki/Puppet/Hiera#wmflib::expand_path. But not before tomorrow
[16:13:39] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[16:16:08] ok sounds good jelto, an alias function can only alias data that is already viewable by the node, so you wouldn't be able to alias data from another profile, unless that profile was included in your role
[16:17:44] okay thanks for the clarification :)
[16:19:03] jhathaway: really?
[16:19:16] pretty sure
[16:19:38] but I can test and re-verify my memory, I don't use aliases very often
[16:22:40] my understanding was that the main use case was to alias hiera from other parts of hiera because they were not accessible, otherwise a lookup() would already work no?
[16:24:44] yes, a normal lookup would work, the use case is avoiding logic in puppet, "looks up a key using Hiera, and uses the value as a replacement for the enclosing string. The result has the same data type as what the aliased key has - no conversion to string takes place if the value is exactly one alias"
[16:25:02] so it is still a regular hiera lookup, it uses the same hiera config
[16:27:28] je.lto, if that's correct I'm sorry to have sent you towards the wrong route, sigh :/ My memory was fairly confident alias() in hieradata/ was meant exactly for this use case.
[18:53:39] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[20:02:19] v.olans & j.elto I confirmed that a puppet alias does not allow you to access hiera data that you could not otherwise access via lookup. However, my statement about profile data was not correct. If your profile data is using the expand path hierarchy, then that data may be looked up by any node. But, if for instance you want to access data under another role, that is not
[20:02:21] possible, since our hiera.yaml interpolates the role value for a node.
[20:13:39] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[21:04:15] Mail, Infrastructure-Foundations: Trouble reaching Microsoft email domains - https://phabricator.wikimedia.org/T390307#10741429 (jhathaway) >>! In T390307#10738970, @nisrael wrote: > Thank you Jesse I appreciate the help. I just got access to SNDS last week from Brendan! Oh, great. I haven't heard anyth...
[21:16:00] Puppet, Infrastructure-Foundations, Keyholder, SRE, Patch-For-Review: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10741473 (jhathaway) Open→Resolved a:jhathaway
[22:53:39] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
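A toy sketch of the alias behaviour confirmed at 20:02: an alias is resolved through the same hierarchy as a plain lookup, so it can only reach keys the node could already look up, and it keeps the aliased value's type. This is a simplified Python model, not Hiera's implementation. The two-layer hierarchy, the role names and the `profile::example::*` keys are all made up for illustration.

```python
# Toy model only: NOT the real Hiera code. The layers, role names and
# keys below are hypothetical and exist just to illustrate the idea.

ALIAS_PREFIX = "%{alias('"
ALIAS_SUFFIX = "')}"

DATA = {
    # Data every node can read (stands in for a "common" layer).
    "common": {
        "profile::example::shared_keys": {"access_key": "dummy", "secret_key": "dummy"},
    },
    # Data selected by interpolating the node's role into the hierarchy.
    "role": {
        "role_a": {"profile::example::role_a_only": ["x", "y"]},
        "role_b": {
            "profile::example::creds": "%{alias('profile::example::shared_keys')}",
            "profile::example::broken": "%{alias('profile::example::role_a_only')}",
        },
    },
}


def visible_layers(data, role):
    """Layers a node with this role can read, most specific first."""
    return [data["role"].get(role, {}), data["common"]]


def lookup(data, role, key):
    """Look up key for a node; follow a whole-string alias via the same layers."""
    for layer in visible_layers(data, role):
        if key in layer:
            value = layer[key]
            is_alias = (
                isinstance(value, str)
                and value.startswith(ALIAS_PREFIX)
                and value.endswith(ALIAS_SUFFIX)
            )
            if is_alias:
                target = value[len(ALIAS_PREFIX):-len(ALIAS_SUFFIX)]
                # The alias is just another lookup with the same hierarchy.
                return lookup(data, role, target)
            return value
    raise KeyError(key)


# Alias to a key in a layer the node can read: works, and the Hash stays a dict.
creds = lookup(DATA, "role_b", "profile::example::creds")

# Alias to a key that only exists under another role: fails like a plain lookup.
try:
    lookup(DATA, "role_b", "profile::example::broken")
    broken_resolved = True
except KeyError:
    broken_resolved = False
```

In this model `creds` comes back as a dict rather than a string, and `broken_resolved` ends up `False` because the aliased key lives under a role the node does not have.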