[08:45:15] ^-- should that email address work? It's what's listed at https://office.wikimedia.org/wiki/Team_interfaces/SRE_-_Infrastructure_Foundations/Contact too
[08:52:05] Emperor: it's already sorted, I sent a mail to the users, we'll create a new, separate idm-help@w.o alias for such issues and https://phabricator.wikimedia.org/T382226 to prevent that user error in the future
[08:52:16] user, not users
[08:52:51] 👍
[14:20:00] (caveat: navtiming.py is unowned after the reorg which remains unresolved and I shouldn't be looking at this)
[14:20:02] I noticed:
[14:20:07] > FIRING: [3x] NavtimingStaleBeacon: No Navtiming CpuBenchmark messages in 80d 16h 50m 46s - https://wikitech.wikimedia.org/wiki/Performance/Runbook/Webperf-processor_services -
[14:20:34] but it seems fine at the intake:
[14:20:35] https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=CpuBenchmark&from=now-6M&to=now
[14:21:04] https://grafana.wikimedia.org/d/000000505/eventlogging?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-topic=eventlogging_CpuBenchmark&from=now-90d&to=now-5m
[14:21:21] The alert comes from the intermediary processing service that takes it from kafka to prometheus
[14:21:57] but the prometheus output of webperf_cpubenchmark_* metrics looks fine as well. https://grafana-rw.wikimedia.org/d/cFMjrb7nz/cpu-benchmark?orgId=1&viewPanel=15&editPanel=15
[14:22:38] ref https://www.mediawiki.org/wiki/Developers/Maintainers#Services_and_administration
[14:29:52] Krinkle: the definition of the alert should be https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-perf/webperf.yaml#13
[14:32:16] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%28time%28%29%20-%20webperf_latest_handled_time_seconds%29%20%2F%203600%22,%22range%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000026%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D
[14:32:19] I see.
[14:32:21] Okay, so...
[14:32:26] basically processing switched from eqiad to codfw
[14:32:43] so time() - max(webperf_latest_handled_time_seconds{schema="CpuBenchmark"}) is 41.etc..
[14:32:57] after /3600, it's <1 for the codfw ones
[14:33:00] the eqiad ones are increasing
[14:33:10] which is expected since there is an etcd lock to only be active in one DC at a given time
[14:33:17] it follows the MW switchover automatically
[14:33:21] or is supposed to anyway
[14:33:33] in theory max() should hide that
[14:34:18] if you check in https://thanos.wikimedia.org/ you can see the values for both clusters (webperf_latest_handled_time_seconds{schema="CpuBenchmark"})
[14:34:20] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%28time%28%29%20-%20max%28webperf_latest_handled_time_seconds%29%20by%20%28schema%29%29%20%2F%203600%22,%22range%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000026%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D
[14:34:36] Yeah, my first link above keeps each schema and site separately
[14:34:46] the second does the max by schema, thus picking the "latest" site implicitly.
[14:35:02] Does the alert not use thanos and/or does it run it separately by site?
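For reference, the two explore links above evaluate the following PromQL. A minimal sketch of running them straight against Thanos, assuming the query frontend behind thanos.wikimedia.org exposes the standard Prometheus HTTP API at /api/v1/query and that you can reach it:

  THANOS='https://thanos.wikimedia.org/api/v1/query'
  # Per site and schema: the series for the currently inactive DC keeps aging,
  # so its value grows without bound.
  curl -sG "$THANOS" --data-urlencode \
    'query=(time() - webperf_latest_handled_time_seconds) / 3600'
  # Collapsed with max() by (schema): whichever site handled a beacon most
  # recently wins, so the result stays below 1 (hour) while either DC is processing.
  curl -sG "$THANOS" --data-urlencode \
    'query=(time() - max(webperf_latest_handled_time_seconds) by (schema)) / 3600'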
[14:35:14] This second link shows no value above 1.0h
[14:35:23] yet it alerts claiming to be over 80 days
[14:37:36] Krinkle: if you check https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DNavtimingStaleBeacon you'll see the label "eqiad"
[14:37:55] so yes it should be separate by site
[14:38:12] ok, what if the business requirement is to not separate it by site?
[14:38:38] I'm guessing there's a suppression from a year ago holding back the codfw variant
[14:40:15] the only solution I'm aware of is to move this (back) to alert via Grafana, which I'd probably prefer anyway, but I suppose that's up to the next owner to decide.
[14:41:14] I think that observability can help, but about the ownership - since when is navtiming unowned? Was it communicated to SRE? (First time I hear of it)
[14:44:35] elukey: since the perf team was disbanded.
[14:45:15] most of the things that CPT and Perf owned prior to July 2023 have been unowned since then.
[14:45:19] okok, it would be nice if a reorg took care of things like those though
[14:45:33] anyway, I know it is not the perf team's fault :)
[14:48:40] Krinkle: if we have a prometheus metric indicating the active MW DC, this is pretty easy
[14:52:07] cdanis: hm.. something like {site=$active_dc}?
[14:53:34] or is there some alertmanager-level way to make it conditional on another metric's value?
[14:53:50] I assumed that if the former was an option, max() would suffice
[14:54:05] which we use already, but it's alerting on the raw ops source instead of via thanos.
[14:54:24] in grafana I can set the alert to use the Thanos data source as we do for most MW-related alerts already
[15:14:10] Krinkle: I skimmed the backlog, though: if you need alerts.git evaluated by thanos rather than prometheus you can use # deploy-tag: global in the alert file
[15:14:47] TIL!
[15:14:51] ok, and we can split the yaml file then I guess along that axis?
[15:14:59] one for global and one for ops/dc
[15:15:18] arclamp runs active-active for example
[15:16:32] Krinkle: yes that would work
[15:40:08] TIL global as well!
[15:42:28] last 2 hosts I reimaged (cloudelastic1011 and 1012, both Bullseye) came up with Puppet 5, even though I specified Puppet 7. I didn't set the hiera as recommended by the cookbook, but I thought we were defaulting to Puppet 7 regardless now? Just wanted to make sure it doesn't have anything to do with UEFI
[15:45:53] <_joe_> maybe we should just retire arclamp, if there is no investment in it.
[15:46:08] inflatador: o/ you need to set hiera yes :)
[15:48:18] elukey: but hieradata/role/common/elasticsearch/cloudelastic.yaml:profile::puppet::agent::force_puppet7: true
[15:48:30] and insetup also has all P7 by default, so that seems weird
[15:49:13] they are insetup::search_platform
[15:49:20] maybe I messed up the regex or something?
[15:52:21] insetup::search_platform defaults to P7
[15:52:21] volans: I didn't check, I thought from what inflatador wrote that no hiera was set, this is why I assumed it was missing
[15:52:26] weird
[15:52:31] FWiW, I recently reimaged wdqs1025 (also EFI) and it didn't have this problem AFAICT
[15:55:35] yeah, on Dec 6th, I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1101095 which enabled EFI for wdqs1025 and I reimaged it while it was in insetup. The reimage finished successfully at 2024-12-09 18:06:06 based on cumin2002 logs
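A hedged spot-check (not from the log) for confirming what a freshly reimaged host actually got: the hiera key is the one quoted above, the .wikimedia.org domain for cloudelastic1011 is an assumption, and dpkg/apt/puppet lookup are the stock tools.

  # On the reimaged host: which puppet agent package and version is actually installed?
  dpkg -l 'puppet*' | awk '/^ii/ {print $2, $3}'
  apt-cache policy puppet puppet-agent
  # From a host with the puppet tree and node facts available (assumption: a puppetserver),
  # check what the key resolves to for this node:
  sudo puppet lookup --node cloudelastic1011.wikimedia.org profile::puppet::agent::force_puppet7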
[15:58:31] note that cloudelastic1011/1012 are the 1st hosts to use a partman recipe I wrote (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1103367), but that shouldn't matter, right?
[16:00:54] these are also our first Supermicro chassis. Anyway, the hosts aren't in service so if y'all wanna try and reimage again LMK
[16:04:50] inflatador: in theory vendor and partman recipe shouldn't matter
[16:04:56] inflatador: what makes you say it has puppet 5?
[16:05:09] ii puppet 7.23.0-1~debu11u1 all transitional dummy package
[16:07:08] volans: I fixed it myself via `install-console`... they came up w/out the Puppet 7 repo, I had to manually add it, run `puppet agent`, sign the CSR etc
[16:07:28] you shouldn't do those things manually
[16:07:53] elukey: I wonder if this is the boot order problem where it actually did d-i twice
[16:08:09] I agree, I prefer not to
[16:08:24] hmm, could be
[16:09:36] volans: in theory though, every time I saw double d-i the reimage procedure then got stuck at the CSR
[16:09:39] or failed
[16:09:56] inflatador: did it complete without errors? The reimage I mean
[16:10:19] IIUC you fixed it manually and killed the reimage right?
[16:10:38] if so it is probably double d-i, and it would fit
[16:14:21] inflatador: if this re-happens (namely, reimage stuck etc. at first try) just control-C the cookbook and then kick off a new reimage
[16:14:41] mmm something definitely wrong on some nodes
[16:14:52] `Error: Failed to apply catalog: Cannot convert '""' to an integer for parameter: number_of_facts_soft_limit`
[16:15:38] maybe related to 4b79506d1159d85cdd630116b098001170aece76 ?
[16:16:11] andrewbogott: wdyt?
[16:16:56] fabfur: that's for sure my patch but I don't know why it's happening to you... What host is it happening on?
[16:17:37] currently have alerts for cp7005 and cp5028
[16:18:00] other hosts are confirmed working fine
[16:18:09] (running puppet fine)
[16:18:16] huh, ok, looking...
[16:18:34] inflatador: if you have time can you try to reimage the cloudelastic nodes again?
[16:18:56] fabfur: can I get a fqdn? I don't think I know what 7xxx means :)
[16:19:11] cp7005.magru.wmnet
[16:19:27] cp5028.eqsin.wmnet
[16:19:44] also others at https://puppetboard.wikimedia.org/nodes?status=failed
[16:19:53] but not too many, so not so widespread
[16:20:07] yeah those two popped up on the traffic chan
[16:21:17] so what would result in
[16:21:19] lookup('profile::puppet::agent::facts_soft_limit', {'default_value' => 2048})
[16:21:22] evaluating to "" ?
[16:24:08] oh, those hosts got stuck in a transitional state somehow, the puppet catalog is actually correct
[16:25:32] fabfur@cp7005:~$ puppet lookup 'profile::puppet::agent::facts_soft_limit'
[16:25:32] fabfur@cp7005:~$ echo $?
[16:25:32] 1
[16:27:30] the fix is just
[16:27:30] sudo sed -i 's/^number_of_facts_soft_limit = $//g' /etc/puppet/puppet.conf
[16:27:33] on affected hosts
[16:28:08] why are some hosts affected while others aren't?
[16:28:33] * andrewbogott waves hands
[16:28:43] something to do with puppet altering its own config having a race? I don't know
[16:28:45] hmmh, is that some race where the puppet agent picks up its config while it's being updated
[16:28:49] :D
[16:29:15] I haven't seen that, but then we also rarely change central settings in the puppet.conf
[16:29:16] puppet is too efficient
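A small sketch of what the broken state looks like and the per-host fix, based on the catalog error and the sed quoted above; the 2048 shown for a healthy host is only the lookup default and is illustrative.

  # Affected hosts carry an assignment with an empty value, which the agent
  # cannot convert to an integer:
  grep -n 'number_of_facts_soft_limit' /etc/puppet/puppet.conf
  #   healthy:  number_of_facts_soft_limit = 2048
  #   affected: number_of_facts_soft_limit =
  # The fix removes the empty assignment; the anchored pattern does not match a
  # line that still has its integer, so re-running it is harmless:
  sudo sed -i 's/^number_of_facts_soft_limit = $//g' /etc/puppet/puppet.conf
  sudo run-puppet-agent -q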
[16:31:10] So in theory my sed is safe to run everywhere since it only clobbers that config line if it's already broken. What do you think?
[16:31:58] I guess if there's really a race in applying then it could happen again after removing that line
[16:32:01] but we can iterate :p
[16:32:48] it's just a handful of servers, I'd rather run it only on those
[16:33:26] or use https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed
[16:33:50] oh, nice
[16:34:57] hm I don't immediately see how to compound that with a sed
[16:35:32] right
[16:35:36] the list of hosts is:
[16:35:36] bast4005.wikimedia.org,cephosd2003.codfw.wmnet,cp7014.magru.wmnet,elastic2088.codfw.wmnet,es2026.codfw.wmnet,logstash[1027,1036].eqiad.wmnet,mc1051.eqiad.wmnet,wikikube-worker2144.codfw.wmnet,wikikube-worker[1002,1080].eqiad.wmnet
[16:35:43] oh nice, thank you
[16:36:04] just grepped that line on all of them and got the ones where it's empty
[16:36:22] for future reference... does "run-puppet-agent -q --failed-only" return true/false depending on whether the -q is activated?
[16:38:29] I ran that sed, now re-running puppet on the 11 hosts
[16:40:19] ok, should be all fixed. Thanks fabfur & volans
[16:40:30] thx
[16:40:31] thanks to you for the fix!
[16:40:43] * andrewbogott thinks it's too much to expect the pcc to say "this host compiles without errors 99.7% of the time"
[17:22:00] <_joe_> andrewbogott: not sure I follow
[17:22:53] <_joe_> compiling a catalog is deterministic, do you mean some history of changes?
[17:23:08] _joe_: We ran into an issue that seems to be an occasional race with applying puppet config, I was just making a crack that there's no way the pcc could have predicted it.
[17:23:26] <_joe_> andrewbogott: pcc does nothing for applying the config, it just compiles the catalog
[17:23:32] exactly :)
[17:23:35] <_joe_> so other things you won't know are dependency cycles
[17:23:55] <_joe_> ahh ok I thought it was a feature request :P
[17:24:35] Nope, we could definitely not have tested for this ahead of time :D
[17:30:43] elukey: sorry to ghost, minor failure emergency which is over now ;). Both hosts' reimages stalled/failed at `The puppet server has no CSR for ${fqdn}`
[17:30:54] err... family emergency, that is
[17:31:06] anyway, I can reimage cloudelastic1012 and let you know how it goes
[19:30:27] inflatador: np! Yeah, the failure at the CSR step is a sign of a double Debian install, the same issue we are seeing on other Supermicro nodes, sigh
[19:30:39] I'll reimage 1011 tomorrow as well to check!
[19:32:07] all info in https://phabricator.wikimedia.org/T381919
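A hedged sketch of inspecting the CA state from the puppetserver side when a reimage stalls at `The puppet server has no CSR for ${fqdn}`; it assumes shell access to the puppetserver and the stock Puppet Server CA CLI, and is not part of the reimage cookbook flow.

  # Any CSRs currently waiting to be signed?
  sudo puppetserver ca list
  # Does a CSR or an already-signed cert exist for the host in question?
  sudo puppetserver ca list --all | grep cloudelastic101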