[09:26:06] Who could potentially help with reviewing a bit of Go?
[09:29:57] <_joe_> jayme: kormat for sure
[09:30:12] <_joe_> jayme: I'm about to embark on reading your CRs too
[09:30:43] <_joe_> jayme: and IIRC klausman is also knowledgeable
[09:32:00] _joe_: ah, it's probably fine with you then. I was trying to lift that cargo from you :)
[09:35:39] Lumen is doing a great job, we have cr2-eqiad - cr2-esams transport down as last week, plus emergency maintenance cr1-codfw - cr4-ulsfo
[09:41:20] yep...
[09:41:29] esams link should be back tomorrow
[10:12:41] one of the puppet compilers is out of disk space: T295253
[10:12:42] T295253: compiler1003.puppet-diffs.eqiad1.wikimedia.cloud out of disk space - https://phabricator.wikimedia.org/T295253
[10:18:08] https://twitter.com/DEVOPS_BORAT/status/141551618192703488
[11:10:46] jbond_: is it ok if I remove some stuff from puppet-compiler/output on compiler1003? (+7d results), it's out of space
[11:11:34] I think the cleanup jobs defined in https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/puppet_compiler.pp#L19 are broken
[11:12:01] dcaro: I decided to be bold and clean things up myself
[11:12:07] xd
[11:12:51] it should be daily, but systemctl list-timers says the last execution was 8 months ago
[11:16:11] the underlying service seems to be failing because it tries to remove directories
[11:27:23] I think it might be due to the service being in failed state, rather than 'inactive'
[11:27:31] so the timer did not trigger
[11:27:46] (as it gets triggered only when the service is inactive), but I'm not sure xd
[11:28:07] <_joe_> dcaro: that is incorrect, at least as far as I remember
[11:28:23] what part
[11:28:26] ?
[11:28:41] <_joe_> the unit being in a failed state doesn't prevent further runs, unless you set a specific configuration option, which I don't think we do
[11:29:13] it's not preventing further runs, but the timer has OnUnitInactive as the trigger, and I think failed counts as 'active'
[11:29:21] so the timer is not triggering
[11:29:39] <_joe_> yes, sorry, I assumed it was OnCalendar
[11:30:01] <_joe_> in that case, I'm not 100% sure, let me check the docs
[11:30:11] it's not clear to me either :S
[11:31:18] <_joe_> because ofc the behaviour is different between OnUnitActiveSec and OnUnitInactiveSec
[11:32:43] <_joe_> dcaro: uhm, I actually used OnUnitInactiveSec and I see the timer firing even if the previous run failed
[11:32:44] looking at the code (on which I don't have much experience) it seems to distinguish between active, inactive, failed, reloading, and such states as different
[11:33:37] interesting
[11:34:18] <_joe_> dcaro: "Defines a timer relative to when the unit the timer unit is activating was last deactivated." is what the docs state, which seems to align with my interpretation
[11:35:03] point is, is a unit in 'failed' state considered 'active' or 'inactive'?
[11:35:17] <_joe_> inactive IIRC
[11:35:18] <_joe_> but
[11:35:21] when doing a list-units, failed ones show up, but inactive ones don't
[11:35:27] <_joe_> https://github.com/systemd/systemd/issues/6680 reports a bug
[11:35:35] when doing list-units --all, all show up, but it does not say if failed ones are active/inactive
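A minimal sketch of the timer/service pattern under discussion, for reference; the unit names, path, and cleanup command are hypothetical, not the actual delete-old-output-files units. Per the systemd.timer documentation quoted above, OnUnitInactiveSec is a monotonic trigger: it schedules the next run relative to when the activated service last became inactive, so it needs at least one completed run as a reference point, which is what an OnActiveSec bootstrap provides:

    # cleanup-example.timer (hypothetical)
    [Unit]
    Description=Periodic cleanup, illustrating a monotonic trigger

    [Timer]
    # Bootstrap: fire once, 1s after the timer unit itself starts, so the
    # service gets the initial run the monotonic trigger needs.
    OnActiveSec=1s
    # Then re-fire one day after the service last deactivated.
    OnUnitInactiveSec=1d

    [Install]
    WantedBy=timers.target

    # cleanup-example.service (hypothetical)
    [Unit]
    Description=Remove week-old output files

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/find /srv/example/output -mtime +7 -delete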
[11:36:18] <_joe_> dcaro: where are you looking at this issue?
[11:36:32] compiler1003.puppet-diffs.eqiad.wmflabs
[11:37:40] <_joe_> dcaro: no, I think the problem is what is reported in that bug, at least partially
[11:38:10] <_joe_> delete-old-output-files.service isn't failed, but still it never fired
[11:38:29] I restarted it manually, but it was failed before
[11:38:36] <_joe_> oh I see
[11:38:38] <_joe_> still
[11:38:47] <_joe_> not scheduled
[11:39:00] <_joe_> so there is another issue at play
[11:41:19] <_joe_> because in the past we had an issue exactly with OnUnit[In]ActiveSec timers
[11:41:36] <_joe_> that they need an initial run of the software to be activated, else they never fire
[11:45:36] <_joe_> dcaro: I feel like there is something obvious here I should see and I don't
[11:47:54] <_joe_> ok, I think something's wrong with the timer definitions
[11:48:38] <_joe_> OnUnitInactiveSec=daily
[11:48:40] <_joe_> OnActiveSec=1s
[11:48:47] <_joe_> yup, this is quite strange :P
[11:50:42] that's coming from https://github.com/wikimedia/puppet/blob/8ec00591b29af9350b1b0ab77d90c584b108757d/modules/systemd/manifests/timer/job.pp#L162
[11:51:52] oh, grafana 8.2 added fiscal year support. https://grafana.com/docs/grafana/latest/whatsnew/whats-new-in-v8-2/#dashboards
[11:52:05] <_joe_> majavah: that's not the part I was looking at as "strange"
[11:52:05] can't wait to use it for wmf 🙃
[11:52:41] <_joe_> "daily": I don't remember it being a valid identifier
[11:53:42] I created a timer on my laptop with the same content (changing daily to 30), and it's still not triggering xd
[11:53:59] gtg for lunch, but I'll be back later, let me know if you find anything
[11:54:12] <_joe_> dcaro: yeah, I have to finish a lengthy CR first
[11:54:25] I think "daily" does work for OnCalendar
[11:56:45] <_joe_> majavah: well, me looking at `systemctl show` says otherwise :)
[11:57:07] <_joe_> compiler1002:~$ systemctl show delete-old-output-files.timer | grep Timer
[11:57:09] <_joe_> TimersMonotonic={ OnActiveUSec=1s ; next_elapse=5.726537s }
[11:57:21] that is not OnCalendar :P
[11:57:46] <_joe_> majavah: sorry, I misread
[11:58:03] <_joe_> and yes, daily works with OnCalendar but not with monotonic timers
[12:00:13] <_joe_> majavah: https://gerrit.wikimedia.org/r/c/operations/puppet/+/737350
[12:01:24] <_joe_> also dcaro
[12:01:41] <_joe_> I'm not fixing the other issues though
[12:02:48] _joe_: lgtm
[13:46:48] _joe_: thx
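To make the parsing issue concrete, a hedged sketch (the actual diff in the Gerrit change above isn't reproduced here): "daily" is an OnCalendar shorthand for *-*-* 00:00:00, while monotonic directives such as OnUnitInactiveSec expect a time span like 1d, so systemd cannot parse the value below and the daily trigger is effectively lost:

    # Broken: a calendar shorthand fed to a monotonic directive
    [Timer]
    OnActiveSec=1s
    OnUnitInactiveSec=daily   # not a valid time span; rejected at parse time

    # Fixed, keeping monotonic semantics:
    [Timer]
    OnActiveSec=1s
    OnUnitInactiveSec=1d      # one day after the service last deactivated

    # Or, with calendar semantics instead:
    [Timer]
    OnCalendar=daily          # equivalent to *-*-* 00:00:00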
[13:58:58] poor PCC really struggles when you get it to run against 30 hosts
[14:04:02] kormat: PCC is configured to use 4 threads, which means it only compiles max 4 catalogs at a time, i.e. two hosts (each host has a prod and a change catalog). so the run time will roughly double for each additional 2 hosts.
[14:05:47] it's.. somehow quadratic?
[14:05:54] it's taking 10 minutes for 30 hosts
[14:12:27] kormat: currently in a meeting so can comment more later. however, keep in mind not all hosts are equal. e.g. pcc for alerts1001 is likely 10x longer than pcc for sretest
[14:38:14] <_joe_> kormat: no, it's linear in theory
[14:38:26] _joe_: ah ok. that makes a lot more sense.
[14:38:30] <_joe_> but varies depending on the size of the individual catalog
[14:38:40] sure
[14:39:18] <_joe_> also depends on the performance of the vm/disk/etc in the moment, which is influenced a bit by external factors
[14:39:38] fyi I just checked and the num threads is only 2 (not 4)
[14:40:17] <_joe_> jbond_: that is the default when using e.g. pcc and is a relic of a time when we only had one compiler, I think
[14:40:19] I also checked a bunch of random reports and as far as I can tell it is scaling in line with this
[14:40:53] _joe_: yes, we can definitely look at increasing it. it will depend on however many cores the vm has
[14:40:59] https://puppet-compiler.wmflabs.org/compiler1002/32208/ is the latest run of mine
[14:41:04] afaik it is not really related to the number of compiler hosts
[14:42:20] kormat: fyi I did a couple of runs splitting the hosts into batches of 15 hosts and it does take ~half the time for each batch
[14:42:23] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32217/
[14:42:32] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32218/
[14:43:34] <_joe_> kormat: so in your case I'd expect 30 hosts == 15 puppet compilations plus 15 diffs per thread.... I would imagine 15/20 minutes would be expected
[14:44:10] <_joe_> so if it takes less, I guess the limit is your patience, which is quadratic :)
[14:44:24] to clear up any confusion, the only reason I asked if it was somehow quadratic was based on jbond_'s statement that "run time will roughly double for each additional 2 hosts."
[14:44:42] looks like the current vm hosts do have 4 cores, so we should at the very least be able to up num_threads to 4 as a quick win
[14:44:43] so, yeah, it seems to be pretty linear. just painfully slow.
[14:44:53] <_joe_> kormat: that's puppet for you
[14:44:59] <_joe_> but yes, what john said
[14:45:02] heh, emphasis on the pain
[14:45:38] <_joe_> jbond_: I think I saw some work by dcaro and you to be able to run pcc on your own workstation, right?
[14:46:11] kormat: yes, badly phrased, sorry; I meant based on the number of hosts (teach me to respond while in a meeting ;))
[14:46:35] _joe_: yes there is, it still doesn't solve the puppetdb issue
[14:46:44] however most use cases don't need puppetdb
[14:47:44] in relation to puppetdb, I think we may be able to write a puppet terminus that somehow fakes responses to resource queries (e.g. exported resources), however the indirector documentation is not good even by puppet standards
[14:47:59] * kormat snorts
[14:48:16] <_joe_> jbond_: I think we should just make it possible for people to populate a puppetdb locally
[14:48:21] <_joe_> like we do on the compiler
[14:49:54] _joe_: we should be able to get to that point fairly easily tbh. me and dcaro are looking at pcc on a weekly basis at the moment (thu/fri) so will see if we can at least add some documentation.
[14:50:18] <_joe_> my original idea was to have a docker-compose recipe
[14:50:27] <_joe_> to allow people to run it while doing dev work
[14:51:11] yes, that would be ideal, but updating puppetdb becomes the issue. the populate-puppetdb task takes quite some time (could probably be optimised)
[14:52:06] but perhaps we could have some container which runs pcc and then connects to a public puppetdb instance running in cloud ???
[14:52:48] <_joe_> jbond_: heh, maybe too, yep
[14:53:00] <_joe_> but also, one can run that task at night tbh
[14:53:14] <_joe_> we don't really need to fix everything to be perfect
[14:53:25] having something in cloud would also give us a place to collect reports centrally
[14:53:53] but yes, true, will have a chat with david on thu. I think either way we should be able to get some sort of PoC for people to play with and bash
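A rough back-of-the-envelope model of the scaling described above; the ~20 s per catalog is inferred from kormat's observed 10-minute run, not measured directly:

    hosts    = 30
    catalogs = hosts * 2           = 60        # prod + change per host
    rounds   = catalogs / threads  = 60 / 2 = 30
    runtime  ≈ rounds * t_catalog  ≈ 30 * 20 s ≈ 10 min

Under this model, run time is linear in host count, and raising num_threads from 2 to 4 on a 4-core VM should roughly halve it, consistent with jbond_'s observation that two batches of 15 hosts each take about half the time of one 30-host run.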
[17:13:57] hey Italian speakers, does this look sufficiently related to Wikimedia? https://ferdinando.me/
[17:15:48] mutante: what do you mean?
[17:22:15] I'm guessing someone wants it added to the syndicated blog things?
[17:22:46] ah yeah, it needs to pass a test of being "related to wiki"
[17:22:58] for inclusion in aggregated blog feeds
[17:23:36] I do see an entire section about Wikimania now .. so I think it's a go
[18:39:14] <_joe_> mutante: lemme take a look
[18:40:19] <_joe_> mutante: there is one loosely related entry in every 5 or so
[18:42:02] <_joe_> so not sure it passes the bar tbh
[18:47:43] _joe_: oh! ok, thank you. hmm
[18:49:42] <_joe_> not sure means I'm really not sure, I don't know what's included in planet atm
[18:59:07] we are pretty lax, I scan for keywords like Wikimedia etc. I already merged that one, oh well
[19:00:08] I don't see an ensure option available for prometheus::*_config instances; safe to assume there is something cleaning out unreferenced files then (such as if I rename something)?
[19:00:20] sometimes we have changes like this where it can be made more focused based on tags: https://gerrit.wikimedia.org/r/c/operations/puppet/+/737186/2/modules/planet/templates/feeds/en_config.erb
[19:01:24] hm, guess I have to subscribe to Luis's blog individually now
[19:03:03] ebernhardson: most of them just have "present" hardcoded it seems, I would not say it's safe to assume something removes it, no
[19:04:02] mutante: hmm, ok, I'll have to add some cleanup bits then. thanks
[19:04:07] cluster_config has an option to set to absent, the others don't I guess
[19:05:09] legoktm: heh, perfect example of how it's always too strict for some and too lax for others? :)
[19:06:54] indeed :p lately he mostly posts about software licensing and figuring out support models for FOSS projects, which is probably off-topic, but stuff I just happen to be interested in
[19:57:37] <_joe_> ebernhardson: the only way to know is to check if the directory holding the files is managed with purge => true in puppet
[19:58:01] <_joe_> that way, it will only contain files defined in puppet, and clean up any files that are leftovers
[19:59:31] <_joe_> ebernhardson: and indeed I see
[19:59:32] <_joe_> File[$targets_path, $rules_path] {
[19:59:35] <_joe_> purge => true,
[19:59:37] <_joe_> }
[19:59:43] <_joe_> so yes, leftovers are purged
[20:00:02] <_joe_> (in prometheus::server)
[20:00:26] <_joe_> and you don't need to add cleanup bits
[20:22:43] _joe_: oh interesting, I didn't know puppet had a builtin for that. thanks!
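A minimal sketch of the purge pattern _joe_ quotes from prometheus::server; the directory path here is hypothetical (in the real module it comes from $targets_path and $rules_path):

    # Hypothetical Puppet illustration: a purged directory only keeps
    # files that Puppet itself manages.
    file { '/srv/prometheus/ops/targets':
      ensure  => directory,
      recurse => true,  # manage the contents, not just the directory itself
      purge   => true,  # delete any file in it not declared in the catalog
    }

With this in place, a file generated by a prometheus::*_config resource under an old name is no longer in the catalog after a rename and gets removed on the next agent run, which is why no extra cleanup bits are needed.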