[08:40:05] I want to set up puppet on a new ganeti VM (gitlab2001.wikimedia.org). The ganeti documentation says I have to run install_console FQDN on the puppetmaster (https://wikitech.wikimedia.org/wiki/Ganeti#Update_the_DHCP_config_with_the_MAC_of_the_new_VM). The documentation for server lifecycle says I have to run install_console on cumin (https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Manual_installation). I was wondering what is the preferred way for
[08:40:06] Ganeti VMs?
[08:44:09] jelto: yes, the ganeti VM installation after the makevm cookbook that creates it and gives you the MAC address is still manual (but will not be like that for very long)
[08:44:40] the docs should be up to date AFAIK
[08:46:13] volans: ok thanks
[08:51:57] jelto: it doesn't matter, you can do the initial login from either a puppetmaster frontend of a cumin host
[08:53:07] install_console is just a small wrapper around a special SSH key for early install, see profile::access_new_install in puppet if you want to have a closer look
[08:53:22] moritzm: thanks for the clarification!
[08:53:53] ah, and the above should be "puppetmaster frontend or a cumin host" :-)
[16:16:33] kormat: two of the three manual tagged runs completed
[16:17:28] [18:21 UTC] krinkle at mwmaint2002.codfw.wmnet in ~$ mwscript purgeParserCache.php --wiki=aawiki --age=1814400 --msleep 200 --tag pc2
[16:17:28] Deleting objects expiring before Tue, 29 Jun 2021 18:21:46 GMT
[16:17:28] ... 100.0% done Done
[16:17:28] [04:16 UTC]
[16:18:00] that's almost exactly 8 hours for 1 shard
[16:18:08] including all the delays and stuff
[16:18:18] but, at some point the scheduled run started running alongside my three tagged runs
[16:18:27] which I regretfully did not anticipate / think about
[16:18:40] so one of the servers got double the deletion workload
[16:20:08] I'll summarize all this on-task
[16:23:19] Krinkle: ok, thanks :)
[16:33:16] kormat: https://phabricator.wikimedia.org/T282761#7187777
[16:33:32] next steps, I think we should update the cron jobs to use the tags, right?
[16:34:40] I estimate a normal run would take more than 8h (since there was less than 24h between my runs so it got an unfair advantage) and less than 17h (since it was cut off and didn't actually complete the first time).
[16:34:45] but looks optimistic!
[16:34:58] Krinkle: i'm kinda suspicious that everything is running so fast
[16:36:24] well. i guess all the data we have so far is 'dirty'
[16:36:36] there hasn't been a steady state to evaluate things against
[16:36:53] i'm certainly not going to argue if we are comfortably under 24h
[16:37:43] Krinkle: +1 to switching to 3 tagged cronjobs
[16:37:52] yeah, I'm not 100% sure either.
[16:37:52] https://grafana-rw.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&forceLogin=true&from=now-30d&to=now&var-server=pc1008&var-port=9104
[16:38:03] This looks weird as well, the deletion rate was never 0, which seems impossible.
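For reference, a minimal sketch of what the three tagged runs above amount to, reusing the --wiki/--age/--msleep values from the quoted manual run and assuming pc1/pc2/pc3 as the shard tags; the plan being discussed is one scheduled job per shard rather than an ad-hoc loop, and the log file names are made up for illustration:

```sh
# Hypothetical manual version of the per-shard purges discussed above:
# one purgeParserCache.php invocation per parsercache shard, restricted by --tag.
for tag in pc1 pc2 pc3; do
  mwscript purgeParserCache.php --wiki=aawiki --age=1814400 --msleep 200 --tag "$tag" \
    > "purge-${tag}.log" 2>&1 &
done
wait  # the three shards purge in parallel; the slowest shard sets the total runtime
```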
[16:38:06] normally i'd prefer to not do this shortly before everyone is gone for a week, but the chances that it blows up during that week if we _don't_ do it seem higher
[16:38:16] I checked for any unexpected processes but didn't find anything else
[16:38:50] although it doesn't help that the cli-wrapper for systemd creates 4 processes that almost all look the same :D
[16:39:12] www-data 0:00 /usr/bin/python3 /usr/local/bin/mw-cli-wrapper /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=1814400 --msleep 200
[16:39:12] www-data 0:00 /bin/sh -c /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=1814400 --msleep 200
[16:39:12] www-data 0:00 /bin/bash /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=1814400 --msleep 200
[16:39:12] www-data 2:41 php /srv/mediawiki-staging/multiversion/MWScript.php purgeParserCache.php --wiki=aawiki --age=1814400 --msleep 200
[16:39:30] why take one shell when you can have two
[16:39:43] yeah i saw that :)
[16:40:14] > chances that it blows up during that week if we _don't_
[16:40:15] exactly
[16:40:47] kormat: is "pc1/2/3" somewhere in puppet or hiera, or shall we just hardcode for now?
[16:41:14] just hardcode
[16:46:41] ack
[16:49:55] kormat: any reason not to start them at the same time?
[16:50:16] Krinkle: i've been trying to figure that out; i can't think of one
[16:50:18] like, would it help if they start 30min apart, in case something bad happens and you get paged
[16:50:27] and then have 30min until the next 30% blows up
[16:51:14] (or to disable it etc.)
[16:51:52] i think it's a toss-up, and would probably just go for starting them all at the same time for simplicity. it would at least make the graphs easily comparable
[16:52:02] if i'm wrong, we can always change it down the line
[16:52:45] ok
[16:53:58] +1'd. if you're happy with it as-is, i can merge it and we'll have it run tonight
[17:12:28] heya, looking for another SRE +1 on btullis' ops group membership
[17:12:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/702424
[17:12:46] BTW, Ben is a new SRE on the data eng team :)
[17:17:29] welcome btullis! (I'm an export from the team, now on platform engineering :-) )
[17:17:32] kormat: yes
[17:17:38] s/the team/the SRE team/
[17:18:37] kormat: so I'm looking at the mysql stats and I don't understand where the 400 deletes per second are coming from right now on pc2008. The scheduled run is still on pc2007 I believe, and this is confirmed by the DELETE queries I see on that one. There are no DELETE queries on pc2008, but there are 400 delete handler stats. In Eqiad this seemingly unexplained base level didn't exist.
[17:19:13] https://grafana-rw.wikimedia.org/d/000000273/mysql?orgId=1&forceLogin=true&from=now-2d&to=now&var-server=pc2008&var-port=9104
[17:20:27] huh. pc2008 is still replicating from pc1008
[17:20:44] oh, right. we're still on circular replication.
[17:22:43] kormat: yep, I'm planning to leave parsercache like that till we are back in eqiad, unless you prefer it not to be like that
[17:22:52] I'm planning to disconnect only s1-s8
[17:22:53] marostegui: WFM
[17:23:04] good!
[17:23:13] Krinkle: i've no idea re: those deletes, maybe marostegui knows
[17:23:44] I can check binlogs tomorrow
[17:24:38] marostegui: i can already feel your excitement at the prospect. gives you a reason to get up in the morning :)
[17:24:52] :-D :-D
[17:24:59] hahaha
[17:32:27] apergos: thanks for the welcome - looking forward to working with you.
[17:35:18] :-)
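One way to sanity-check the replication explanation above. This is only a sketch: the pc2008/pc1008 host names come from the discussion, but the FQDN, client access, and the assumption that the deletes arrive via row-based replication (which bumps Handler_delete without any local DELETE statements) are mine.

```sh
# pc2008 should report pc1008 as its replication source, and Handler_delete
# should keep climbing even though no local DELETE queries are visible.
mysql -h pc2008.codfw.wmnet -e "SHOW SLAVE STATUS\G" | grep -E 'Master_Host|Slave_SQL_Running'
mysql -h pc2008.codfw.wmnet -e "SHOW GLOBAL STATUS LIKE 'Handler_delete'"
```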
[18:48:29] legoktm: huh, so.. /var/log/mediawiki/mediawiki_job_parser_cache_purging/ got rm -rf'ed by puppet I think
[18:48:32] and the current job killed
[18:48:42] I.. didn't expect either of those things
[18:48:44] but makes sense I guess
[18:48:54] I think a cron wouldn't have done that
[18:48:59] it would have let the process finish
[18:49:46] is there a clean way to kick them off earlier instead of waiting for 1AM, but with the same logic as normal, such that when 1AM comes around it won't start another one?
[18:50:20] kormat: ^ fyi
[18:51:18] Krinkle: I can just start the services? It won't create duplicate processes if it's already running
[18:52:23] yeah, that part is useful, and known, since they often take more than 24h, and we've (fortunately) seen that it won't start a second one
[18:52:28] oh, wait, the process is still running from last night
[18:52:33] it just doesn't log anywhere now?
[18:53:35] ugh, I was hoping these logs were just rotated plainly, with the last entries kept indefinitely if nothing new is added / gone.
[18:53:48] ● mediawiki_job_parser_cache_purging.service not-found active running mediawiki_job_parser_cache_purging.service
[18:53:56] maybe that's how it would have gone if the ensure absent didn't cascade into the log directory ensure
[18:54:19] "job not-found, status active running"
[18:54:21] story of my life
[18:54:33] stopped it
[18:55:16] and then you want me to start the mediawiki_job_purge_parsercache_pc[123].service?
[18:55:23] yeah
[18:56:00] pc1 hasn't finished a purge now in like a week. keeps getting killed for various reasons. gonna take a very long time now, probably a week :/
[18:57:08] oh damn. the loss of the old log is unfortunate.
[18:57:18] all 3 are running now
[18:58:05] thx
[18:58:53] the best way to ensure persistence of the logs is to send them to mwlog / logstash I think
[19:00:00] that's far too reasonable
[19:00:13] but yes, good point.
[19:01:56] even more unreasonable is that it feels to me that it might *actually*, for once, be the "simplest" solution to solve this with kubernetes. That is, rather than having MW write its CLI script stdout to Logstash in some cases, if these are run in k8s the output would presumably just be ingested and end up in Logstash by higher-level means.
[19:02:12] but coming back down to earth for a minute, I suppose rsyslog can do this today.
[19:05:06] right, just no one has hooked it up to do so
[19:21:21] Filed https://phabricator.wikimedia.org/T285896
[19:21:25] Filed T285896
[19:21:26] T285896: Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896
[22:48:15] Just learned about some undocumented promtool behavior that makes it possible to write unit tests for Prometheus/AlertManager rules that depend on time() (e.g. to alert if too much time has elapsed since the latest event). promtool's time() actually returns the number of iterations since the test began, not the current time: https://github.com/prometheus/prometheus/issues/4817#issuecomment-514765285
[22:48:27] Context: https://gerrit.wikimedia.org/r/c/operations/alerts/+/702477
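To make the promtool behaviour above concrete, here is a minimal self-contained sketch (the alert name, metric name, and threshold are made up for illustration): in `promtool test rules`, time starts at the Unix epoch and advances with the test interval, so time() at eval_time T evaluates to T seconds, which makes time()-based "staleness" rules testable deterministically.

```sh
# Hypothetical example of unit-testing a time()-dependent alert with promtool.
# A metric pinned at 0 looks "older than an hour" once eval_time passes 1h.
cat > alerts.yaml <<'EOF'
groups:
  - name: freshness
    rules:
      - alert: JobTooOld
        expr: time() - job_last_success_timestamp_seconds > 3600
        labels:
          severity: warning
EOF

cat > alerts_test.yaml <<'EOF'
rule_files:
  - alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # "last success" at t=0, never updated afterwards
      - series: 'job_last_success_timestamp_seconds{job="example"}'
        values: '0+0x180'
    alert_rule_test:
      - eval_time: 2h   # time() == 7200 here, so 7200 - 0 > 3600 and the alert fires
        alertname: JobTooOld
        exp_alerts:
          - exp_labels:
              severity: warning
              job: example
EOF

promtool test rules alerts_test.yaml
```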