[00:10:36] finished running scripts
[08:42:13] good morning :]
[08:42:33] MediaWiki has a bunch of SQL queries that take more than 30 seconds, such as `(max_statement_time exceeded) (db1175) Function: SpecialRecentChanges::doMainQuery`
[08:42:39] should we file tasks for each of them?
[08:43:12] hashar: there are many tasks on that
[08:43:24] the message also has the full query attached, and sometimes that means the JSON payload sent to logstash is too long; it then ends up truncated and those logs are no longer tagged with type:mediawiki, which essentially hides them
[08:44:53] hashar: https://phabricator.wikimedia.org/search/query/rYa2hyV875_o/#R
[08:47:53] RhinosF1: ahhh nice
[08:49:55] hashar: the best person to handle this is probably going to be Amir, but he may be out for some time ATM
[08:54:14] jynus: good point :)
[08:54:38] we both happen to attend the same meeting on Thursday, so I guess I will speak a bit about that with him :]
[08:54:40] thx!
[08:54:45] by "handle" I mean provide feedback
[08:55:11] obviously the bugs should be fixed by the respective code maintainers :-)
[09:39:03] godog: FWIW, the sandbox branch sandbox/filippo/pontoon-swift can't currently be rebased onto production (error: could not apply 82f37ec5b5... pontoon: HACK disable safe-service-restart); so I'm going to try to work on top of that branch as-is. But the docs suggest that in general this isn't a good state of affairs for a pontoon branch
[09:39:54] [side q: if I want to make changes to that branch, will git review -R still DWIW for getting review?]
[09:44:32] jynus: I have some questions for you wrt host backups (not DBs); where's the best place to ask: here, another channel, or in private?
[09:45:05] here is ok, but what do you mean by host backups?
[09:45:29] backup::set in puppet for backing up specific directories
[09:45:59] ask here; if you think it could involve discussing private data, PM me
[09:47:21] nothing that private, so: we'd like to add the cumin and spicerack logs on the cumin hosts to the backup, and as they might be useful for later audits or reconstruction of things or even stats, it would be nice to have them kept for longer than the default retention
[09:47:32] from https://wikitech.wikimedia.org/wiki/Bacula#Retention AFAIUI the retention is set per-pool, is that correct?
[09:47:48] sort of
[09:48:16] tell me what you want and I will advise on the best way to achieve it, or how easy it is :-)
[09:49:40] but I'm not sure if backups would be the best way to "keep logs"
[09:50:03] ideally, back up /var/log/spicerack and /var/log/cumin with a long retention. At that point it's to be decided whether it's quicker to just back up the whole /var/log
[09:50:26] bacula is a terrible place to gather stats
[09:51:09] sure, but if needed one can restore the data to a host and then play with it, right?
[09:51:40] it's not for everyday usage, but to keep the data across host refresh/reimage or a broken host
[09:52:14] unfortunately that data can't go in our current logstash and AFAIK we don't have any ETA for the "private" logstash
[09:52:17] can data be restored?
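(Editor's note, not part of the log: a minimal bash sketch of how one might gather the size and churn numbers for the two log directories discussed above, the kind of details jynus asks for later when requesting a ticket. The paths come from the conversation; everything else is illustrative.)

    # Rough size/churn numbers for the spicerack and cumin logs on a cumin host;
    # useful for the "sizes, how frequently it changes" details requested below.
    for d in /var/log/spicerack /var/log/cumin; do
        echo "== $d =="
        du -sh "$d"                          # total size on disk
        find "$d" -type f | wc -l            # number of log files
        find "$d" -type f -mtime -1 | wc -l  # files modified in the last 24h
    done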
sure, but I would advise against trying to use it frequently, because of how storage works
[09:52:34] as in, don't think of it as hadoop-like
[09:52:49] sure, I wasn't planning on it
[09:52:50] it is encrypted and compacted; performance is not a goal
[09:52:56] best-case scenario, it will never be accessed
[09:53:29] and things like different retention policies will be an issue; we will have to set up a separate pool
[09:53:31] (T213902 for reference on the logstash side)
[09:53:31] T213902: Implement sensitive logstash access control - https://phabricator.wikimedia.org/T213902
[09:53:53] yes, that would be the ideal way, and then we can back up on top of it if needed
[09:54:16] my guess is you want to make use of incrementals for efficiency on bacula, is that right?
[09:55:21] the question is, if you want to find out "when this file changed" you will have a lot of pain with bacula
[09:56:14] the questions I'd like to be able to answer are things like:
[09:56:22] do you have a ticket for the overall need?
[09:56:41] I think it will be easier to discuss the options there
[09:56:49] I can open one
[09:57:00] not saying no, it just doesn't look good
[09:57:14] but maybe with more info we can find either how to use it
[09:57:17] or an alternative
[09:57:42] let me open one
[09:57:49] which tag should I add?
[09:57:50] after all, on the team we take care of all things data, so we will likely find a solution among all of us
[09:58:28] "Data-Persistence (Consultation)" is probably a good one
[09:58:34] as you are asking us for advice
[09:59:12] dump the need into the task body with all details (sizes, how frequently it changes, etc.)
[09:59:19] retention, etc.
[09:59:32] ack
[09:59:37] and in a comment the possibility you thought of
[10:00:00] and I will see if bacula is the best way or we can think of an alternative method
[10:00:56] I believe your need is more of a long-term archival one
[10:01:22] and we have no solution for that, except in the specific case of dumps
[10:02:28] no solution doesn't mean it cannot be done, just that there is no canonical/perfect service for that
[10:03:15] ack
[10:03:54] thanks for pinging me on this, I appreciate opening a discussion rather than building it right away
[10:04:43] I'll probably send a patch to add the /home directory, but that will just be using the standard backup::set and doesn't need any special treatment
[10:05:07] sure, if that partially helps, that is ok
[10:05:19] as multiple people have asked to keep the /home on reimage and apparently have important data there
[10:05:24] more backups is always good
[10:05:28] (most of them from this team :-P )
[10:10:34] the thing is to think of bacula as a cold backup thing, and if needs different from "filesystem snapshots" are needed (object storage, dataset storage, statistics, long-term archival), usually a different method has to be built
[10:11:04] hence the dbbackups, mediabackups, dumps, hadoop solutions
[10:11:39] sure, I think about bacula as AWS Glacier
[10:12:12] as the data is there, better if you don't need it, but if really needed then it can be accessed
[10:12:21] yeah, it is very tape-oriented
[10:12:36] even if we don't use tape, so very inflexible
[10:13:28] the problem I see with a log directory is the log rotation; that might make the thing inefficient
[10:15:11] just to be clear, backups are always ok to have, and we can set them up right away
[10:15:50] it is the other needs that didn't seem to fit very well, but I will be waiting for your ticket with more details before saying more
[10:40:04] jynus: I've created T304497
[10:40:05] T304497: Implement persistence of spicerack and cumin logs to survive host reimage/refresh/failure - https://phabricator.wikimedia.org/T304497
[10:41:56] you are completely right that bacula would be very inefficient for logs
[10:42:14] a single line added (not only the renames) would make a new copy of it
[10:42:39] there are no within-file incrementals
[10:52:34] See my comment at: T304497#7799359
[10:52:34] T304497: Implement persistence of spicerack and cumin logs to survive host reimage/refresh/failure - https://phabricator.wikimedia.org/T304497
[13:06:41] Emperor: ack re: sandbox/pontoon-swift, yeah I think for now working on top of that will work; with respect to git review, what I do is flip the patches I want between the sandbox branch and my topic branch, as described in https://wikitech.wikimedia.org/wiki/Puppet/Pontoon#Get_patches_ready_for_review
[13:08:00] feel free to start another sandbox branch on top of current production too if that's easier; some of the patches in my pontoon-swift can be merged in production now I think, I'll look into that today/tomorrow
[13:44:17] I think I want at least some of your changes (but will need to make changes to them too), so I'll make my own branch on top of your sandbox so I can rebase if you rejig your branch later
[13:46:50] Emperor: SGTM, I'll send my production-mergeable changes your way
[14:39:18] hello! As a heads-up, I'd like to delete a bunch of rows from the `user_properties` table that are no longer necessary. The table at T304461 has row count numbers (in the "Mentorship" column) for a few wikis this will affect, but all rows with up_property='growthexperiments-mentor-id' will be affected. My question is: What's the best way to do this kind of thing? Should I create a single-purpose maint script and run it?
[14:39:18] Or runBatchedQuery.php from core? Or just do the DELETE manually?
[14:39:19] T304461: Delete `growthexperiments-mentor-id` properties from user_properties - https://phabricator.wikimedia.org/T304461
[14:43:51] urbanecm: assuming it is not something urgent, I would advise adding the "Data-Persistence (Consultation)" tag to the task
[14:44:11] and add amir or manuel to the task; they may not be around right now
[14:44:30] removing rows without a later optimization may have no actual effect
[14:44:52] plus having a recent backup before running it would help avoid mistakes
[14:45:33] urbanecm: better to use a script with the usual wait for replication and all those safety measures
[14:45:42] running a delete manually is never a good idea in production
[14:52:38] sigh, my branches can't be rebased onto sandbox/filippo/pontoon-swift, there's been too much divergence
[14:53:06] I think making new branches and cherry-picking is going to be less disruptive to my changes-for-prod
[14:53:33] jynus: thanks for the tag info, I wasn't aware of it. I'll add it there.
[14:54:01] marostegui: by a script, you mean a sui generis script for this maintenance work? or the runBatchedQuery.php thing?
[14:55:08] I don't know what that other one does (I can guess by its name), but yeah, a maintenance script implementing the usual safety methods.
[14:55:16] probably Amir1 knows more in depth
[14:55:32] godog: I can't usefully just rebase my branches, because there's too much divergence between prod and your pontoon-swift branch. I'm going to try cherry-picking the changes onto my own sandbox branch to see if it's possible to make progress
[14:55:54] marostegui: okay, thanks.
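(Editor's note, not part of the log and not the pastebin script shared further down: a minimal bash sketch of the batched, replication-friendly delete being discussed, for illustration only. The table, the up_property value and the `sql <wiki> --write` wrapper come from the conversation; the batch size, the sleep, and the idea that the wrapper accepts SQL on stdin and prints the ROW_COUNT() result last are assumptions.)

    # Delete the growthexperiments-mentor-id rows in batches of 100, sleeping
    # between batches so replication can catch up (do not remove the sleep).
    wiki="$1"    # wiki database name, passed as the first argument
    while :; do
        rows=$(echo "DELETE FROM user_properties
                     WHERE up_property = 'growthexperiments-mentor-id'
                     LIMIT 100;
                     SELECT ROW_COUNT();" | sql "$wiki" --write | tail -n 1)
        [ "${rows:-0}" -eq 0 ] && break      # stop once nothing was deleted
        sleep 2
    done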
[14:56:36] urbanecm: it depends on how long it takes; if it's something small, you can do it with sql wiki --write
[14:56:49] but do it in small batches
[14:57:03] like deleting 100 rows per query so replication can catch up
[14:57:19] Amir1: it's thousands of rows (there are wikis with 190k+ of those rows)
[14:57:22] 1% of user_properties
[14:57:30] but if it's more complicated than that, then create a maint script
[14:57:51] urbanecm: I wrote a bash script to do this
[14:57:55] I can share it with you
[14:58:24] would be nice, assuming it's fine to use it (Manuel said a few lines above that manual deletes are not a good idea, and this kinda contradicts that statement)
[14:59:07] "manual deletes are not a good idea" it depends ^^
[15:00:40] My general advice is not to do them, as they lack any sort of wait for replication or batching
[15:06:52] yup. That's why I put them into this bash script
[15:07:04] the bash script I used from time to time is this:
[15:07:14] https://www.irccloud.com/pastebin/lCZK5jOu/
[15:07:21] urbanecm: that'd be useful ^
[15:07:41] thx
[15:08:06] but be careful; with great power, stuff like that
[15:08:14] do not remove the sleep
[15:09:02] suggestion: coordinating on the ticket may be the most efficient way? for example, I also want to be aware of it for the backup side of things 0:-)
[15:30:04] godog: oh, also, you disabled puppet on ms-fe-01 in swift pontoon on Nov 24 2021; is re-enabling it likely to work, or was something broken then?
[16:02:29] Emperor: SGTM re: cherry-picking my changes; on ms-fe-01 I can't remember why puppet was disabled, but re-enabling is fine
[16:03:01] as in, nothing depends on that stack being up/working
[16:11:26] godog: thanks. I'm afraid neither pontoon swift cluster seems quite happy: swift-dispersion-report on thanos complains of ECONNREFUSED from thanos-swift.discovery.wmnet (should it be looking elsewhere?), and on ms-fe-01 it complains that swift-dispersion-populate hasn't been run. Should swift-dispersion-report work in pontoon? The swift ring manager code relies on it as a cluster health check...
[16:23:34] (looking at dispersion.conf it looks rather like it's trying to talk to actual-swift, which is a bit worrying)
[16:23:34] Emperor: yeah, swift-dispersion is supposed to work; it's been a while since I last looked at the stack though, and it looks like it is in a worse condition than I remembered. I'll take a quick look now for an hour or so and see if I can restore the stack
[16:23:49] why worrying?
[16:23:52] godog: thanks, that'd be really helpful :)
[16:24:12] godog: I'd expect swift-dispersion-report from pontoon-swift to talk to pontoon-swift, not real-swift.
[16:28:40] ah, by real-swift I'm assuming you mean the DNS name that swift has in production(?) that line is a bit blurred in pontoon
[16:28:56] I've written down some docs on that at https://wikitech.wikimedia.org/wiki/Puppet/Pontoon/Services
[16:30:50] I'll poke at the swift stack a bit and report back
[16:31:19] Oh, maybe `host ms-fe.svc.eqiad.wmnet` isn't using the right resolver, since it tells me 10.2.2.27
[16:31:30] 👍
[16:59:53] Emperor: going to force-push a branch to puppet-01.swift.eqiad1.wikimedia.cloud FYI
[17:25:12] godog: Sure, thanks. I have all my work in commits locally
[17:46:16] Emperor: ok, now ms-fe is definitely in better shape, swift-dispersion-populate is running ATM and should be done shortly. thanos I'll take a look at tomorrow
[17:46:43] Emperor: I've also pushed a fresh sandbox/filippo/pontoon-swift; you can apply your patches on top of it and force-push
[17:47:01] Thank you, that's really helpful :) I'll have a go at that in the morning
[17:48:02] sure! I have to run a few errands tomorrow morning and will be online in the afternoon for sure
[17:49:24] also, for the record, what I did is turn on service discovery and provision a small instance with the pontoon::lb role
[17:50:38] and a gentle nudge of the pontoon master; your patches will need to change the pontoon puppet master that puppet-01 runs, as opposed to the production puppetmaster::frontend bits
[17:51:13] but other than that the stack should be more or less functionally equivalent to production
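(Editor's note, not part of the log: a rough bash sketch of the workflow agreed on above, rebuilding the local work on top of the refreshed sandbox branch by cherry-picking and then force-pushing. The branch name is the one from the conversation; the remote name, the local branch name, the commit ids, and having direct push rights to the sandbox branch are assumptions.)

    # Rebuild local commits on top of the refreshed sandbox branch, then
    # force-push the result back to it.
    git fetch origin
    git checkout -B pontoon-swift-work origin/sandbox/filippo/pontoon-swift
    git cherry-pick <commit-1> <commit-2>    # the commits kept locally
    git push --force origin HEAD:sandbox/filippo/pontoon-swift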
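(And, under the same caveat, a quick sketch of how one might re-check the resolver/dispersion question raised earlier once the stack is up. The hostnames and tools are the ones mentioned in the conversation; the dispersion.conf key name and the use of sudo are assumptions.)

    # Check what the service names resolve to inside the Pontoon stack (they
    # should point at the pontoon::lb instance, not at production), then see
    # where dispersion is actually pointed and re-run it.
    host ms-fe.svc.eqiad.wmnet
    host thanos-swift.discovery.wmnet
    grep auth_url /etc/swift/dispersion.conf
    sudo swift-dispersion-populate   # only needed once, as noted above
    sudo swift-dispersion-report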