[10:02:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:56] (lost a couple of deletion races with an admin)
[10:17:25] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:38] any taker for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198962 ? The host is already depooled
[12:47:30] federico3: happy to look, what verification would be suitable?
[12:48:04] essentially only that the hostname is really the one in the subject and task :D
[12:50:28] ack
[12:54:34] tnx
[14:26:15] so there are still a lot of {}.format() calls in the old code, that's why I mentioned it
[14:26:48] Emperor: federico3 do you have a list of those you use?
[14:27:57] or does https://codesearch.wmcloud.org/search/?q=import+transferpy&files=&excludeFiles=&repos= cover all?
[14:28:33] My guess is there are some that use the CLI rather than the Python class
[14:28:49] like wmfbackups does
[14:29:53] afaik the cookbooks in the list on codesearch
[14:30:18] interestingly, I am not sure it indexes GitLab
[14:30:37] Mmm, I'd expect cookbooks to use transferpy by import rather than CLI (and am not aware of counterexamples)
[14:36:44] in your case, your usage seems very simple (and now it only outputs a couple of lines if successful)
[14:37:07] maybe just downloading and sending the result elsewhere would be enough to test it?
[14:43:54] The natural test would be to pick a container and run the cookbook against it (it asks before doing any deletions)
[14:45:35] thanks for the clarification, I wouldn't yet be ready for fully autonomous operation
[14:46:11] sorry if it is a lot of questions, but if it is a deletion tool, what's the role of the transfer?
[14:47:58] and a second question, would you have some time to hand-hold me through such an operation this or next week?
[14:48:13] it needs to copy all three copies of a container DB to the working host in order to do comparisons between them
[14:48:19] I see
[14:48:20] thanks
[14:49:12] I also solved some of the blockers you mentioned
[14:49:52] I'm not sure what the requirements and limitations were when transfer.py was developed but maybe things have changed in the meantime
[14:50:13] well, it had to work on very old versions of Python
[14:50:42] today they may be 15 or 20 years old
[14:52:39] Emperor: https://gerrit.wikimedia.org/r/c/operations/software/transferpy/+/1198501/1/transferpy/Transferer.py
[14:52:44] it seems to be using one netcat instance... if that's the case maybe using multiple connections (e.g. with rclone) could use all available NIC bandwidth and be gentler on network devices
[14:53:10] no, that's not what this tool is for
[14:53:25] this tool's original usage was with xtrabackup - and still is
[14:53:37] so it has to be through a Unix pipe for remote backups
[14:53:54] it is the opposite of trying to be gentle
[14:55:06] and I know you are saying it in good faith, but you are the fifth person to say "you should probably use rclone" without understanding the requirements of the tool
[14:55:38] it needs to do streaming for xtrabackup to work
[14:56:30] but add a patch implementing a --type=rclone copy and I would be happy to add it
[14:56:50] it is NOT a general file-copying tool
[14:57:02] it was built for xtrabackup remote transfers and recoveries
[14:57:27] so you could transfer a live & running DB without stopping it
[14:58:19] then there is the aging and lack of maintenance, e.g. today I would use socat
[14:59:21] that is exactly the kind of thing I was asking people to please not bring up, because it was not in the scope of my patches (making it work for trixie/nftables)
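To make the streaming requirement above concrete, here is a minimal sketch of the pipe-based pattern being described: xtrabackup writes the whole backup as a single stream to stdout, and that stream goes straight into a network socket while the database keeps running, so nothing is staged on local disk. This is not transferpy's actual code; the host, port and xtrabackup flags are illustrative only.

```python
# Illustrative sketch of a pipe-based streaming transfer (NOT transferpy's
# implementation). A real run needs credentials, --target-dir, compression,
# firewall handling, etc.; the receiver would run something along the lines
# of `nc -l <port> | xbstream -x -C /path/to/restore`.
import socket
import subprocess


def stream_backup(receiver: str, port: int) -> int:
    """Stream a live xtrabackup run straight into a TCP socket."""
    with socket.create_connection((receiver, port)) as sock:
        # xtrabackup --stream=xbstream emits the whole backup on stdout,
        # which is wired directly to the socket's file descriptor.
        proc = subprocess.Popen(
            ["xtrabackup", "--backup", "--stream=xbstream"],
            stdout=sock.fileno(),
        )
        return proc.wait()
```

An rclone-style copy operates on files at rest, which is why it does not fit this streaming use case.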
[15:06:11] hey jynus, if you'd like to take a look and give me some feedback: there are a couple of alerts (warning) on backupmon1001 ... some time ago we discussed how to retrieve the plugin output, and here you can see it by using the log link in the alert: https://w.wiki/Fpky
[15:08:21] tappof: that seems fine to me, is there anything you would be looking to comment on? tags? text?
[15:09:22] maybe the name of the alert is a bit unclear, and a custom name could be added, or taken from another property?
[15:09:50] let me see what the Icinga equivalent is
[15:10:46] Yeah, sure, I'd like to add something mentioning that it's (or was) an Icinga NRPE check
[15:11:06] maybe the summary should be the name instead, and leave the check_name as a tag?
[15:11:31] I am not sure
[15:12:01] yeah, maybe an #nrpe2prometheus extra tag or whatever
[15:13:13] imho we can remove those from Icinga quickly, as they are not immediate alerts requiring human intervention right away
[15:13:27] and we have a plan B in the form of a custom dashboard
[15:13:50] so we can do that anytime you feel comfortable with it (for dbbackups checks)
[15:14:30] Ah, I see what you're saying, we lose the text component of it
[15:14:36] as part of the alert
[15:14:54] I mean, we don't lose it, but you know what I mean, embedded as part of it
[15:15:26] not a huge issue for this, but I wonder what other people say, as in this case it is informative
[15:15:41] but in other cases it points to the source of the issue
[15:16:17] and I think some DBAs were a bit uncomfortable with "losing" that
[15:16:28] But I cannot see a way to overcome that
[15:16:38] yeah, it's not embedded anymore, but you can retrieve it directly from the alert
[15:16:46] following the link
[15:16:47] could a bot use it?
[15:16:53] as in, for IRC?
[15:17:00] I think that would alleviate the main issue
[15:17:11] again, not important for this check
[15:17:25] I think it's an interesting suggestion
[15:17:44] I'll file a task for that
[15:17:48] so to be clear, my only suggestion right now is to make the summary the name
[15:17:56] which I assume would be easy to do
[15:18:26] the other is mostly a discussion to be had with other owners, I am happy as it is
[15:19:18] definitely I would keep the text in Prometheus itself (I am not sure if that is being done now)
[15:19:54] I don't know if event.original comes from the script or from Prometheus
[15:20:13] from the script
[15:20:19] but I think it would be important context in other cases
[15:21:05] again, not a blocker for this one in particular, but I am almost hearing the DBAs complaining about it (amir and manuel are not around today) :-D
[15:21:13] he he
[15:23:00] So, IIUC, in this case you'd like the summary to be "check_mariadb_dump_db_inventory_codfw", right?
[15:23:29] well, if it is in the check name, it wouldn't be needed?
[15:24:09] alertname=dump of db_inventory in eqiad or something similar? (spaces -> underscores)
[15:28:12] Aaah, ok... I'll check if it's practical to do quickly ...
[15:28:42] not super important, can be tuned later
[15:28:53] in general looks fine
[15:29:04] I am more worried about losing the text of the alert
[15:29:12] for other alerts
[15:30:08] doesn't have to be in Prometheus (but I hope it would be there), especially for printing to IRC, Slack or other places
[15:33:48] I don't think having it in Prometheus is really feasible, since a new series would be generated for every possible label value, potentially causing high cardinality.
[15:33:54] I think we could discuss a bot/process/job (or something similar) that, given an NRPE-related alert, would retrieve the last entry from Logstash and publish it along with the alert.
[15:34:43] yeah, the internals don't worry me as much as the final alerts that reach us
[15:35:04] but I would suggest waiting until tomorrow to talk to the people who would give a more informed opinion
[15:35:42] backup alerts are easy (important, but not time-sensitive) and are mostly covered by a dashboard
[15:36:20] I would wait to hear from the three DBAs, once they are all back
[15:36:30] sure, meanwhile thanks for sharing your thoughts with me
[15:41:56] it is a pity, because this all comes from a new "model"
[15:42:07] one based around floats
[15:42:27] and around continuous 1-minute granularity checks
[15:42:55] while before we just checked every 30 minutes, and needed no more data than 48 times a day
[15:43:17] it is hard to fit a new model onto an old way of doing things :-/
[15:50:38] I see what you mean. I tried my best to correlate the Icinga timing parameters with the for: entry in the Prometheus alert rules. The retries are not executed once per minute, but rather at least $check_retries times within the maximum period in which Icinga would have definitely identified an alert situation.
[15:50:51] This is partially due to the fact that Prometheus does not have separate timings for checks that are in a normal state versus a pending one.
[15:54:27] If I remember correctly, in this case the checks are executed every 10 minutes... but it's still tunable by adjusting the timing parameters of the NRPE check.
[15:55:26] no worries, I was just commenting that it is a hard problem
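As a rough illustration of the bot/process/job idea floated at 15:33 above, the sketch below looks up the most recent plugin output (event.original) for a given check in Logstash so it could be published alongside the alert. The endpoint, index pattern and exact field mappings are placeholders, not the actual WMF Logstash configuration; only event.original and check_name come from the conversation.

```python
# Hypothetical sketch of a job that, given an NRPE-related alert, fetches the
# last recorded plugin output from Logstash. The URL, index pattern and the
# check_name field mapping are assumptions for illustration only.
import requests

LOGSTASH_URL = "https://logstash.example.org:9200"  # placeholder endpoint
INDEX_PATTERN = "logstash-*"                        # placeholder index pattern


def last_check_output(check_name: str) -> str | None:
    """Return the most recent plugin output stored for an NRPE check, if any."""
    query = {
        "size": 1,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {"term": {"check_name.keyword": check_name}},  # assumed field
    }
    resp = requests.post(
        f"{LOGSTASH_URL}/{INDEX_PATTERN}/_search", json=query, timeout=10
    )
    resp.raise_for_status()
    hits = resp.json()["hits"]["hits"]
    if not hits:
        return None
    return hits[0]["_source"].get("event", {}).get("original")
```

Whatever publishes the alert (an IRC bot, for example) could then append that string, which would address the "losing the text of the alert" concern without storing it in Prometheus.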
[20:28:59] I need to create some tables in extension1 for metawiki and loginwiki, for database tables defined in Extension:CheckUser. I'm not sure which command(s) I need to run.
[20:29:34] https://wikitech.wikimedia.org/wiki/Creating_new_tables#Deployment doesn't talk about extension1