[02:02:52] maybe you can use cron to submit the job
[02:04:47] my tool on grid engine uses crontab and it works
[02:10:13] !log tools.mjolnir Updating uatu to v0.1.13
[02:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.mjolnir/SAL
[06:14:50] You mean cron run on the bastion? (re @pseudoalex: maybe you can use cron to submit the job)
[11:44:14] !log paws updating key method #190 5631062a8a T312096
[11:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[11:44:17] T312096: Key deprecation notes in singleuser - https://phabricator.wikimedia.org/T312096
[12:40:19] !log paws Upgrade pywikibot 7.5.0 -> 7.6.0 #193 456d3f2fe0 T315745
[12:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:40:22] T315745: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T315745
[15:10:42] @Strainu: I would be glad to help you think through options. It's not clear to me yet what you are trying to accomplish, though. There are certainly a number of bot processes running on the Toolforge Kubernetes cluster. Some use the new toolforge-jobs service and others are using hand-built Kubernetes configurations.
[15:11:31] @Strainu: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework and https://wikitech.wikimedia.org/wiki/Help:Toolforge/Raw_kubernetes_jobs may be helpful if you have not seen them before.
[16:32:11] @bd808 basically I want to run code in a way that allows it access to the list of jobs launched by the current user. From my experiments, this does not seem possible when running from a pod.
[16:33:59] @Strainu: for an arbitrary user, or for the tool that is running the code? And grid engine jobs or Kubernetes jobs?
[16:34:23] Kubernetes jobs for the tool that is running the code
[16:37:17] We don't put the `kubectl` command into the images we allow folks to use on the Toolforge Kubernetes cluster. That command would make this a bit easier, I guess. Generally though, kubectl is just an API client and the API is reachable from inside the Kubernetes cluster.
[16:37:52] https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/ is a description of that general task from the upstream docs
[16:38:54] The API URL is https://k8s.tools.eqiad1.wikimedia.cloud:6443
[16:39:29] That URL can be found in the $HOME/.kube/config file for any tool
[16:40:58] Thanks! I'll give it a try a bit later today
[16:46:14] !log testlabs Added komla as a project admin (T315831)
[16:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL
[16:46:17] T315831: Request creation of komla-vps VPS project - https://phabricator.wikimedia.org/T315831
[18:55:22] What's the state of the art of monitoring in Toolforge land these days? Is there a way to expose an error log, or create alerts based upon it? Or have a cronjob update some sort of canary webpage and monitor that?
[19:01:39] tgr_: no and no, unfortunately.
[19:01:57] at least as a Toolforge-provided service
[20:11:28] Sadly, no to all of those. I think that's one of the big weaknesses in Toolforge. All the infrastructure is there, but it's reserved for production; tools are not allowed. It would be possible to stand up Icinga, Grafana, etc. for tools to use, but that would be a silly duplication of effort given that it already exists.
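For the 16:32–16:40 exchange above (listing a tool's own Kubernetes jobs from inside one of its pods), here is a minimal sketch, not an official Toolforge recipe. It assumes the `kubernetes` Python client is installed in the pod's virtualenv and that the tool's kubeconfig is readable at `$HOME/.kube/config`; the `tool-mytool` namespace is a hypothetical placeholder for the tool's real namespace.

```python
# A minimal sketch: list the Kubernetes jobs of the tool this code runs as,
# from inside one of its pods. Assumes `pip install kubernetes` and a
# kubeconfig at $HOME/.kube/config (falls back to the in-cluster service
# account token if that file is not usable).
from kubernetes import client, config


def list_own_jobs(namespace="tool-mytool"):  # hypothetical tool namespace
    try:
        # Reads $HOME/.kube/config, which points at the API URL mentioned
        # above (https://k8s.tools.eqiad1.wikimedia.cloud:6443).
        config.load_kube_config()
    except config.ConfigException:
        # Fall back to the pod's mounted service account credentials.
        config.load_incluster_config()

    batch = client.BatchV1Api()
    for job in batch.list_namespaced_job(namespace).items:
        print(job.metadata.name,
              "active:", job.status.active,
              "succeeded:", job.status.succeeded)


if __name__ == "__main__":
    list_own_jobs()
```

The same listing can also be done with plain HTTPS calls against that API URL, using whatever credentials the kubeconfig references, as described in the upstream "access the API from a pod" doc linked at 16:37.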
[20:14:05] we already had Icinga and Grafana in Cloud VPS once
[20:14:44] there should be tickets about it, like icinga2 replacing icinga back then
[20:16:17] ideally https://phabricator.wikimedia.org/T127367 as well
[20:18:23] there are already cases where something in cloud is monitored by the new prod monitoring, prometheus::blackbox::httpd
[20:18:38] Oh, no, don't get me started about Toolforge's NFS.
[20:20:03] https://phabricator.wikimedia.org/T315695
[20:21:26] AntiComposite: roy649: look at this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/822181/3/modules/profile/manifests/wikifunctions/beta.pp
[20:21:31] shows it's possible
[20:21:39] that is the _new_ type of prod monitoring
[20:21:46] and it is used there to monitor something in beta
[20:21:52] and that is pretty fresh
[20:22:13] presuming you have access to prod monitoring :)
[20:22:23] no, Mary did not need that
[20:22:31] you just need access to code review and tickets
[20:22:45] it was merged by observability
[20:26:21] mutante: I'm not following how that gerrit change maps to "I Can Haz Monitoring in Toolforge"?
[20:26:23] ELI5, please?
[20:27:56] "All the infrastructure is there" is not really true at all inside of Cloud VPS, but I understand why that confuses folks.
[20:29:49] roy649: it only works for the type of monitoring that can be done externally. A public HTTP endpoint can be checked from prod monitoring, which can then notify via mail, IRC, or automatically created tickets. What it can't replace is any type of internal monitoring that actually needs access. But I would argue that in most cases what you _really_ care about is only the external result anyway.
[20:30:48] OK, I get that VPS is another step away, but for Toolforge, it really seems like some basic monitoring and alerting service fits the "shared project hosting environment" concept.
[20:31:31] the point is that in that gerrit code change you can see how that was actually done (for the first time ever using the new monitoring) for a URL in the beta cluster
[20:34:11] But "All the infrastructure is there" really means there's in-house expertise on running the various bits of software. Even if it means having to stand up a new instance of things because the production instances of Icinga etc. are not reachable from Toolforge (and I totally get the requirement for that level of isolation), that's a task that WMF SRE already knows how to do.
[20:34:22] roy649: I don't think anyone on the WMCS team would disagree, but time and other resources have not been found to build something that is multi-tenant and scalable yet. The multi-tenant bit has been the blocker for any "off the shelf" FOSS monitoring stack I've looked into in the past.
[20:34:33] so now it could potentially be just about writing a couple of lines of code, not about reinstalling the entire monitoring system
[20:34:41] Sure, it's a non-zero amount of work to do it, but it gets amortized over all the tools that would take advantage of it.
[20:34:48] all of the WMF production monitoring stuff is single-tenant
[20:35:08] which means it's mostly useless inside of Cloud VPS & Toolforge
[20:36:04] mutante: sigh.
using production monitoring infra for monitoring random tools still isn't scalable (puppet needs SREs to merge), and it's pretty much fully against the realm separation idea we've been working for years to properly implement at this point
[20:36:20] by multi-tenant I mean a system where there are multiple completely isolated end-user compartments, so that UserA and UserB can see different things and not see each other's data.
[20:36:57] conveniently, I was fully ignored when pointing that out on the wikifunctions monitoring task
[20:37:15] There are certainly a large number of multi-tenant SaaS monitoring services, but to my knowledge none of them have self-hostable FOSS versions
[20:38:59] taavi: I would call it being pragmatic. On one side is "may have to wait once for a merge", on the other "no monitoring at all". Effectively she got what she needed, observability was fine with it, and we had no actual scaling issue.
[20:39:35] bd808: finding good self-hostable FOSS software is hard
[20:40:28] a public endpoint can be monitored from anywhere; that does not make it less separated, it's a public endpoint either way
[20:41:16] mutante: sure, but if the contact team there were not abstract-wikipedia and instead 2 random volunteers, there would be a lot more complexity to deal with
[20:41:24] and monitoring the public endpoint is what people care about most of the time, as opposed to some internal metrics like disk space
[20:42:15] bd808: I don't really see why, it would be a different email address or channel
[20:42:19] it's nice that a Foundation team can take advantage of this testing system, but in no way is it scalable to 1000 Toolforge maintainers
[20:42:23] every tool has an email
[20:42:51] You'd just need to magic the tool name into an email address
[20:43:31] RhinosF1: and then your monitoring of Toolforge depends on configuration and services from inside of Toolforge...
[20:44:17] bd808: but if all of Toolforge goes down, you've got a bigger problem than a random wiki user's tool webpage not being quite right
[20:44:27] And they probably can't do anything anyway
[20:45:25] mutante: I've spent probably dozens or hundreds of hours of my free time at this point cleaning up such 'convenient' hacks to make beta more maintainable and reliable. Pretty much everyone else constantly undoing that progress is the reason I don't touch beta anymore.
[20:46:11] taavi: adding monitoring in production to monitor a tool in beta had zero impact on the setup of beta itself
[20:46:27] there was no "hack"
[20:46:48] it was just using the brand-new puppet classes provided and encouraged by the team that is dedicated to monitoring
[20:47:27] and it was merged by them, which meant an intern was unblocked in their valuable work
[20:51:27] Just out of curiosity, what's the total amount of traffic through Toolforge? Say, how many HTTP requests per second does the front end serve?
[20:51:39] I'm just trying to get a vague idea of what "scalable" means.
[20:52:48] it's not traffic that's the problem
[20:56:14] mutante: I consider using the blackbox define on role::alerting_host alone a hack, since it creates duplicate checks (2*n*m instead of n, where n is the number of Prometheus hosts in a core DC and m the number of Icinga hosts), not even counting the one-off check params such as the use of the HTTP proxy
[20:56:59] plus there's the system going against the idea of https://wikitech.wikimedia.org/wiki/Cross-Realm_traffic_guidelines
[21:01:16] roy649: iirc we have ~1.5k tools with a running webservice, generating ~50-100 req/s on average
[21:02:14] ok, thanks. I totally get that's not a full picture of how things need to scale, but it's a start.
[21:02:16] plus that's not counting tools that only have cron jobs or similar, which may want to be monitored too
[21:02:20] taavi: I was merely trying to share that there is a new system that works and has just been used by others to monitor beta (_after_ a discussion with entire teams about how to do it). If you want to argue about the existence of that system, or have suggestions for how they can optimize it and avoid duplicate checks, that would have to go to the dedicated monitoring team of the WMF (which I am not on).
[21:02:26] about the traffic guidelines: accessing a public endpoint is not creating a connection between the two
[21:12:49] I am tired. I am merely trying to point out something cool that was helpful for people, and then it's this type of reaction. I'll just refrain from the topic.
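Tying together tgr_'s 18:55 canary question and the 20:42 "every tool has an email" idea, here is a minimal self-serve sketch, not an existing Toolforge service. It assumes a scheduled job (cron on the bastion or the jobs framework) runs it somewhere other than the webservice it checks; the URL, the mail alias, and the local SMTP relay below are all hypothetical placeholders.

```python
# A minimal self-serve canary sketch, not a Toolforge-provided service.
# Assumptions (hypothetical placeholders / environment-dependent):
#   - CANARY_URL is the tool's public endpoint to probe
#   - ALERT_TO is the tool's mail alias
#   - an SMTP relay is reachable on localhost
import smtplib
import urllib.request
from email.message import EmailMessage

CANARY_URL = "https://mytool.toolforge.org/healthz"  # hypothetical endpoint
ALERT_TO = "tools.mytool@example.org"                # hypothetical mail alias


def check():
    """Return a problem description, or None if the endpoint looks healthy."""
    try:
        with urllib.request.urlopen(CANARY_URL, timeout=10) as resp:
            if resp.status != 200:
                return "HTTP %s from %s" % (resp.status, CANARY_URL)
    except Exception as exc:  # DNS failure, timeout, HTTPError for 4xx/5xx...
        return "%s: %s" % (type(exc).__name__, exc)
    return None


def alert(problem):
    # Send the failure report to the tool's mail alias.
    msg = EmailMessage()
    msg["Subject"] = "canary check failed"
    msg["From"] = ALERT_TO
    msg["To"] = ALERT_TO
    msg.set_content(problem)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    problem = check()
    if problem:
        alert(problem)
```

This only covers the external "is the public endpoint up" case discussed at 20:29 and 20:40; internal signals like disk space, NFS health, or per-job state still have no Toolforge-provided monitoring answer, as noted above.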