[09:53:24] thanks chicocvenancio
[15:18:00] !log tools.phabbot Manually run `tools.phabbot@tools-sgebastion-07:~/phabbot$ ./new_wikis_handler.sh` to re-run the new wikis bot
[15:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.phabbot/SAL
[15:18:29] !log tools.phabbot tools.phabbot@tools-sgebastion-07:~$ rm ~/*.err && rm ~/*.out
[15:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.phabbot/SAL
[15:58:45] Amir1, legoktm: codesearch just crashed
[15:58:55] 16:54:04 <+wmcs-alerts> (WidespreadInstanceDown) firing: Widespread instances down in project codesearch - https://prometheus-alerts.wmcloud.org
[15:59:19] O.o
[15:59:53] widespread seems to go off even if only 1 instance
[16:00:01] but it's got issues
[16:01:03] I'll be at my laptop in like an hour
[16:01:39] K
[16:06:59] !log codesearch hard reboot codesearch8 after OOM crash
[16:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codesearch/SAL
[16:08:36] legoktm: ^ I think that fixed it at least for now, console was full of "Out of memory: Kill process 23034 (houndd) score 711 or sacrifice child" or similar
[16:09:34] ty, it's probably time we move to a bigger instance
[16:12:04] all y'alls writing too much code
[16:12:37] happy to see https://wikitech.wikimedia.org/wiki/Nova_Resource:Metricsinfra useful, btw
[16:13:54] if you want I can send the alerts for metricsinfra to something other than -cloud-feed, and if codesearch does prometheus metrics I can also make it alert for problems other than "an instance is down" or "an instance is failing to run puppet"
[16:30:50] majavah: why does 1/1 instances trigger widespread alert
[16:31:45] because it's some percentage threshold, and the rule was designed for projects like tools with many more instances
[16:34:55] majavah: would it be easy to ignore it if instances == 1
[16:37:23] majavah: is there an associated grafana instance or just for alerts?
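Note on the "widespread" question above: the log only says the rule uses some percentage threshold and was designed for projects with many instances, so a single-instance project hits 100% as soon as its one instance goes down. A minimal Python sketch of the suggested "ignore it if instances == 1" guard might look like the following. The function name, the 0.5 threshold, and the `min_instances` parameter are all assumptions for illustration; the real rule is a Prometheus alert and its actual values are not given in the log.

```python
def widespread_instances_down(total: int, down: int,
                              threshold: float = 0.5,
                              min_instances: int = 2) -> bool:
    """Decide whether a 'widespread instances down' alert should fire.

    threshold and min_instances are hypothetical values, not the ones
    used by the real metricsinfra rule.
    """
    # Projects with too few instances never count as "widespread";
    # this also avoids dividing by zero for an empty project.
    if total < min_instances:
        return False
    # Otherwise compare the share of down instances to the threshold.
    return down / total >= threshold
```

With this guard, a project like codesearch with a single instance would still need a plain "instance down" alert, since the widespread rule would no longer cover it.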
[16:38:05] legoktm: it's available in https://grafana-labs.wikimedia.org too
[16:38:54] it's the "metricsinfra prometheus"
[16:45:30] legoktm: if I can read openstack, there's enough quota left for a memory increase
[16:46:22] Don't know if it can be changed live though
[17:31:52] I filed https://github.com/hound-search/hound/issues/410 upstream for prometheus metrics, we can probably add some of our own in though
[17:42:49] majavah: can this be used for Toolforge tools too?
[17:59:25] legoktm: not really, metricsinfra is currently pretty much designed around cloud vps projects and I'm not expecting to change that any time soon
[18:00:19] Hmm, okay
[18:00:55] I've been wondering what if we just set up Prometheus/Grafana in a Toolforge tool itself
[18:02:03] a toolforge tool prometheus cluster would probably be integrated with kubernetes, and I'd rather not mix it on the same prometheus instance that metricsinfra uses for cloud vps instances
[20:04:18] majavah: ok, I added https://libraryupgrader2.wmcloud.org/metrics - how do I go about getting it to alert if libup_runs doesn't increase within 24h?
[20:09:35] * legoktm is not going to try https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Adding_new_projects by myself
[20:43:21] legoktm: I configured prometheus to scrape that, let's look at alerting tomorrow when it has stored some data and I've slept
[20:44:01] https://prometheus.wmcloud.org/cloud/graph?g0.range_input=1h&g0.expr=libup_runs&g0.tab=0
[20:49:06] majavah: sounds good, thanks!
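Note on the libup_runs question above: the log does not show the alert rule that was eventually written. In Prometheus this kind of check is usually built on the `increase()` function over a 24h range, but the exact expression here is unknown. As a rough illustration of the logic only, the Python sketch below checks whether a counter has stalled over a window of scraped samples. The function name and the sample format are assumptions. It treats a drop in the counter as a reset (as Prometheus does for counters) and does not alert when there are too few samples, matching the "wait until it has stored some data" remark.

```python
from typing import Sequence, Tuple

DAY_SECONDS = 24 * 60 * 60


def counter_stalled(samples: Sequence[Tuple[float, float]],
                    now: float,
                    window: float = DAY_SECONDS) -> bool:
    """Return True if the counter did not increase within the window.

    samples: hypothetical (unix_timestamp, counter_value) pairs,
    sorted by timestamp, e.g. scraped values of libup_runs.
    """
    # Keep only the values scraped inside the look-back window.
    recent = [value for ts, value in samples if now - window <= ts <= now]
    # With fewer than two samples there is nothing to compare yet,
    # so do not alert.
    if len(recent) < 2:
        return False
    increase = 0.0
    for prev, cur in zip(recent, recent[1:]):
        # A drop means the counter was reset (e.g. process restart);
        # count the new value as the increase since the reset.
        increase += cur - prev if cur >= prev else cur
    return increase == 0
```

For example, three hourly samples that all read 5 would report a stall, while samples reading 5 and then 6 would not.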