[00:50:29] bd808: isn't Grafana multi-tenant?
[00:52:01] Anyway, thanks for the answers. Yeah, the puppet solution doesn't look very scalable (plus I want to monitor a cronjob, not a webservice).
[00:58:15] I guess a hacky but simple solution would be to make the bot record the timestamp of its last successful run, make a simple webservice to turn that into a health API, and then I can use some freemium uptime monitoring service.
[09:34:18] Are there any known issues related to authentication on horizon.wikimedia.org? I was not able to log in with my wikitech login, which I always use and which still logs me into wikitech.wikimedia.org.
[09:34:20] It gives me the following message:
[09:34:21] "An error occurred authenticating. Please try again later."
[09:35:24] Hmm. Looks to be broken for me too
[09:43:28] same here
[09:52:57] dcaro: ^ any known issues?
[10:08:39] * dcaro looking
[10:10:11] I have login issues too
[10:11:54] created T315980
[10:11:55] T315980: Openstack Horizon login failing for many users - https://phabricator.wikimedia.org/T315980
[10:11:57] to keep track
[10:15:13] I think this might be the cause (not sure): Aug 23 10:10:00 cloudcontrol1005 keystone-wsgi-public[3922898]: 2022-08-23 10:10:00.823 3922898 WARNING keystone.server.flask.application [req-834ddaf2-730a-4a32-a664-38a3c29334a3 - - - - -] check_safelist() missing 1 required positional argument: 'domain_id'
[10:15:20] looking
[10:15:34] dcaro: horizon is the only one failing for me. I can log in to idp fine. You said on the task you can't even get on idp, right?
[10:15:56] I can't log into horizon using the same credentials I use for idp
[10:16:02] Oh right
[10:16:11] That sounds confusing on the task
[10:17:19] sorry, will rephrase
[10:21:44] dhinus: quick review? https://gerrit.wikimedia.org/r/c/operations/puppet/+/825724
[10:21:56] sure
[10:30:13] thanks!
[10:31:51] I'm able to log in now. RhinosF1, dhinus, Reedy: can you try?
[10:32:19] works for me!
[10:32:50] dcaro: looks good to me
[10:33:51] awesome :)
[10:34:16] I'll fiddle around a little to make sure nothing else is broken, but let me know here or on the task if you see anything too
[10:39:12] seems ok, I'll keep an eye on logstash for a bit today
[10:40:50] amal_paul: can you let me know if it works now for you too?
[10:54:42] Yeah, thanks. It works now?
[11:08:05] amal_paul: it should, yes :)
[11:08:10] * dcaro grabbing a bite
[11:40:22] !log paws setup requests for nbserve T315670 e558ee7f619a590838e106c9866e2f8ebae33e58
[11:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[11:40:25] T315670: Request for nbserve pod - https://phabricator.wikimedia.org/T315670
[12:00:42] * dcaro back
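The "record the timestamp of the last successful run, then expose it as a health API" idea from 00:58:15 could look roughly like the following minimal sketch. It assumes a Flask webservice; the file path, route name, and two-hour staleness threshold are all illustrative assumptions, not an existing tool.

```python
# Minimal sketch of the 00:58:15 idea: the cronjob/bot touches a timestamp
# file after each successful run, and this tiny webservice turns that into
# an HTTP health check a freemium uptime monitor can poll.
# The path, route, and threshold below are illustrative assumptions.
import time
from pathlib import Path

from flask import Flask

app = Flask(__name__)
LAST_RUN_FILE = Path("last_success.timestamp")  # written by the cronjob
MAX_AGE_SECONDS = 2 * 60 * 60  # unhealthy if no successful run for 2 hours


@app.route("/healthz")
def healthz():
    try:
        age = time.time() - float(LAST_RUN_FILE.read_text().strip())
    except (OSError, ValueError):
        return "no successful run recorded", 503
    if age > MAX_AGE_SECONDS:
        return f"last success was {age:.0f}s ago", 503
    return "ok", 200
```

The bot side is a one-liner at the end of a successful run, e.g. `LAST_RUN_FILE.write_text(str(time.time()))`; the uptime monitor then only needs to alert on non-200 responses.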
[12:16:04] Not sure I fully get the multi-tenant requirement here. For logs I get it: I would not want volunteer A to see the logs of volunteer B's tool. But if there's some system that checks an HTTP endpoint of my tool, or of my cron job, I would not care if other volunteers (or the whole world) can see it 🤷‍♂ (re @wmtelegram_bot: by multi-tenant I mean a system where there are multiple completely isolated end-us
[15:49:00] @JeanFred: when I'm thinking about multi-tenant alerting, I'm less concerned about the visibility of alerts (which are likely helpful to be public) and more concerned about who can silence or trigger an alert, and whether anyone can sign others up for notifications of an alert. There are potential harassment vectors lurking in those features.
[15:49:59] And if the answer is "ask a small group who have the power to change things", that has scalability issues IMO
[16:41:50] When you say monitoring/alerting, do you mean alert management (based on logs), active polling (is a service working?), or both?
[16:47:11] @MaartenDammers: undefined at this point. I've heard people asking for both. Active polling seems to be more top of mind for many, however.
[16:48:10] Sometimes this feels like an XY problem too. I don't think monitoring is the goal; it's just a tactic towards some goal (self-healing services?)
[16:48:20] Active polling is much easier to set up and imho more useful. You define a state in which a service should be (a webservice returning certain content for a page) and who to alert. Probably also easier to set up multi-tenant
[16:49:17] Yeah. Like what I had with the uploader some time ago. I don't want to monitor it, I just want it to always run, so I just have the grid engine kill it when it's stuck (re @wmtelegram_bot: Sometimes this feels like an XY problem too. I don't think monitoring is the goal, it's just a tactic towards some goal ...)
[16:49:21] For webservices, I think we could get more short- and mid-term value from adding the ability to set up Kubernetes health checks than from building out a "ping all the webservices" service
[16:50:31] I recall that for webservice monitoring I usually had a ping of the host, a check that a server was listening on the port, an SSL check (we all hate expired certificates), and a test page with fixed content that should be returned.
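The active-polling checks described at 16:50:31 (port reachable, certificate not about to expire, a test page with fixed content) can be sketched with nothing but the Python standard library. This is a minimal illustration, not a proposed service; the host, URL, and expected string are placeholder assumptions.

```python
# Minimal sketch of the active-polling checks from 16:50:31: port reachable,
# TLS certificate not close to expiry, and a test page returning fixed
# content. HOST, TEST_URL, and EXPECTED are placeholder assumptions.
import socket
import ssl
import time
import urllib.request

HOST, PORT = "example.toolforge.org", 443
TEST_URL = f"https://{HOST}/healthz"
EXPECTED = "ok"


def check_port() -> bool:
    """Is anything listening on the port at all?"""
    try:
        with socket.create_connection((HOST, PORT), timeout=5):
            return True
    except OSError:
        return False


def days_until_cert_expiry() -> float:
    """How long until the TLS certificate expires?"""
    ctx = ssl.create_default_context()
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400


def check_content() -> bool:
    """Does the test page return the fixed content we expect?"""
    with urllib.request.urlopen(TEST_URL, timeout=10) as resp:
        return resp.status == 200 and EXPECTED in resp.read().decode()


if __name__ == "__main__":
    print("port open:", check_port())
    print("cert days left:", days_until_cert_expiry())
    print("content ok:", check_content())
```

A multi-tenant poller would essentially run checks like these on a schedule per tool and route failures to that tool's alert recipients, which is where the "who can silence or sign others up" questions from earlier come back in.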
[17:51:06] !log tools.bridgebot Added explicit IRC message splitting configuration (T315951)
[17:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[18:01:35] (testing, ignore) This is a long message sent from the Telegram side of the multi-service matterbridge configuration used in this channel. A message of more than four hundred characters is needed to test the newly applied configuration to explicitly split messages which are over four hundred characters long when they are emitted by the IRC bridge end point. We really do not want
[18:01:36] to encourage this kind of long message, but we also would like to avoid information asymmetry caused by some messages being fully visible only on one of many interfaces (like Telegram but not IRC). See https://phabricator.wikimedia.org/T315951 for more details on the use case being enabled and the investigation of the upstream software's configuration and code.
[18:02:07] quiddity: ^ it seems to work with explicit config. :)
[18:03:45] Huzzah! Much thanks!! I'm also happy that it appears to be easier than I'd expected!
[20:43:20] We could have the config in Git, like the Jenkins configuration?
[20:43:21] (Version control is probably the only sane way to manage such resources anyhow.) (re @wmtelegram_bot: And if the answer is "ask a small group who have the power to change things" that has scalability issues IMO)
[20:44:10] As for harassment through signing others up to notifications - I never thought of that, I must say. Now I have the idea of messing with people by adding them as maintainer of a Toolforge tool and messing up a crontab so they get a billion emails :-p (re @wmtelegram_bot: @JeanFred: when I'm thinking about multi-tenant alerting I'm less concerned about visibility of alerts (which likely are...)
[20:45:30] @JeanFred: how config could/should work would likely depend on the upstream project chosen. And since we don't have anyone working on that evaluation yet, the answer is "maybe".
[20:46:15] notification spam harassment is a thing, yup. humans are horrible and will hit each other with any sticks they find laying around.
[20:54:15] Notification spam wouldn't even shock me
[20:56:41] I seem to have some issues reading that. According to the Python client code (because the docs are not really verbose), `config.load_kube_config()` should load it, but I also tried `config.load_kube_config(config_file="/data/project//.kube/config")` with no luck. I'm following this example code: https://github.com/kubernetes-client/python/blob/ada96faca164f5d5c018fb21b8ef2ecafbdf5e43/examples/pod_config_list.py (re @wmtelegram_bot: That URL can be found in the $HOME/.kube/config file for any tool)
[20:57:29] I have confirmed the file is visible from the code and has the correct contents
[20:58:25] The eventual error I get is:
[20:58:25] kubernetes.client.exceptions.ApiException: (403)
[20:58:27] Reason: Forbidden
[20:58:28] HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'e75fe91c-f60d-4367-8748-7922aa0b939a', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'b5e6d0ac-fd38-4495-92d8-4f963bf771a3', 'Date': 'Tue, 23 Aug 2022 20:52:43 GMT', 'Content-Length': '245'})
[20:58:30] HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \"patrocle\" cannot list resource \"pods\" in API group \"\" at the cluster scope","reason":"Forbidden","details":{"kind":"pods"},"code":403}
[20:59:02] But I suspect the config is wrong, as `kubectl get po` works from the tool itself
[21:01:41] "at the cluster scope" is the key piece of information here. your tool only has access to its own namespace, not the entire cluster
[21:01:47] @strainu: my first guess is that you are not scoping your query to the "tool-{tool name here}" namespace that matches the tool's credentials. `kubectl` picks up this default namespace name from the $HOME/.kube/config file.
[21:02:38] I believe you're right, I'm just not sure how to scope it down
[21:03:32] the namespace in the config file is correct, `tool-patrocle`
[21:04:11] but the API is `list_pod_for_all_namespaces` :)
[21:06:20] I think that `list_namespaced_pod(namespace)` is the endpoint you need.
[21:10:36] well, progress, but not what I had expected:
[21:10:37] ```
[21:10:40] Listing pods with their IPs:
[21:10:42] 2
[21:10:43] 192.168.25.132 tool-patrocle patrocle-mcwnw
[21:10:45] 192.168.100.37 tool-patrocle robot-status-scnhw```
[21:11:21] I believe the first one is the pod where the console is, right?
[21:11:33] and the other is the pod where I run my test code
[21:12:36] so probably the server is wrong, because `print(configuration.Configuration().host)` gives me http://localhost
[21:22:55] @Strainu: `kubectl describe po patrocle-mcwnw` shows me that one is a job you launched with toolforge-jobs-framework. That can be seen in the pod labels.
[21:39:39] silly me, I obviously wanted the jobs, not just the running containers. I managed to get them, thanks for all your help @bd808
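Putting the thread above together, a minimal sketch of the working approach: load the tool's kubeconfig, take the default namespace from the active context (the same place `kubectl` gets it), and call `list_namespaced_pod` instead of `list_pod_for_all_namespaces`, which needs cluster-wide permissions a tool account does not have. The `tool-patrocle` namespace is from this conversation; everything else is a generic illustration.

```python
# Sketch of the fix discussed above, using the official kubernetes Python
# client. A Toolforge tool can only list pods in its own "tool-<name>"
# namespace, so the query must be scoped there.
from kubernetes import client, config

# Loads $HOME/.kube/config by default, like kubectl does.
config.load_kube_config()

# Pick up the tool's default namespace from the active kubeconfig context
# (e.g. "tool-patrocle") instead of hard-coding it.
_, active_context = config.list_kube_config_contexts()
namespace = active_context["context"]["namespace"]

# list_namespaced_pod stays inside the tool's namespace; the earlier 403
# came from list_pod_for_all_namespaces querying at the cluster scope.
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(namespace=namespace)
print("Listing pods with their IPs:")
for pod in pods.items:
    print(pod.status.pod_ip, pod.metadata.namespace, pod.metadata.name)
```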
[21:48:52] I guess my point was, monitoring notifications for tool-A could go to the listed maintainers of tool-A - it's already a notification channel anyhow.
[21:48:54] I don't recall for sure, but adding someone as a tool maintainer is instantaneous and does not require that person's agreement? (re @wmtelegram_bot: notification spam harassment is a thing, yup. humans are horrible and will hit each other with any sticks they find layi...)
[22:18:44] @JeanFred: [[WP:BEANS]]! (but yes)