[09:20:57] "got rid of the grid" sticker or tshirt idea someone? [09:24:14] xd [09:39:15] bd808: thanks for the email. Maybe you can turn that into a blog post? [09:49:41] mmm I can't find the right quota item to bump for T360162 [09:49:41] T360162: Increase Object Storage quota for QRank - https://phabricator.wikimedia.org/T360162 [09:53:11] they're probably managed via radosgw-admin and not openstack-cli [09:53:43] arturo: we have the #wikimedia-cloud-daily channel but you don't seem to be in it [09:54:24] (re nothing, it's just that I realized this as I mentioned you there) [09:54:35] blancadesal: ok, just joined [09:54:51] 👍 [10:01:22] taavi: thanks, I think it is true [10:12:32] blancadesal: I'm interested in writing about the grid engine [10:13:47] taavi: thank you! [10:14:42] could you have a draft ready by March 25th? [10:16:57] sure. is there any style / length / etc guidance? [10:18:14] not really, but you can take a look at past editions of the newsletter to get an idea: https://office.wikimedia.org/wiki/Technology/SRE/Newsletter [10:25:21] ok! I'll probably ask you and others for feedback at some point [10:26:49] ok, looking forward to reading :) [10:27:44] taavi: you may link / reuse some of https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/ [10:28:05] wait, was the grid shutdown exactly 2 years after that blog post? heh what a coincidence [10:28:36] regarding the swift/radosgw quotas, I just created this section here: [10:28:37] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Projects_lifecycle#swift_/_S3_/_radosgw_/_object_storage [10:30:21] blancadesal: btw, since the SRE newsletter is in officewiki, I hope/assume we can publish the post in a public place too :-) [10:30:30] arturo: oh wow. that was definitely not planned [10:35:49] attached a few implementation patches to T306039 [10:35:50] T306039: Decision request - Toolforge external infrastructure domain usage - https://phabricator.wikimedia.org/T306039 [10:37:19] taavi: yes, the SRE newsletter is internal. how about also deriving a blog post from it and publish it to the tech blog, like arturo suggested? [10:37:44] I was thinking of something like that, yes [10:37:49] anyhow, I will get to writing :-P [10:38:24] maybe the shutdown date is an easter egg from balloons? :)) [10:40:50] still on the theme of gridengine write-ups, I'm forwarding this from Birgit on slack: [10:40:50] "On a related note, Selena wanted to highlight the work in her weekly Friday email as well (goes out some time tonight). She was curious whether you'd have a short story (3-4 lines) that highlights helping a volunteer understand the situation and come along, or something similar. Everyone knows some volunteers find these kinds of changes difficult. A story about connecting, resolving misunderstandings, [10:40:50] collaborating, so that there is a good outcome. I think there are multiple examples for this over the course of the migration. Do you have an idea for that & could write it up? - Also, no worries if not or if it's too much effort right now. It was just an idea :-)" [10:42:01] I think creating individual phabricator tickets for each tool, then following up closely with each maintainer, that was a remarkable effort, specially from dcaro, taavi and others [11:07:31] lol at https://www.mediawiki.org/wiki/Toolserver:History "After the conference, Mark takes the server home and uses it for a coffee table." 
[11:15:26] xd
[11:18:47] "cattle not pets" is definitely less interesting when it comes to naming servers
[11:19:26] bulbasaur-1, procion-10, just add a number after xd
[11:20:20] or name docker-style, adjective-name kind of thing `gloomy-fox-13`
[13:45:51] I don't suppose anyone has a puppet patch that they'd like to merge? I have a thing I want to test right after
[13:50:14] not me!
[13:50:16] * arturo food time
[13:52:35] oh, nevermind, I can simulate it
[13:54:08] taavi: there's a new, cumbersome version of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009798 for your consideration
[14:23:48] andrewbogott: looks good, assuming you have tested it
[14:23:59] if you need a patch to merge, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1011137
[14:24:05] I have tested it a little bit
[14:26:21] tested it on itself :)
[14:44:17] does anyone happen to have a copy of the now-404ing spreadsheet mentioned in https://lists.wikimedia.org/pipermail/labs-l/2015-August/003955.html?
[14:52:25] I'm sure I don't
[14:53:57] not me!
[14:55:25] If it was on his wmf gapps... it might still exist
[15:01:04] taavi: hmmm... sometimes when folks get offboarded the contents of their personal gdrive are given to their manager at the time they leave. For Yuvi that would have been Chase. Who was themselves managed by John when leaving, who was managed by ... Grant? when leaving, who was managed by ?nobody?. Hmmmm.
[15:02:00] so that's maybe an OIT question
[15:02:19] Unless it was his personal gapps and he has actually deleted it :D
[15:04:15] you could also ask yuvi himself :)
[15:11:43] komla: (and any of the rest of y'all) I am going to respond to Magnus' nice message on-list later today. If you had people you were planning to thank had you sent the final message, I would be very happy to name them in my response.
[15:13:59] which list did magnus write to?
[15:16:38] https://wikitech.wikimedia.org/wiki/User:Majavah/History_of_Toolforge
[15:44:35] andrewbogott: wikitech-l I believe
[15:45:34] taavi: oh! thanks for starting a page like that. :)
[16:17:00] arturo: a fun puzzle for you if you'd like one: why is outbound network traffic failing on wikidata-analytics-1.wmdeanalytics.eqiad1.wikimedia.cloud even though I can ssh in?
[16:17:42] andrewbogott: is that the instance with a docker subnet intercepting traffic to where our DNS resolver is?
[16:18:34] taavi: good question! It is certainly running docker
[16:18:45] And yes, I noticed this because dns resolving is failing there
[16:18:51] so maybe I should just give this one up for lost and move on
[16:18:55] br-f6fcafc9a433 DOWN 172.20.0.1/16
[16:19:12] so yes. there's a task about it somewhere where WMDE was wondering whether they even need the instance anymore
[16:19:25] great, I will plan to ignore!
[16:19:34] * andrewbogott implements the plan
[16:22:36] taavi is too fast, I didn't even have time to read the IRC ping before the puzzle was already solved
[16:28:29] yeah, I guess that VM is famous :(
[16:29:42] I need to run some daytime errands; back later
[16:59:30] taavi: can I get the current tool account name from somewhere in toolforge_weld?
[16:59:42] more specifically, from the ToolforgeClient class?
[17:00:46] well, x/y problem, the code is here:
[17:00:47] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/17/diffs#d23fd07b340e14f73083127aaf876ad14c4bff94_721_758
[17:01:14] meh, nevermind, I'll just pass it as an argument
[17:04:45] * arturo offline
[17:42:49] * bd808 lunch with friends
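For context on the wikidata-analytics-1 puzzle above: the docker bridge br-f6fcafc9a433 claimed 172.20.0.1/16, the same range the Cloud VPS DNS resolver lives in, so resolver traffic got routed into the down bridge instead of leaving the VM. Below is a minimal diagnostic sketch of that overlap check, using only Python's stdlib and assuming a Linux guest whose resolvers are listed in /etc/resolv.conf; the bridge subnet is taken from the incident above.

```python
# Minimal sketch: does a local docker bridge subnet shadow the DNS resolver?
# Assumes a Linux host; reads resolvers from /etc/resolv.conf and compares
# them against a bridge subnet like br-f6fcafc9a433's 172.20.0.1/16 above.
import ipaddress
import re

def resolvers_from_resolv_conf(path="/etc/resolv.conf"):
    with open(path) as f:
        for line in f:
            m = re.match(r"\s*nameserver\s+(\S+)", line)
            if m:
                yield ipaddress.ip_address(m.group(1))

# the subnet the docker bridge claimed on wikidata-analytics-1
bridge_subnet = ipaddress.ip_network("172.20.0.0/16")

for resolver in resolvers_from_resolv_conf():
    if resolver in bridge_subnet:  # False for mismatched IP versions
        print(f"{resolver} is shadowed by {bridge_subnet}: DNS will break")
```

The usual remedy (not shown) is to move docker's address pools out of any range the host actually needs to reach, e.g. via `default-address-pools` in docker's daemon configuration.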
[18:23:55] Hi! I'm trying to scrape the Quarry metrics exposed under https://quarry.wmcloud.org/metrics, which seem to be missing from Prometheus (T360220). Could I be added to the quarry horizon project, to enable ingress traffic from the prometheus hosts? Thanks
[18:23:56] T360220: Scrape prometheus metrics from Quarry - https://phabricator.wikimedia.org/T360220
[18:28:22] brouberol: what's the port+IP? I can add the rule.
[18:31:56] Assuming quarry-web will see the actual host IP, and not a NAT-ed IP, that'd be 10.64.16.62, port 443
[18:32:01] thank you!
[18:33:07] oh, that's a prod IP? I don't think that will work at all -- totally different networks.
[18:34:31] It sounds like on that task you maybe decided not to do this at all, but if you actually want contact between prod prometheus and a VM the route will not be obvious; best to open a task and see if it's possible.
[18:36:12] gotcha, that might explain why we expose prom metrics but don't have them in our prometheus data in the first place
[18:36:36] brouberol: just to confirm: did you see my comment in the task too?
[18:36:44] as far as I know, monitoring of things on cloud-vps is handled via totally different prometheus instances.
[18:37:21] and these instances' data is not aggregated with the other ones via thanos?
[18:38:01] 'the other ones' meaning prod data? then no.
[18:38:15] yes, that was what I meant. Thanks, I didn't know that
[18:38:23] taavi: I did! I admit that the whole week of accumulated jetlag is making me a bit slow atm :D
[18:38:52] I assumed you meant that it _could_ be done, but via openstack SGs
[18:40:23] my comment in the task, not in the patch
[18:40:42] i would start by replacing my hack at https://github.com/toolforge/quarry/blob/01d3c4a36d6815cdf85bb07a1d9bd4307133fee4/quarry/web/metrics.py#L54 to instead have two endpoints (one for instance-level data and one for app-level). I'm 99% sure that the simple concat in my code not working with the prometheus internal binary format is what's causing the 'unexpected data' error
[18:41:08] ah, that I didn't see
[18:43:01] brouberol: I don't object to adding you to quarry btw, just trying to avoid you going down a blind alley :)
[18:43:13] andrewbogott: appreciated!
[18:43:53] taavi: gotcha, I can have a look. Thanks for the pointers
[18:51:43] it seems that the /metrics output can be parsed by `promtool check metrics` though, which would indicate that the output format is correct?
[18:52:05] (Anyway, it's not urgent by any length, especially if y'all are at the summit at the moment)
[18:52:12] we're not :(
[18:52:18] how are you feeding the data to `promtool check metrics`?
[18:52:45] Is the desire to have the metrics stored in prometheus, though in one not inside of quarry? (there isn't a quarry prometheus currently that I'm aware of)
[18:54:16] taavi: I was using https://o11y.tools/metricslint/ that runs locally, and I'm downloading the prometheus CLI to replicate, but the SF office wifi is very slow, so this is taking a while
[18:55:13] Rook: I was pairing with Sam Smith, and the need would be to have a Grafana panel with these metrics, to have a better understanding of how Quarry is being used
[18:55:48] like https://grafana.wmcloud.org/d/eV0M3UyVk/paws-usage-statistics but with quarry data?
[18:56:51] Rook: the prometheus server in the metricsinfra project is already trying to scrape the endpoint, although there is a bug in the application code itself that's causing it to emit the metrics in an invalid format.
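One way to reproduce that scrape failure locally, ahead of the curl exchange below: fetch the endpoint with the openmetrics content type and feed the payload through a strict openmetrics parser, which (unlike `promtool check metrics` on the plain text format) rejects anything following a `# EOF` marker. A hedged sketch; it assumes the `requests` and `prometheus_client` packages are installed:

```python
# Hedged reproduction sketch: fetch the endpoint the way a newer prometheus
# would (openmetrics content type) and run the payload through the strict
# openmetrics parser, which rejects content after a "# EOF" marker.
import requests
from prometheus_client.openmetrics.parser import text_string_to_metric_families

resp = requests.get(
    "https://quarry.wmcloud.org/metrics",
    headers={"Accept": "application/openmetrics-text"},
)
try:
    for family in text_string_to_metric_families(resp.text):
        print(family.name)
except ValueError as err:
    # a "# EOF" left mid-stream by concatenating two expositions lands here
    print("invalid openmetrics:", err)
```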
[18:57:12] taavi: I thought we didn't like to do that? Have metricsinfra scrape projects?
[18:57:31] Rook: that was the idea, yes
[18:58:41] Rook: we don't want to do that when the metric count or cardinality is high enough, which it would be if you were scraping for example metrics about the kubernetes cluster itself. but in this case the app data we want to scrape is rather tiny, so it's fine. see https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Metricsinfra_Prometheus
[18:59:16] Oh alrighty, that might be the better way then
[18:59:30] taavi: curl -s https://quarry.wmcloud.org/metrics | promtool check metrics -> exit code 0
[18:59:50] brouberol: newer prometheus versions use something called the openmetrics format if both the client and server support it. if you do `curl -H "Accept: application/openmetrics-text" https://quarry.wmcloud.org/metrics` you can see it has an explicit EOF marker just before the quarry_query_runs_per_status metric, which prometheus does not like
[18:59:55] If that doesn't work out, quarry should be going into k8s soon (or vanishing from existence), at which point the paws prometheus code could largely be copied into quarry
[19:00:47] ah! Indeed, and when I curl with this header and pipe that to `promtool check metrics` I see
[19:00:47] flask_http_request_exceptions_total no help text
[19:00:47] flask_http_request_total no help text
[19:01:29] and indeed, I see the #EOF
[19:04:59] alright folks, that has given me a lot of food for thought. I'm probably not going to focus on that today, but I appreciate your help and guidance
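Below is a minimal sketch of the two-endpoint split taavi suggests above, assuming a Flask app instrumented with `prometheus_client`; the route paths, the metric, and the wiring are illustrative, not Quarry's actual code. Keeping the app-level metrics in their own `CollectorRegistry` means each endpoint serves a single self-contained exposition, so no string concatenation can leave a `# EOF` marker mid-stream.

```python
# Minimal sketch of the suggested fix: separate registries and endpoints so
# each response is one self-contained exposition. Names are illustrative.
from flask import Flask, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    Counter,
    generate_latest,
)

app = Flask(__name__)

# app-level metrics get their own registry instead of being concatenated
# onto the default registry's output
app_registry = CollectorRegistry()
query_runs = Counter(
    "quarry_query_runs_per_status",
    "Query runs by resulting status",
    ["status"],
    registry=app_registry,
)

@app.route("/metrics")
def instance_metrics():
    # default registry: the process/platform collectors that
    # prometheus_client registers on import
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route("/metrics/app")
def app_metrics():
    # app-level data only; no concat of two expositions, so "# EOF"
    # can never end up mid-stream
    return Response(generate_latest(app_registry), mimetype=CONTENT_TYPE_LATEST)
```

The metricsinfra scrape config would then list both paths as separate targets; registering the app metrics into the single default registry would also avoid the concat, but the two-endpoint split is what is proposed above.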