[07:39:07] very good re: the video 10/10
[08:12:43] paravoid: you just gave me a reason to laugh
[09:48:27] loool @ that youtube video <3
[11:10:02] cross-posting from -operations: can someone remove mw2383 from scap? It's depooled from user traffic, but it's still in scap, even though it always times out (so once it's ready for pooling, you'd have to do a scap pull anyway).
[11:37:32] there are uncommitted changes in the private repo
[11:38:03] I need to add a cert but don't want to touch those hieradata changes
[11:42:07] mutante, I was working on a private patch
[11:42:56] apparently, hitting control-c while editing the commit message commits the changes
[11:44:00] jynus: ah, just saw the email. if it's ok with you that it is committed..it's ok with me.. I had added my new cert with git add
[11:45:35] if you need to change the hieradata, go ahead and make another change, i got what I needed, and thank you
[13:54:55] hi all, earlier in the week i did a one-on-one session with topranks to go through the basics of adding a new module to the puppet control repo. I think the flow went quite well and as i forgot to record it i thought i would try and get as much of the discussion down as possible and try to get it in more of a flow of how we build up the pieces of code. To that end i have created a tutorial github repo
[13:55:01] where each commit introduces a new piece of code or ...
[13:55:03] ... puppet concept until at the end we have a pretty full working module, profile and role, with accompanying rspec tests and hiera data. I have added it to git so that users can just walk the commit tree and view the various diffs at each stage, however git may not be the best tool for this (for instance rewriting the commit history every time we want to make a change to one of the earlier
[13:55:09] "lessons" may turn out to be a PITA). anyway for now ...
[13:55:12] ... consider it a PoC, if it's useful we can move it somewhere a bit easier to edit.
[13:55:15] you can view the commits here https://github.com/b4ldr/wmf-puppet-tutorial/commits
[13:56:12] each commit is heavily commented to try and explain all the concepts well as they are introduced, further the later commits talk about some more advanced features such as lookup_options so this could even be useful to more than just newcomers
[13:56:52] jbond: ohhh this looks so good
[13:56:55] what a great idea
[13:57:01] be warned i did a lot of this work in the early hours of this morning so there are likely some typos :)
[13:57:12] hey folks, I added some comments in #operations, there seems to be an issue with api-appservers in codfw
[13:57:28] thanks rzl
[13:57:31] oh that's only codfw, it's no big dea-- oh right
[13:57:33] having a look
[13:57:34] latency changed from this morning around 7AM UTC, and now we are getting some busy appserver alerts
[13:57:54] afaics it doesn't seem to be traffic-related (I mean increase in traffic)
[13:58:13] (topranks: would be keen to get your feedback on this and ensure it captures the good bits of what we went over)
[13:58:29] * jbond goes back into the background
[14:00:04] side note - I don't see a latency specific alert but we should have one
[14:00:39] 95th percentile is around 2s right now, and avg is doubled/tripled
[14:00:45] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-24h&orgId=1&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200
[14:22:45] jbond: amazing stuff, just glancing through it seems to be a really smart approach, certainly touches on all the main points, but I will go through in more detail and feed back
[14:22:48] nice one!
[14:22:58] now go back to your holiday!!
[14:23:24] git log -p --reverse is a simple way to walk through it ;)
[14:27:02] topranks: great thanks and will do ;)
[14:28:37] godog: yt?
[14:28:47] got a q about https://gerrit.wikimedia.org/r/c/operations/puppet/+/556631 and the prometheus query there
[14:29:06] i'm updating eventgate to use prometheus directly, ditching the statsd sidecar bridge
[14:29:10] so the metric has changed
[14:29:18] express_router_request_duration_seconds_bucket
[14:29:28] but, i haven't been able to make that query format work
[14:29:33] i'm probably doing something wrong though
[14:32:06] ottomata: hey, what have you tried so far ?
[14:32:51] so, the new metrics are live for eventgate-analytics in eqiad k8s staging
[14:33:04] trying something like method:express_router_request_duration_seconds:90pct5m{service="eventgate-analytics"}
[14:33:12] assuming that 'service_method' was the previous label
[14:33:17] and 'method' is the new label
[14:33:31] am using thanos
[14:33:37] but maybe i should target the eqiad staging prometheus
[14:34:22] ottomata: got it, by convention if the metric name has columns in it then it comes from a recording rule, likely we'll need new recording rules for the express router metrics
[14:34:30] ottomata: i.e. modules/profile/files/prometheus/rules_k8s.yml
[14:34:46] s/columns/colons/
[14:35:11] oh wow
[14:35:12] or "columns spelt with a :"
[14:35:13] heheh indeed
[14:35:25] this is a custom metric -> uhhhh some other metric transformer?
[14:35:53] no wonder i couldn't make it work. I just edited the new dashboard to use the queries directly
[14:35:59] could we do that here instead of using a transform?
[14:36:16] histogram_quantile(0.5, sum by(method, le, service) (rate(service_runner_request_duration_seconds_bucket[5m])))
[14:37:08] what are these for? just shortcuts for convenience? or more like precomputed queries for performance reasons?
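(Editor's aside on the histogram_quantile() query above: a minimal Python sketch of how a quantile can be estimated from cumulative `le` buckets by linear interpolation. This is a simplified illustration, not Prometheus's actual code. The function name, the bucket layout as (upper_bound, cumulative_count) pairs, and the sample numbers are all assumptions made for the example.)

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes non-negative observations and a +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("the highest bucket must have an upper bound of +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if upper == math.inf:
        # The quantile falls in the open-ended bucket: report the highest
        # finite bound instead of extrapolating.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, previous = buckets[index - 1] if index > 0 else (0.0, 0)
    if count == previous:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - previous) / (count - previous)


# Hypothetical request-duration buckets (seconds, cumulative counts).
sample = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(histogram_quantile(0.5, sample))
```

(The estimate is only as fine-grained as the bucket boundaries, and the real query runs this over rate() results summed by label for every evaluation, which is part of why precomputing it in a recording rule can pay off.)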
[14:37:34] ottomata: the latter, precomputed for performance reasons, you can try the non-precomputed and see how fast it is though
[14:38:37] https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate-new-prometheus-wip?orgId=1&refresh=1m&from=1626269911501&to=1626273511502&var-service=eventgate-analytics&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[14:38:51] seems ok here, but there are only 2 pods in this case
[14:38:59] not sure if that matters
[14:39:23] i guess i'll add a rule to that file for this new metric
[14:39:26] just a metric name change really
[14:39:57] yeah that works too, as more applications move to the express router native prometheus I'm reasonably sure the dashboard will get slower
[14:40:06] feel free to send the review my way
[14:41:45] ok
[14:44:31] btullis: FYI more prometheus work I found ^^^
[14:44:54] apparently there are a few 'precomputed prometheus queries'
[14:45:00] declared in puppet
[14:45:07] and some dashboards and alerts use them
[14:45:15] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/704548
[14:46:22] LGTM!
[14:47:19] godog: anything to do to get those picked up?
[14:47:59] ottomata: will be auto deployed at next puppet run
[14:48:09] ok
[14:56:03] godog: what prometheus nodes scrape eqiad k8s staging?
[14:59:06] ottomata: prometheus100[34]
[14:59:39] to access UI directly...
[14:59:42] http://localhost:9900/k8s-staging/graph
[14:59:42] ?
[15:00:30] godog: ^ ?
[15:00:40] ottomata: different port, 9900 is ops
[15:00:49] sorry busy atm
[15:00:55] ah ty will find
[15:16:25] !bash (during dc switchover month) rzl> oh that's only codfw, it's no big dea-- oh right
[15:16:25] Amir1: Stored quip at https://bash.toolforge.org/quip/Ac2XpXoB1jz_IcWuKYwJ
[15:30:02] it works!
thanks godog :)
[15:31:06] ottomata: \o/ \o/ awesomesauce, you're welcome
[15:32:09] jbond: very nice re: puppet repo, will take a look too
[15:42:20] topranks: for when you have time https://gerrit.wikimedia.org/r/c/operations/puppet/+/704559/
[15:42:42] just please make sure puppet has been run on netmon* since yesterday so the crons are absente
[15:42:45] *absented
[15:50:52] Anyone know how to, in a commit message, have a `Hosts:` line that's longer than a single line, without screwing up the parsing? (not sure what the terminology is for what we call `Hosts:`, `Bug:`, `Change-id:` etc)
[15:51:15] (A simple newline breaks the parsing and not putting a newline means that the linter complains about line length, naturally)
[15:51:38] not aware of any workaround
[15:51:52] our commit message linter does more harm than good anyway...
[15:54:00] ryankemper: IIRC you can use the NodeSet syntax used by cumin for example, if that helps to shorten the line ;)
[15:54:43] i.e. mw10[22-25].eqiad.wmnet for example
[15:56:45] volans: thanks that might be barely enough to scrape by...I've been taking the "one of every type" approach (1 eqiad elastic, 1 codfw elastic, 1 relforge, etc) but I want both of `cloudelastic100[1,6]` so that might let me get the character count down low enough
[15:57:09] * volans fingers crossed
[15:57:38] I didn't check the code, but wondering if you put multiple Hosts: lines what happens, first/last wins?
[15:59:02] or we simply remove the silly line length check and the various other (entirely undocumented to any new user!) restrictions it enforces
[15:59:38] basically one can commit arbitrary stuff as the commit message, but beware of a blank line at the wrong (and entirely undocumented) place...
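(Editor's aside on the NodeSet shorthand mentioned above: a minimal Python sketch of how patterns such as mw10[22-25].eqiad.wmnet or cloudelastic100[1,6] expand into individual hostnames. This is not the parser used by cumin or ClusterShell. `expand_hosts` is a hypothetical helper that only handles a single bracket group containing comma-separated numbers and ranges.)

```python
import re


def expand_hosts(pattern):
    """Expand one [a-b,c] bracket group into a list of hostnames.

    Hypothetical helper for illustration; the real NodeSet syntax supports
    much more (multiple groups, steps, set operations).
    """
    match = re.search(r"\[([^\]]+)\]", pattern)
    if match is None:
        return [pattern]
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    hosts = []
    for part in match.group(1).split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            for number in range(int(start), int(end) + 1):
                # Keep the zero padding of the range start, e.g. 01-03.
                hosts.append(f"{prefix}{str(number).zfill(len(start))}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


print(expand_hosts("mw10[22-25].eqiad.wmnet"))
print(expand_hosts("cloudelastic100[1,6]"))
```

(The point for the commit message is simply that one short pattern stands in for several hostnames, which keeps the `Hosts:` line under the linter's length limit.)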
[16:02:36] ryankemper: you can have multiple Hosts: lines, fwiw
[16:03:11] the nodeset approach has the unfortunate downside that you can only specify a single 'set'
[16:03:12] kormat: :O
[16:03:23] I'm with moritzm on lifting the lines restrictions
[16:03:48] ryankemper: e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/701335
[16:03:50] that should make things easier! and also adding my support for moritzm's pov as well :P
[16:04:21] i also have a CR out for review to allow you to add comments to those lines
[16:04:46] (https://gerrit.wikimedia.org/r/c/integration/config/+/701370)
[16:06:23] "Nobody needs comments! They will only be abused!" The JSON guy
[16:07:02] :P
[16:07:25] Sometimes I feel like she should have to deal with entirely comment-less config files and source code for a year.
[16:38:38] RE the issue earlier today where elastic* hosts were experiencing too much IO load, there was an issue with the systemd timer we'd set up where the timer was a one-and-done instead of firing every 30 minutes as desired. this fixes that if anyone feels comfortable reviewing: https://gerrit.wikimedia.org/r/c/operations/puppet/+/704567
[16:39:46] In typical fashion when working with systemd timers, the change itself is 4 bytes and the associated commit message is 1000 bytes :D
[17:16:54] godog: The latest revision of this video tells me "File is corrupt" and I'm unable to play it (Firefox). https://wikitech.wikimedia.org/wiki/File:Pontoon_demo_graphite_buster.ogv
[17:17:08] the original version linked in the history works for me though
[18:31:30] Pchelolo: thoughts on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/704588/ ?
[18:31:37] is there a reason not to use num_workers: 0 ?
[18:33:33] akosiaris: had some strong opinions on that. Basically, if a worker fails it's much easier to restart the worker than restart a pod.
Plus service runner has some fancy features like heartbeats, automatically killing a worker if memory limits are exceeded - none of the features work with num_workers: 0
[18:33:48] cause we don't want to kill ourselves in the master process
[18:34:00] hmm
[18:45:53] ok bad workaround then
[18:45:55] thanks
[19:52:23] cwhite: if you are still around and got a moment, would appreciate some help debugging this
[19:52:51] it looks like somehow the master process is getting two worker messages with the same requestId
[20:43:21] ok no, i got it!
[20:43:36] cwhite: it was the multiple requires of prom-client
[20:43:41] that is pretty bad on prom-client's part
[20:43:43] too much global stuff!