[08:35:46] _joe_: the next awesome patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/725261
[09:15:34] hi, so I sometimes have an alarm about "contint2001.mgmt/SSH" for which I filed a task back in May. I am wondering whom / which team I should poke about it to get it investigated and addressed? It looks like some network is flapping https://phabricator.wikimedia.org/T283582
[09:15:49] it is the management network so not the end of the world, but those alarms are a bit annoying ;)
[09:19:45] <_joe_> I find the daily alerts about the zuul queue being full more annoying, but ymmv
[09:20:46] <_joe_> but in this case you might want to ask top.ranks or X.ionox for some attention
[09:21:07] <_joe_> oh apparently not, those are unmanaged switches
[09:21:15] <_joe_> ok, I wouldn't really sweat it then
[09:27:33] I will add them ;)
[09:28:43] <_joe_> hashar: as arzhel already said on the task, those are unmanaged switches so not much they can do
[09:29:07] <_joe_> dcops is a better option
[09:29:26] ah, so they would have to drop the unmanaged switch possibly
[09:29:31] <_joe_> https://phabricator.wikimedia.org/T283582#7111164
[09:30:03] <_joe_> the point is we have no visibility remotely on what might not work
[09:31:54] I did some changes to the task, we will see whether they get noticed :-]
[09:32:09] the zuul queue alarm, let me check it and tweak it now
[09:32:13] it is indeed spammy
[09:33:49] PSA: if you have zuul queue metrics in prometheus, consider also migrating to alerts.git / alertmanager, so you can self-manage among other things
[09:36:03] zuul metrics in this case, but really any metrics in general
[09:38:03] can we get Icinga to check an alert that is defined in a Grafana dashboard?
[09:38:54] only alertmanager, not icinga, but yes
[09:39:45] I personally prefer alerts revision-controlled in git, but we support grafana dashboards too
[09:39:49] the dashboard at https://grafana-rw.wikimedia.org/d/000000322/zuul-gearman?tab=alert&editPanel=10&orgId=1 has an alert threshold with a notification sent to `cxserver`
[09:39:55] https://wikitech.wikimedia.org/wiki/Alertmanager#Grafana_alerts
[09:40:19] but the notification comes from a monitoring::graphite_threshold defined in Puppet
[09:40:25] also this, if your team hasn't been onboarded to alertmanager
[09:40:26] https://wikitech.wikimedia.org/wiki/Alertmanager#I'm_part_of_a_new_team_that_needs_onboarding_to_Alertmanager,_what_do_I_need_to_do?
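For illustration, here is a minimal sketch of what such a self-managed rule in alerts.git could look like, checked locally with promtool (which ships with Prometheus). The file name, the metric name zuul_gearman_functions_waiting, the threshold, and the label values are all assumptions made up for this example; at this point in the conversation the Zuul metrics are not yet exported to Prometheus (see T233089 below).

# Hypothetical alerts.git-style rule file; metric name, threshold and labels
# are illustrative assumptions, not the actual names in use.
cat > zuul_gearman.yaml <<'EOF'
groups:
  - name: zuul-gearman
    rules:
      - alert: ZuulGearmanQueueBacklog
        # zuul_gearman_functions_waiting is a hypothetical metric name
        expr: zuul_gearman_functions_waiting > 200
        for: 15m
        labels:
          team: releng
          severity: warning
        annotations:
          summary: "Zuul Gearman queue is not draining"
          dashboard: "https://grafana.wikimedia.org/d/000000322/zuul-gearman"
EOF

# Lint the rule file locally before sending it for review.
promtool check rules zuul_gearman.yaml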
[09:41:14] a ton of documentation, that is great ;)
[09:41:39] I guess I can drop the monitoring::graphite_threshold bit and migrate to a grafana alert + alertmanager notification
[09:42:26] if the metrics are in graphite only, which I'm assuming is the case, then yes
[09:43:49] ah there we go, for zuul/prometheus that'd be T233089
[09:43:49] T233089: Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089
[09:53:47] _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/725290 will make the zuul queue alarm quieter (I have looked at the dashboard, the new threshold should be fine)
[09:54:26] <_joe_> hashar: I've also noticed zuul gets a significant queue size every day around 8 pm our time
[09:55:47] in the last few days I think it was related to the mediawiki security releases
[09:56:05] <_joe_> ok so organic traffic and we know why it happens
[09:56:08] which typically send a few dozen patches, each triggering a couple dozen jobs
[09:56:12] yeah
[09:56:12] <_joe_> yeah
[09:56:42] the intent of this monitoring probe is to fire when there are wayyy too many functions waiting, for example when CI stalls and does not process anything
[09:56:47] which sometimes happens
[09:57:48] godog: does your team happen to have some presentation / training material you could give to releng/engproductivity? I imagine a short presentation to all members might be a good thing
[09:58:08] (I had never heard of alertmanager until today, but I am lagging in catching up with our infra, so maybe it is just me)
[09:59:15] actually I did, since alerts.wikimedia.org is a purple link in my browser, so I must have clicked it at some point in the past. bah... I am getting old
[10:00:12] hashar: AM is cool, see https://wikitech.wikimedia.org/wiki/Alertmanager and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/%2B/refs/heads/master :)
[10:00:26] hashar: yeah, what ema said :)
[10:00:41] sounds easier to have someone knowledgeable give us an overview ;D
[10:03:25] I'm searching for the ops session
[10:04:41] oh even better, then I can relay it to our mailing list / Slack and people will know
[10:05:23] I can't find it, perhaps it wasn't recorded
[10:11:32] so yeah, if you are interested in alertmanager I can give a 30 min presentation
[10:14:41] godog: if you have one ready, I am pretty sure it will be appreciated
[10:14:48] I am guessing performance might like to attend as well
[10:15:48] I am pretty sure Tyler and Leo already talked about doing some cross-team exchanges and that one might be a good start
[10:17:39] indeed
[10:18:06] I'll bring the presentation up with o11y
[10:20:49] godog: do you want me to email the perf/releng managers and loop in you and leo?
[10:23:28] nah that's fine hashar, thank you though
[10:24:05] I am proposing it since I am eager to see more cross-team connections, and presenting the stuff we do to other teams might be a good way to achieve that
[10:24:29] anyway, going to finish that Alertmanager doc. It is a gold mine of information, thanks!
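On the self-service side mentioned earlier, once a team is onboarded it can query and silence its own alerts from the command line with amtool rather than asking for Icinga downtimes. A rough sketch only; the Alertmanager URL and the alert name below are placeholders, not the real endpoints.

# Placeholders: adjust --alertmanager.url and the alert name to your setup.
amtool --alertmanager.url=http://alertmanager.example.org:9093 alert query
amtool --alertmanager.url=http://alertmanager.example.org:9093 silence add \
    alertname=ZuulGearmanQueueBacklog \
    --duration=2h --author=hashar --comment="CI maintenance window"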
[10:26:03] and if that alertmanager alert works, I will drop the other graphite-based monitoring probe which is alerting as well
[10:28:54] oh
[10:29:03] and it can create Phabricator tasks automatically ( https://phabricator.wikimedia.org/p/phaultfinder/ )
[10:29:23] imagine if we had tasks automatically filed for new mediawiki errors appearing after a deployment
[10:30:03] <_joe_> hashar: I'm pretty sure we'd overflow maxint for the task numbers too quickly :P
[10:31:05] the devil is in deduplicating the stream of logs :\
[12:58:15] For the perf team, we check our own logs once a week and fix things as they come up, so in most cases there are only 0-3 normalised errors and no need to create any filters
[12:59:17] The filtering is something I intended as a temporary workaround to make the fallback triage more workable, especially since so many teams take months to even look into it
[13:00:04] Duplicates are fairly rare when developers file tasks for errors in an area they are familiar with
[14:47:18] effie: akosiaris: in case one of you is running k8s benches, I just noticed in today's flame graphs that about 10% of index.php on k8s is spent in BadRequestError, which may or may not be known already, just FYI :)
[14:47:32] Krinkle: yeah that was me
[14:47:45] apparently the list YOU gave me last year is full of 400s :p
[14:48:02] I have two lists, I will do another run to clean them up and keep the 200s
[14:48:11] and run the last test we were discussing
[14:48:16] ah, this is the user-generated urls
[14:48:30] yeah, the ones from weblog
[14:49:04] I have ~800000 urls, but we will see how many are left after we keep the 200s
[14:49:24] I can make a new capture as well.
[14:50:17] e.g. 100K recent URLs that are appserver-ish, GET, 200, and filter out, same as before, all the seemingly non-immutable/garbage ones based on known query parameters.
[14:51:23] if you can get me another 500k 200s that would be great
[14:51:43] it will save me time weeding out urls from my current captures
[14:54:50] Krinkle: ping me when you have time to do so
[14:57:10] effie: I'll do it in the next hour. One special url sample with extra garlic coming up.
[14:57:23] or 500k of those :)
[14:57:25] tx tx
[14:59:42] effie: will the endpoint you're hitting be before apache rewrites, so that things like /w/skins/Vector/resources/skins.vector.styles.legacy/images/user-avatar.svg will work and end up executing /w/static.php?
[15:00:51] I will be hitting the apache ports directly
[15:01:05] actually, the envoy tls terminators
[15:02:54] ok
[15:07:46] puppet enthusiasts: I have a bunch of "exec" resources I'd like to get notified about if they fail; I'm ok with failing the catalog compilation if that happens, is that possible?
[15:08:05] modules/alerts/manifests/deploy/prometheus.pp specifically
[15:13:49] mmhh, or delegate the work to systemd oneshot units
[15:14:06] <_joe_> yeah I don't think we get notified if an exec fails
[15:14:24] <_joe_> the puppet way would be to create a resource
[15:14:37] we don't, but I'd be ok to fail the run
[15:19:43] but yeah, it seems the writing on the wall is "systemd unit"
[15:25:23] effie: I don't remember what we did last time exactly. I have the query figured out, except for one last step: filtering source/cache status. The simplest is to do nothing, which means any N urls that I consider as "Safe MW route GET with 200/304 resp"; this means we include possibly overrepresented urls that would generally not reach the backend that often. Alternatively I could filter for pass/miss only and let it run for a few more minutes to accumulate enough URLs, or perhaps only keep unique URLs and skew it even further.
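To make the pass/miss option concrete, here is a sketch of the extra filtering step, based on the kafkacat pipeline pasted a bit further down in the conversation. It assumes the webrequest JSON exposes the cache disposition in a cache_status field whose backend-reaching values include miss and pass; the field name and values should be double-checked against the actual webrequest schema.

# Keep only requests that actually reached the backend (cache miss/pass) and
# deduplicate, stopping at 500k URLs. cache_status and its values are assumptions.
kafkacat -C -b kafka-jumbo1003.eqiad.wmnet -t webrequest_text -o -15000000 -c15000000 \
  | fgrep '"http_method":"GET"' \
  | grep -E '"uri_path":"/(w|wiki)/' \
  | grep -E '"http_status":"(200|304)"' \
  | grep -E '"cache_status":"(miss|pass)"' \
  | grep -v -E '[?&]token=|CentralAutoLogin' \
  | jq -r .uri_host+.uri_path+.uri_query \
  | sort -u | head -n 500000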
[15:25:57] I do not mind about cache status really
[15:26:06] caches expire, get invalidated etc
[15:26:16] about 30% will be the same enwiki default load.php URLs that all page views have, for example.
[15:26:48] oh I see
[15:27:12] I can give you all, or one, or as many as we see going to the backend layer
[15:27:40] the latter then
[15:27:44] ok :)
[15:27:49] we could also check apache logs ofc
[15:27:54] if that would make things easier
[15:28:16] $ kafkacat -C -b kafka-jumbo1005.eqiad.wmnet,kafka-jumbo1004.eqiad.wmnet,kafka-jumbo1006.eqiad.wmnet,kafka-jumbo1003.eqiad.wmnet -t webrequest_text | fgrep '"http_method":"GET"' | grep -E '"uri_path":"/(w|wiki)/' | grep -E '"http_status":"(200|304)"' | grep -v -E '[?&]token=|CentralAutoLogin' | head | jq -r .uri_host+.uri_path+.uri_query
[15:28:21] This is more or less what I do currently
[15:28:37] which takes only a few seconds to get what we need once I add `-o -15000000 -c15000000` to the kafkacat command
[15:29:43] (from stat1004)
[15:29:47] or stat1007
[15:30:09] cool cool
[19:42:24] hello SRE folks. I have been searching wikitech for something definitive but only finding examples. Do we have guidelines for key length and cipher preference for ssh-keygen?
[19:43:26] tltaylor: there are some notes here: https://wikitech.wikimedia.org/wiki/SRE/Production_access#Generating_your_SSH_key
[19:43:55] yeah, that page is canonical I think -- it's expressed as an example but that's just for ease of use
[19:44:00] the recommendation is to generate ed25519 keys, i.e. ssh-keygen -t ed25519
[19:44:11] ok noted, thank you
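For reference, the recommended invocation spelled out; the comment string and the output file name below are only placeholders.

# Generate an Ed25519 key pair; pick a passphrase when prompted.
# The comment (-C) and output path (-f) are placeholders, adjust to taste.
ssh-keygen -t ed25519 -C "tltaylor@production" -f ~/.ssh/id_ed25519_wmf_prod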