[08:35:46] _joe_: the next awesome patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/725261
[09:15:34] hi, so I sometimes have an alarm about "contint2001.mgmt/SSH" for which I filed a task back in May. I am wondering whom / which team I should poke about it to get it investigated and addressed? It looks like some network is flapping https://phabricator.wikimedia.org/T283582
[09:15:49] it is the management network so not the end of the world, but those alarms are a bit annoying ;)
[09:19:45] <_joe_> I find the daily alerts about the zuul queue being full more annoying, but ymmv
[09:20:46] <_joe_> but in this case you might want to ask top.ranks or X.ionox for some attention
[09:21:07] <_joe_> oh apparently not, those are unmanaged switches
[09:21:15] <_joe_> ok, I wouldn't really sweat it then
[09:27:33] I will add them ;)
[09:28:43] <_joe_> hashar: as arzhel already said on the task, those are unmanaged switches so not much they can do
[09:29:07] <_joe_> dcops is a better option
[09:29:26] ah, so they would have to drop the unmanaged switch possibly
[09:29:31] <_joe_> https://phabricator.wikimedia.org/T283582#7111164
[09:30:03] <_joe_> the point is we have no visibility remotely on what might not work
[09:31:54] I did some changes to the task, we will see whether they get noticed :-]
[09:32:09] the zuul queue alarm, let me check it and tweak it now
[09:32:13] it is indeed spammy
[09:33:49] PSA: if you have zuul queue metrics in prometheus, consider also migrating to alerts.git / alertmanager, so you can self-manage among other things
[09:36:03] zuul metrics in this case, but really any metrics in general
[09:38:03] can we get Icinga to check an alert that is defined in a Grafana dashboard?
[09:38:54] only alertmanager, not icinga, but yes
[09:39:45] I personally prefer alerts revision-controlled in git, but we support grafana dashboards too
[09:39:49] the dashboard at https://grafana-rw.wikimedia.org/d/000000322/zuul-gearman?tab=alert&editPanel=10&orgId=1 has an alert threshold with a notification sent to `cxserver`
[09:39:55] https://wikitech.wikimedia.org/wiki/Alertmanager#Grafana_alerts
[09:40:19] but the notification comes from a monitoring::graphite_threshold defined in Puppet
[09:40:25] also this, if your team hasn't been onboarded to alertmanager
[09:40:26] https://wikitech.wikimedia.org/wiki/Alertmanager#I'm_part_of_a_new_team_that_needs_onboarding_to_Alertmanager,_what_do_I_need_to_do?
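For illustration, here is a minimal sketch of what such a self-managed rule in alerts.git could look like, checked locally with promtool (which ships with Prometheus). The file name, the metric name zuul_gearman_functions_waiting, the threshold, and the label values are all assumptions made up for this example; at this point in the conversation the Zuul metrics are not yet exported to Prometheus (see T233089 below).

# Hypothetical alerts.git-style rule file; metric name, threshold and labels
# are illustrative assumptions, not the actual names in use.
cat > zuul_gearman.yaml <<'EOF'
groups:
  - name: zuul-gearman
    rules:
      - alert: ZuulGearmanQueueBacklog
        # zuul_gearman_functions_waiting is a hypothetical metric name
        expr: zuul_gearman_functions_waiting > 200
        for: 15m
        labels:
          team: releng
          severity: warning
        annotations:
          summary: "Zuul Gearman queue is not draining"
          dashboard: "https://grafana.wikimedia.org/d/000000322/zuul-gearman"
EOF

# Lint the rule file locally before sending it for review.
promtool check rules zuul_gearman.yaml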
[09:41:14] a ton of documentation, that is great ;)
[09:41:39] I guess I can drop the monitoring::graphite_threshold bit and migrate to a grafana alert + alertmanager notification
[09:42:26] if the metrics are in graphite only, which I'm assuming is the case, then yes
[09:43:49] ah there we go, for zuul/prometheus that'd be T233089
[09:43:49] T233089: Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089
[09:53:47] _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/725290 will make the zuul queue alarm quieter (I have looked at the dashboard, the new threshold should be fine)
[09:54:26] <_joe_> hashar: I've also noticed zuul gets a significant queue size every day around 8 pm our time
[09:55:47] in the last few days I think it was related to the mediawiki security releases
[09:56:05] <_joe_> ok so organic traffic and we know why it happens
[09:56:08] which typically send a few dozen patches, each triggering a couple dozen jobs
[09:56:12] yeah
[09:56:12] <_joe_> yeah
[09:56:42] the intent of this monitoring probe is to fire when there are wayyy too many functions waiting, for example when CI stalls and does not process anything
[09:56:47] which sometimes happens
[09:57:48] godog: does your team happen to have some presentation / training material you could give to releng/engproductivity? I imagine a short presentation to all members might be a good thing
[09:58:08] (I had never heard of alertmanager until today, but I am lagging in catching up with our infra, so maybe it is just me)
[09:59:15] actually I did, since alerts.wikimedia.org is a purple link in my browser, so I must have clicked it at some point in the past. bah... I am getting old
[10:00:12] hashar: AM is cool, see https://wikitech.wikimedia.org/wiki/Alertmanager and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/%2B/refs/heads/master :)
[10:00:26] hashar: yeah, what ema said :)
[10:00:41] sounds easier to have someone knowledgeable give us an overview ;D
[10:03:25] I'm searching for the ops session
[10:04:41] oh even better, then I can relay it to our mailing list / Slack and people will know
[10:05:23] I can't find it, perhaps it wasn't recorded
[10:11:32] so yeah, if you are interested in alertmanager I can give a 30 min presentation
[10:14:41] godog: if you have one ready, I am pretty sure it will be appreciated
[10:14:48] I am guessing performance might like to attend as well
[10:15:48] I am pretty sure Tyler and Leo already talked about doing some cross-team exchanges and that one might be a good start
[10:17:39] indeed
[10:18:06] I'll bring the presentation up with o11y
[10:20:49] godog: do you want me to email the perf/releng managers and loop in you and leo?
[10:23:28] nah that's fine hashar, thank you though
[10:24:05] I am proposing it since I am eager to see more cross-team connections, and presenting the stuff we do to other teams might be a good way to achieve that
[10:24:29] anyway, going to finish that Alertmanager doc. It is a gold mine of information, thanks!
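On the self-service side mentioned earlier, once a team is onboarded it can query and silence its own alerts from the command line with amtool rather than asking for Icinga downtimes. A rough sketch only; the Alertmanager URL and the alert name below are placeholders, not the real endpoints.

# Placeholders: adjust --alertmanager.url and the alert name to your setup.
amtool --alertmanager.url=http://alertmanager.example.org:9093 alert query
amtool --alertmanager.url=http://alertmanager.example.org:9093 silence add \
    alertname=ZuulGearmanQueueBacklog \
    --duration=2h --author=hashar --comment="CI maintenance window"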
[10:26:03] and if that alertmanager alert works, I will drop the other graphite-based monitoring probe which is alerting as well
[10:28:54] oh
[10:29:03] and it can create Phabricator tasks automatically ( https://phabricator.wikimedia.org/p/phaultfinder/ )
[10:29:23] imagine if we had tasks automatically filed for new mediawiki errors appearing after a deployment
[10:30:03] <_joe_> hashar: I'm pretty sure we'd overflow maxint for the task numbers too quickly :P
[10:31:05] the devil is in deduplicating the stream of logs :\
[12:58:15] For the perf team, we check our own logs once a week and fix things as they come up, so in most cases there are only 0-3 normalised errors and no need to create any filters
[12:59:17] The filtering is something I intended as a temporary workaround to make the fallback triage more workable, especially since so many teams take months to even look into it
[13:00:04] Duplicates are fairly rare when developers file tasks for errors in an area they are familiar with
[14:47:18] effie: akosiaris: in case one of you is running k8s benches, I just noticed in today's flame graphs that about 10% of index.php on k8s is spent in BadRequestError, which may or may not be known already, just FYI :)
[14:47:32] Krinkle: yeah that was me
[14:47:45] apparently the list YOU gave me last year is full of 400s :p
[14:48:02] I have two lists, I will do another run to clean them up and keep the 200s
[14:48:11] and run the last test we were discussing
[14:48:16] ah, this is the user-generated urls
[14:48:30] yeah, the ones from weblog
[14:49:04] I have ~800000 urls, but we will see how many are left after we keep the 200s
[14:49:24] I can make a new capture as well.
[14:50:17] e.g. 100K recent URLs that are appserver-ish, GET, 200, and filter out, same as before, all the seemingly non-immutable/garbage ones based on known query parameters.
[14:51:23] if you can get me another 500k 200s that would be great
[14:51:43] it will save me time weeding out urls from my current captures
[14:54:50] Krinkle: ping me when you have time to do so
[14:57:10] effie: I'll do it in the next hour. One special url sample with extra garlic coming up.
[14:57:23] or 500k of those :)
[14:57:25] tx tx
[14:59:42] effie: will the endpoint you're hitting be before apache rewrites, so that things like /w/skins/Vector/resources/skins.vector.styles.legacy/images/user-avatar.svg will work and end up executing /w/static.php?
[15:00:51] I will be hitting the apache ports directly
[15:01:05] actually, the envoy tls terminators
[15:02:54] ok
[15:07:46] puppet enthusiasts: I have a bunch of "exec" resources I'd like to get notified about if they fail; I'm ok with failing the catalog compilation if that happens, is that possible?
[15:08:05] modules/alerts/manifests/deploy/prometheus.pp specifically
[15:13:49] mmhh, or delegate the work to systemd oneshot units
[15:14:06] <_joe_> yeah I don't think we get notified if an exec fails
[15:14:24] <_joe_> the puppet way would be to create a resource
[15:14:37] we don't, but I'd be ok to fail the run
[15:19:43] but yeah, it seems the writing on the wall is "systemd unit"
[15:25:23] effie: I don't remember what we did last time exactly. I have the query figured out, except for one last step: filtering source/cache status. The simplest is to do nothing, which means any N urls that I consider as "Safe MW route GET with 200/304 resp"; this means we include possibly overrepresented urls that would generally not reach the backend that often. Alternatively I could filter for pass/miss only and let it run for a few more minutes to accumulate enough URLs, or perhaps only keep unique URLs and skew it even further.
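To make the pass/miss option concrete, here is a sketch of the extra filtering step, based on the kafkacat pipeline pasted a bit further down in the conversation. It assumes the webrequest JSON exposes the cache disposition in a cache_status field whose backend-reaching values include miss and pass; the field name and values should be double-checked against the actual webrequest schema.

# Keep only requests that actually reached the backend (cache miss/pass) and
# deduplicate, stopping at 500k URLs. cache_status and its values are assumptions.
kafkacat -C -b kafka-jumbo1003.eqiad.wmnet -t webrequest_text -o -15000000 -c15000000 \
  | fgrep '"http_method":"GET"' \
  | grep -E '"uri_path":"/(w|wiki)/' \
  | grep -E '"http_status":"(200|304)"' \
  | grep -E '"cache_status":"(miss|pass)"' \
  | grep -v -E '[?&]token=|CentralAutoLogin' \
  | jq -r .uri_host+.uri_path+.uri_query \
  | sort -u | head -n 500000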
[15:25:57] I do not mind about cache status really
[15:26:06] caches expire, get invalidated etc
[15:26:16] about 30% will be the same enwiki default load.php URLs that all page views have, for example.
[15:26:48] oh I see
[15:27:12] I can give you all, or one, or as many as we see going to the backend layer
[15:27:40] the latter then
[15:27:44] ok :)
[15:27:49] we could also check apache logs ofc
[15:27:54] if that would make things easier
[15:28:16] $ kafkacat -C -b kafka-jumbo1005.eqiad.wmnet,kafka-jumbo1004.eqiad.wmnet,kafka-jumbo1006.eqiad.wmnet,kafka-jumbo1003.eqiad.wmnet -t webrequest_text | fgrep '"http_method":"GET"' | grep -E '"uri_path":"/(w|wiki)/' | grep -E '"http_status":"(200|304)"' | grep -v -E '[?&]token=|CentralAutoLogin' | head | jq -r .uri_host+.uri_path+.uri_query
[15:28:21] This is more or less what I do currently
[15:28:37] which takes only a few seconds to get what we need once I add `-o -15000000 -c15000000` to the kafkacat command
[15:29:43] (from stat1004)
[15:29:47] or stat1007
[15:30:09] cool cool
[19:42:24] hello SRE folks. I have been searching wikitech for something definitive but only finding examples. Do we have guidelines for key length and cipher preference for ssh-keygen?
[19:43:26] tltaylor: there are some notes here: https://wikitech.wikimedia.org/wiki/SRE/Production_access#Generating_your_SSH_key
[19:43:55] yeah, that page is canonical I think -- it's expressed as an example but that's just for ease of use
[19:44:00] the recommendation is to generate ed25519 keys, i.e. ssh-keygen -t ed25519
[19:44:11] ok noted, thank you
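For reference, the recommended invocation spelled out; the comment string and the output file name below are only placeholders.

# Generate an Ed25519 key pair; pick a passphrase when prompted.
# The comment (-C) and output path (-f) are placeholders, adjust to taste.
ssh-keygen -t ed25519 -C "tltaylor@production" -f ~/.ssh/id_ed25519_wmf_prod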