[08:07:16] Yay, back in 2022, may it be merry for you all!
[08:25:01] o/
[08:25:07] happy new year!
[08:39:42] I think I'll give up on upgrading spark just to get rid of this very old guava dep, it does not seem worth it after all
[10:42:37] lunch
[10:42:52] errand (back in 2.5-3h)
[11:37:24] lunch
[13:56:48] Greetings all, and welcome back zpapierski !
[14:04:52] inflatador: great seeing you here (I'm quite aware that you have been here the whole time :) - looking forward to our 1/1 !
[14:05:46] super late lunch (almost dinner even)
[14:06:01] For sure, should be fun
[14:07:26] Can anyone help me merge this guy? https://gerrit.wikimedia.org/r/c/operations/alerts/+/751513/ . I have a "rebase" but no "submit" button. Looking at the Gerrit tutorial on MW, I might need a "+2"?
[14:08:04] inflatador: you probably don't have +2 access to this repo yet. Let me see if I can find who has the right to add you there
[14:09:10] inflatador: do you see a "+2" button if you click reply?
[14:09:15] inflatador: it looks like the ldap/ops group is the owner: https://gerrit.wikimedia.org/r/admin/repos/operations/alerts,access
[14:09:38] in theory it should come with the ops ldap group, so if you don't see it, try logging out and back in so it refreshes the group cache
[14:10:08] taavi: when I click "reply", there IS a code-review button and "+2" is an option. Do I need to give my own code a "+2" before I can merge?
[14:10:10] we might have missed the step to add you to that ldap group
[14:10:35] yeah, +2 is referring to that button
[14:10:54] OK, let me give that a shot
[14:11:06] inflatador: Oh yes, the gerrit terminology is confusing. Code-Review +2 means that you're OK to merge
[14:11:35] so in most repositories, giving a Code-Review +2 is going to start Jenkins jobs on that commit, and if they pass, it'll get merged
[14:12:18] operations/puppet is the notable exception (and there are a few others), where you need to manually "submit" it as well after giving it a +2
[14:12:25] there is a whole rule engine (Prolog based if I remember correctly) to define what the pre-requisites for a merge are. For most repos we need Verified +2 (meaning that CI is OK to merge) and Code-Review +2, which means that a human agrees to merge.
[14:12:40] Looks like that's happening now. And yes, giving my own code a "+2" is a little unintuitive
[14:13:08] For most repositories, merging your own code is viewed as bad Karma.
[14:13:55] Yeah, it's a bad practice in general, but I guess it's OK in (very limited) circumstances like this
[14:14:02] in the mediawiki world, +2 (not +1) is given by the reviewer when they think the patch is good and can be merged and deployed in the next "train" (a weekly thing where new mediawiki code is deployed)
[14:14:13] SRE related repositories are the exception, since in a lot of cases we want the owner of the change to be around when it is deployed (and puppet needs some additional steps)
[14:14:38] Never, ever merge a change without someone else giving you a +1.
[14:15:33] Even in case of emergency, having a second pair of eyes review what you're doing in a moment of stress costs much less time than cleaning up after something merged too hastily.
[14:15:35] If I ever even think of doing that, feel free to fly over here and slap me
[14:16:30] All rules are meant to be broken, but make sure you understand why the rule is there before breaking it.
[14:16:54] inflatador: does your browser support POIPAAS?
[14:16:55] https://poipaas.com/
[14:18:12] I saw the POIPAAS truck in my neighborhood the other day. I'm sure we'll have it soon!
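As an aside on the mechanics described above: the same votes can also be cast through Gerrit's SSH API rather than the web UI's Reply button. A minimal sketch, assuming an SSH key registered with gerrit.wikimedia.org; the change number is the one from the discussion, while the patchset number and username are hypothetical:

    # vote Code-Review +2 on patchset 1 of change 751513
    ssh -p 29418 <user>@gerrit.wikimedia.org gerrit review --code-review +2 751513,1
    # for repos like operations/puppet that gate merges behind an explicit submit:
    ssh -p 29418 <user>@gerrit.wikimedia.org gerrit review --submit 751513,1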
[14:54:23] errand
[14:56:58] dcausse: anything interesting happened with wcqs updater when I was away?
[15:07:50] zpapierski: Erik ran a quick test of the updater and it looked promising, and this morning I deployed a patch to increase the capacity of the session clusters in k8s
[15:08:20] on staging or already on eqiad/codfw?
[15:08:40] (I mean the test)
[15:09:02] so the test was on prod data but ran in yarn
[15:09:09] I see
[15:09:22] anyway, cool :)
[15:10:57] there's this patch https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/745629 pending
[15:11:10] yep, I'm looking at it rn
[15:33:49] just realized that https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/rdf/+/refs/heads/master/rdf-spark-tools/src/main/scala/org/wikidata/query/rdf/spark/EntityRevisionMapGenerator.scala needs to be adapted for wcqs
[15:33:51] and it's not scheduled by airflow like the one for wikidata
[15:38:20] filed T298622 for whoever has time to work on it
[15:38:21] T298622: Adapt EntityRevisionMapGenerator for wcqs - https://phabricator.wikimedia.org/T298622
[15:40:40] gehel or anyone else, what is the proper list for discussing this type of change with the general SRE group? https://phabricator.wikimedia.org/T298570
[15:42:23] I'd say the general SRE group is basically the definition of the #wikimedia-sre channel
[15:43:21] inflatador: or if you want to have a more permanent trace, the ops@ mailing list (https://lists.wikimedia.org/postorius/lists/ops.lists.wikimedia.org/)
[15:49:41] Thanks all, will hit the list first to have a bit more permanent record
[15:51:00] dcausse: are we ok with presenting our lightning talk next week at tech-all? note that unfortunately it conflicts with office hours
[15:51:18] zpapierski: fine by me
[15:51:39] ok, I am as well
[15:52:14] although the recent wcqs auth controversy might bring some people, but we have an additional meeting to address that, so it's probably ok
[15:53:46] not sure... but perhaps, I think this kind of discussion generally stays on phab/wiki
[15:54:00] remember skolemization?
[15:55:00] true but I think this is a bit different here, we'll see :)
[15:55:17] I agree it's different, I think it generated even more heat :)
[15:56:49] Hey-o! I survived 90 minutes standing outside in the ~0°C/32°F cold to have a giant q-tip stuck up my nose. Good times!
[15:57:11] :)
[15:58:24] You're having a peachy keen day so far, I bet
[16:01:41] \o
[16:01:49] hmm, no wednesday meeting on my cal?
[16:03:35] no meeting week
[16:04:17] but happy to jump into one if we feel we need/want it
[16:04:31] mostly i'm wondering where we stand on wcqs and how we get that moving
[16:05:00] inflatador: double peachy keen, in fact. I can feel my toes now, so all is well.
[16:05:11] i suppose i can work on the entity revision map, i thought i brought that up to zp early december and he did a test run for wcqs
[16:05:26] i hadn't looked at what it really does, just that it does something :)
[16:06:01] I quickly looked at the code and it's the same problem as usual: how to build this UriScheme
[16:06:21] perhaps it worked with a wikidata-like UriScheme?
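For context on the test runs discussed next: the job writes its output under a fixed HDFS location, so checking whether it ever ran for wcqs is a one-liner. A minimal sketch, assuming a stat host with HDFS access (the path is the one quoted just below):

    # list previous runs of the wcqs entity revision map job
    hdfs dfs -ls hdfs:///wmf/data/discovery/wcqs/entity_revision_map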
[16:06:30] ebernhardson: I don't mind jumping in as well, especially since a week from now is an office hour and then I'm gone
[16:07:31] looking at hdfs:///wmf/data/discovery/wcqs/entity_revision_map I see two runs indeed
[16:07:49] 20211107 20211114
[16:08:03] I think I did two runs, actually
[16:08:06] not sure though
[16:08:10] i was off by a month, close enough i guess :)
[16:08:26] zpapierski: do you remember the options you passed?
[16:08:30] if we just need cli args, that's super easy.
[16:09:09] yes I think that's mostly it, something similar to what we've done for the flink app perhaps
[16:11:24] oh it does not even need that since it's using the schema:version which is project agnostic
[16:11:46] I have a script on stat1007
[16:11:59] extract_rev_map.sh in my home dir
[16:12:32] well the codebase needs a UriScheme and it's better if it's properly constructed, but in this particular case it works with whatever value is passed as --hostname
[16:13:07] huh, I didn't set any additional parameters though
[16:13:33] inflatador: did you find a way to check that https://gerrit.wikimedia.org/r/c/operations/alerts/+/751513/ was deployed properly?
[16:13:37] hostname is optional and is set to wikidata.org by default
[16:14:17] hm... it needs a proper UriScheme to extract the entity id with urisSchemeProvider().entityURItoId(statement.getSubject.toString)
[16:14:32] dcausse: can join unmeeting room
[16:14:37] ah
[16:15:02] bootstrap is in ./flink-1.13.2/bootstrap.sh
[16:15:07] inflatador: it looks like alerts.wm.o only shows active alerts
[16:19:59] gehel: eb helped me find some info in logstash https://logstash.wikimedia.org/app/dashboards#/view/8b1907c0-2062-11ec-85b7-9d1831ce7631?_g=h@75c69a9&_a=h@09865d8
[16:22:09] if you're asking "am I confident the alert will fire when things are bad, but NOT when things are OK?", I'm not completely confident. I took down the values from 22 Oct, 30 Dec, and non-alerting times, but I don't feel like I grasp everything involved
[16:22:52] No, I'm asking if you are confident that the alert will trigger on the new 0.03 threshold
[16:23:15] Are you sure that what you merged has been properly applied on our alerting infrastructure?
[16:27:14] It was merged, but I haven't checked to see if it was pushed out. Do those configs live on individual nodes, or on a prometheus server?
[16:29:35] I assume it exists on the prometheus server. I'm fairly confident that the deployment is working correctly and that the change has been applied, but I've never done such a change myself and I'm not super familiar with how it actually works.
[16:30:46] Let me poke around wikitech and I'll hit up folks in wikimedia-sre if I can't find the prom server
[16:31:28] looks like that repo is cloned via https://github.com/wikimedia/puppet/blob/production/modules/alerts/manifests/deploy/prometheus.pp and https://github.com/wikimedia/puppet/blob/production/modules/alerts/manifests/deploy/thanos.pp
[16:33:49] this ends up mapping to the prometheus::pop and thanos::frontend roles
[16:34:18] we could check one of those nodes to see if the change is there
[16:34:55] But since this is all going through a simple git::clone, there isn't much room for things to go wrong. I'll assume it's all working fine.
[16:35:21] I would like to check it, but is there an actual inventory somewhere? Reading https://wikitech.wikimedia.org/wiki/Prometheus ATM
[16:36:42] I don't see any obvious link in thanos.wm.o
[16:36:56] don't lose too much time on this. It is most probably all fine!
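A minimal sketch of such a check, assuming shell access to one of the Prometheus hosts; the host and file name are the ones identified a few messages later, and grepping for the new threshold is an assumption about the rule file's contents:

    # confirm the alert rules file was refreshed by the git::clone deploy
    ssh prometheus2003.codfw.wmnet
    stat /srv/alerts/team-search-platform_blazegraph.yaml       # modification date should be recent
    grep -n '0.03' /srv/alerts/team-search-platform_blazegraph.yaml   # the new threshold should show up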
[16:37:17] no worries, I am gaining some valuable context on things if nothing else
[16:40:24] if I tcpdump from one of the WDQS servers, it looks like prometheus2003.codfw.wmnet is scraping it for metrics
[16:41:02] prometheus2003.codfw.wmnet has a /srv/alerts/team-search-platform_blazegraph.yaml with a modified date of today, and it matches what I merged earlier
[16:44:43] looks good!
[17:05:11] quick workout, back in ~30
[17:42:47] back
[17:42:56] regarding inventory (list of machines that do things), i don't know that it's how most people do it, but usually i find the appropriate puppet class, trace that back to a role, then look in manifests/site.pp of the puppet repo and choose a random machine
[17:43:15] I do the same!
[17:43:52] depending on what you mean by inventory or what kind of question you're asking, netbox is a good place
[17:44:26] Thanks, adding both methods to my notes. Although it would be pretty cool if we had a service mesh ;)
[17:47:24] hmm, i ran disable-puppet on wcqs-beta but it still sent a puppet failure email :S
[17:47:50] ssh'ing in verifies puppet is disabled. I guess the failure email comes from elsewhere
[17:53:05] the alert probably comes from checking the latest puppet report :/
[17:53:10] dinner time, see you later!
[17:57:56] yeah, and even if the last run was successful it would complain if the last run was too old
[18:52:35] hmm, did a successful puppet run on wcqs-beta and then turned it back off. Not sure what's appropriate, the new puppet code (intentionally) can't deploy an un-authenticated wcqs instance. for now turned it back off and adjusted the nginx config back to un-auth'd
[18:54:39] we (wmcs) prefer to have puppet running, I guess it mostly depends on how long you imagine it's going to be that way
[18:55:09] it's being decom'd once the prod service is up, but setting up the prod service is what is making the beta site no longer work
[18:57:45] lunch/errand, back in ~45-60m
[18:59:07] * ebernhardson finds it oddly reassuring that moving more_like traffic from codfw->eqiad reduced cluster cpu by 10% in codfw and added 10% in eqiad.
[19:49:39] and back
[19:58:14] now my turn for lunch :)
[20:48:13] back
[21:42:47] Is this the complete list of plugins that have to be updated before we can move to ES 6.8.20? https://phabricator.wikimedia.org/T271777
[21:47:48] inflatador: hmm, there is probably another ticket, i'll look. For the source of truth i would query one of the clusters, such as https://search.svc.eqiad.wmnet:9243/_cat/plugins | sort
[21:49:46] inflatador: or i guess, we ship the plugins as a debian package, so https://github.com/wikimedia/operations-software-elasticsearch-plugins is another good source of info
[21:52:32] I suppose the plugins ticket is https://phabricator.wikimedia.org/T294499 but now i wonder if we are using "plugins" to mean the same thing :)
[21:57:21] ebernhardson: thanks, I know we will have to do the debian pkg thing too
[22:24:19] Still looking at the ES7 upgrade epic (https://phabricator.wikimedia.org/T263142), I'm seeing plugins for elasticsearch, and a mediawiki plugin related to ElasticSearch (CirrusSearch). I'm focusing on compatibility-testing plugins for ES itself as opposed to the MW plugins, let me know if I need to do something else
[22:26:13] inflatador: that should be correct, i think the confusion is cirrus is an extension (to mediawiki) and you're looking at the elasticsearch plugins. Checking through all the plugins in the plugin_urls.lst sounds right
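A minimal sketch of the query suggested above, hitting the live cluster's _cat API; the endpoint is the one given in the discussion, while the awk variant that collapses per-node rows down to distinct plugin names is an addition:

    # list installed plugins, one row per node+plugin
    curl -s https://search.svc.eqiad.wmnet:9243/_cat/plugins | sort
    # or just the distinct plugin names (second column of the _cat output)
    curl -s https://search.svc.eqiad.wmnet:9243/_cat/plugins | awk '{print $2}' | sort -u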
[22:26:35] inflatador: not to say calling cirrus a plugin would be wrong, plugin and extension are about the same :) But each context has a unique name i suppose
[22:27:36] The repo is called "mediawiki-extensions-CirrusSearch" so that's a fair point
[23:45:10] ebernhardson: re: your earlier comment about puppet classes/manifests. Where can I find them? There are only a few "site.pp"s in the operations/puppet repo
[23:58:59] inflatador: you want manifests/site.pp from the root of the repository, the ones deeper in are unrelated to this purpose
[23:59:09] inflatador: it's the one that is mostly filled with node declarations
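A minimal sketch of that lookup, assuming a local checkout of the puppet repo and the role(...) declaration style used inside site.pp's node blocks; the role name is taken from the earlier prometheus discussion:

    # clone the repo (the github mirror linked earlier)
    git clone https://github.com/wikimedia/puppet.git && cd puppet
    # show the node declarations around a given role
    grep -B 5 'role(prometheus::pop)' manifests/site.pp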