[08:07:16] Yay, back in 2022, may it be merry for you all!
[08:25:01] o/
[08:25:07] happy new year!
[08:39:42] I think I'll give up on upgrading spark just to get rid of this very old guava dep, it does not seem worth it after all
[10:42:37] lunch
[10:42:52] errand (back in 2.5-3h)
[11:37:24] lunch
[13:56:48] Greetings all, and welcome back zpapierski !
[14:04:52] inflatador: great seeing you here (I'm quite aware that you have been here the whole time :) - looking forward to our 1/1 !
[14:05:46] super late lunch (almost dinner even)
[14:06:01] For sure, should be fun
[14:07:26] Can anyone help me merge this guy? https://gerrit.wikimedia.org/r/c/operations/alerts/+/751513/ . I have a "rebase" but no "submit" button. Looking at the Gerrit tutorial on MW, I might need a "+2"?
[14:08:04] inflatador: you probably don't have +2 access to this repo yet. Let me see if I can find who has the right to add you there
[14:09:10] inflatador: do you see a "+2" button if you click reply?
[14:09:15] inflatador: it looks like the ldap/ops group is the owner: https://gerrit.wikimedia.org/r/admin/repos/operations/alerts,access
[14:09:38] in theory it should come with the ops ldap group, so if you don't see it, try logging out and back in so it refreshes the group cache
[14:10:08] taavi: when I click "reply", there IS a code-review button and "+2" is an option. Do I need to give my own code a "+2" before I can merge?
[14:10:10] we might have missed the step to add you to that ldap group
[14:10:35] yeah, +2 is referring to that button
[14:10:54] OK, let me give that a shot
[14:11:06] inflatador: Oh yes, the gerrit terminology is confusing. Code-Review +2 means that you're OK to merge
[14:11:35] so in most repositories, giving a Code-Review +2 is going to start Jenkins jobs on that commit, and if they pass, it'll get merged
[14:12:18] operations/puppet is the notable exception (and there are a few others), where you need to manually "submit" it as well after giving it a +2
[14:12:25] there is a whole rule engine (Prolog based if I remember correctly) to define what the pre-requisites for a merge are. For most repos we need Verified +2 (meaning that CI is OK to merge) and Code-Review +2, which means that a human agrees to merge.
[14:12:40] Looks like that's happening now. And yes, giving my own code a "+2" is a little unintuitive
[14:13:08] For most repositories, merging your own code is viewed as bad Karma.
[14:13:55] Yeah, it's a bad practice in general, but I guess it's OK in (very limited) circumstances like this
[14:14:02] in the mediawiki world, +2 (not +1) is given by the reviewer when they think the patch is good and can be merged and deployed in the next "train" (a weekly thing where new mediawiki code is deployed)
[14:14:13] SRE related repositories are the exception, since in a lot of cases we want the owner of the change to be around when it is deployed (and puppet needs some additional steps)
[14:14:38] Never, ever merge a change without someone else giving you a +1.
[14:15:33] Even in case of emergency, having a second pair of eyes review what you're doing in a moment of stress costs much less time than cleaning up after something merged too hastily.
[14:15:35] If I ever even think of doing that, feel free to fly over here and slap me
[14:16:30] All rules are meant to be broken, but make sure you understand why the rule is there before breaking it.
[14:16:54] inflatador: does your browser support POIPAAS?
[14:16:55] https://poipaas.com/
[14:18:12] I saw the POIPAAS truck in my neighborhood the other day. I'm sure we'll have it soon!
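As an aside on the mechanics described above: the same votes can also be cast through Gerrit's SSH API rather than the web UI's Reply button. A minimal sketch, assuming an SSH key registered with gerrit.wikimedia.org; the change number is the one from the discussion, while the patchset number and username are hypothetical:

    # vote Code-Review +2 on patchset 1 of change 751513
    ssh -p 29418 <user>@gerrit.wikimedia.org gerrit review --code-review +2 751513,1
    # for repos like operations/puppet that gate merges behind an explicit submit:
    ssh -p 29418 <user>@gerrit.wikimedia.org gerrit review --submit 751513,1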
[14:54:23] errand
[14:56:58] dcausse: anything interesting happened with wcqs updater when I was away?
[15:07:50] zpapierski: Erik ran a quick test of the updater and it looked promising, and this morning I deployed a patch to increase the capacity of the session clusters in k8s
[15:08:20] on staging or already on eqiad/codfw?
[15:08:40] (I mean the test)
[15:09:02] so the test was on prod data but ran in yarn
[15:09:09] I see
[15:09:22] anyway, cool :)
[15:10:57] there's this patch https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/745629 pending
[15:11:10] yep, I'm looking at it rn
[15:33:49] just realized that https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/rdf/+/refs/heads/master/rdf-spark-tools/src/main/scala/org/wikidata/query/rdf/spark/EntityRevisionMapGenerator.scala needs to be adapted for wcqs
[15:33:51] and it's not scheduled by airflow like the one for wikidata
[15:38:20] filed T298622 for whoever has time to work on it
[15:38:21] T298622: Adapt EntityRevisionMapGenerator for wcqs - https://phabricator.wikimedia.org/T298622
[15:40:40] gehel or anyone else, what is the proper list for discussing this type of change with the general SRE group? https://phabricator.wikimedia.org/T298570
[15:42:23] I'd say the general SRE group is basically the definition of the #wikimedia-sre channel
[15:43:21] inflatador: or if you want to have a more permanent trace, the ops@ mailing list (https://lists.wikimedia.org/postorius/lists/ops.lists.wikimedia.org/)
[15:49:41] Thanks all, will hit the list first to have a bit more permanent record
[15:51:00] dcausse: are we ok with presenting our lightning talk next week at tech-all? note that unfortunately it conflicts with office hours
[15:51:18] zpapierski: fine by me
[15:51:39] ok, I am as well
[15:52:14] although the recent wcqs auth controversy might bring some people, but we have an additional meeting to address that, so it's probably ok
[15:53:46] not sure... but perhaps, I think this kind of discussion generally stays on phab/wiki
[15:54:00] remember skolemization?
[15:55:00] true but I think this is a bit different here, we'll see :)
[15:55:17] I agree it's different, I think it generated even more heat :)
[15:56:49] Hey-o! I survived 90 minutes standing outside in the ~0°C/32°F cold to have a giant q-tip stuck up my nose. Good times!
[15:57:11] :)
[15:58:24] You're having a peachy keen day so far, I bet
[16:01:41] \o
[16:01:49] hmm, no wednesday meeting on my cal?
[16:03:35] no meeting week
[16:04:17] but happy to jump into one if we feel we need/want it
[16:04:31] mostly i'm wondering where we stand on wcqs and how we get that moving
[16:05:00] inflatador: double peachy keen, in fact. I can feel my toes now, so all is well.
[16:05:11] i suppose i can work on the entity revision map, i thought i brought that up to zp early december and he did a test run for wcqs
[16:05:26] i hadn't looked at what it really does, just that it does something :)
[16:06:01] I quickly looked at the code and it's the same problem as usual: how to build this UriScheme
[16:06:21] perhaps it worked with a wikidata-like UriScheme?
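For context on the test runs discussed next: the job writes its output under a fixed HDFS location, so checking whether it ever ran for wcqs is a one-liner. A minimal sketch, assuming a stat host with HDFS access (the path is the one quoted just below):

    # list previous runs of the wcqs entity revision map job
    hdfs dfs -ls hdfs:///wmf/data/discovery/wcqs/entity_revision_map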
[16:06:30] ebernhardson: I don't mind jumping in as well, especially since a week from now is an office hour and then I'm gone
[16:07:31] looking at hdfs:///wmf/data/discovery/wcqs/entity_revision_map I see two runs indeed
[16:07:49] 20211107 20211114
[16:08:03] I think I did two runs, actually
[16:08:06] not sure though
[16:08:10] i was off by a month, close enough i guess :)
[16:08:26] zpapierski: do you remember the options you passed?
[16:08:30] if we just need cli args, that's super easy.
[16:09:09] yes I think that's mostly it, something similar to what we've done for the flink app perhaps
[16:11:24] oh it does not even need that since it's using the schema:version which is project agnostic
[16:11:46] I have a script on stat1007
[16:11:59] extract_rev_map.sh in my home dir
[16:12:32] well the codebase needs a UriScheme and it's better if it's properly constructed, but in this particular case it works with whatever value is passed as --hostname
[16:13:07] huh, I didn't set any additional parameters though
[16:13:33] inflatador: did you find a way to check that https://gerrit.wikimedia.org/r/c/operations/alerts/+/751513/ was deployed properly?
[16:13:37] hostname is optional and is set to wikidata.org by default
[16:14:17] hm... it needs a proper UriScheme to extract the entity id with urisSchemeProvider().entityURItoId(statement.getSubject.toString)
[16:14:32] dcausse: can join unmeeting room
[16:14:37] ah
[16:15:02] bootstrap is in ./flink-1.13.2/bootstrap.sh
[16:15:07] inflatador: it looks like alerts.wm.o only shows active alerts
[16:19:59] gehel: eb helped me find some info in logstash https://logstash.wikimedia.org/app/dashboards#/view/8b1907c0-2062-11ec-85b7-9d1831ce7631?_g=h@75c69a9&_a=h@09865d8
[16:22:09] if you're asking "am I confident the alert will fire when things are bad, but NOT when things are OK?", I'm not completely confident. I took down the values from 22 Oct, 30 Dec, and non-alerting times, but I don't feel like I grasp everything involved
[16:22:52] No, I'm asking if you are confident that the alert will trigger on the new 0.03 threshold
[16:23:15] Are you sure that what you merged has been properly applied on our alerting infrastructure?
[16:27:14] It was merged, but I haven't checked to see if it was pushed out. Do those configs live on individual nodes, or on a prometheus server?
[16:29:35] I assume it exists on the prometheus server. I'm fairly confident that the deployment is working correctly and that the change has been applied, but I've never done such a change myself and I'm not super familiar with how it actually works.
[16:30:46] Let me poke around wikitech and I'll hit up folks in wikimedia-sre if I can't find the prom server
[16:31:28] looks like that repo is cloned via https://github.com/wikimedia/puppet/blob/production/modules/alerts/manifests/deploy/prometheus.pp and https://github.com/wikimedia/puppet/blob/production/modules/alerts/manifests/deploy/thanos.pp
[16:33:49] this ends up mapping to the prometheus::pop and thanos::frontend roles
[16:34:18] we could check one of those nodes to see if the change is there
[16:34:55] But since this is all going through a simple git::clone, there isn't much room for things to go wrong. I'll assume it's all working fine.
[16:35:21] I would like to check it, but is there an actual inventory somewhere? Reading https://wikitech.wikimedia.org/wiki/Prometheus ATM
[16:36:42] I don't see any obvious link in thanos.wm.o
[16:36:56] don't lose too much time on this. It is most probably all fine!
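A minimal sketch of such a check, assuming shell access to one of the Prometheus hosts; the host and file name are the ones identified a few messages later, and grepping for the new threshold is an assumption about the rule file's contents:

    # confirm the alert rules file was refreshed by the git::clone deploy
    ssh prometheus2003.codfw.wmnet
    stat /srv/alerts/team-search-platform_blazegraph.yaml       # modification date should be recent
    grep -n '0.03' /srv/alerts/team-search-platform_blazegraph.yaml   # the new threshold should show up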
[16:37:17] no worries, I am gaining some valuable context on things if nothing else
[16:40:24] if I tcpdump from one of the WDQS servers, it looks like prometheus2003.codfw.wmnet is scraping it for metrics
[16:41:02] prometheus2003.codfw.wmnet has a /srv/alerts/team-search-platform_blazegraph.yaml with a modified date of today, and it matches what I merged earlier
[16:44:43] looks good!
[17:05:11] quick workout, back in ~30
[17:42:47] back
[17:42:56] regarding inventory (list of machines that do things), i don't know that it's how most people do it, but usually i find the appropriate puppet class, trace that back to a role, then look in manifests/site.pp of the puppet repo and choose a random machine
[17:43:15] I do the same!
[17:43:52] depending on what you mean by inventory or what kind of question you're asking, netbox is a good place
[17:44:26] Thanks, adding both methods to my notes. Although it would be pretty cool if we had a service mesh ;)
[17:47:24] hmm, i ran disable-puppet on wcqs-beta but it still sent a puppet failure email :S
[17:47:50] ssh'ing in verifies puppet is disabled. I guess the failure email comes from elsewhere
[17:53:05] the alert probably comes from checking the latest puppet report :/
[17:53:10] dinner time, see you later!
[17:57:56] yeah, and even if the last run was successful it would complain if the last run was too old
[18:52:35] hmm, did a successful puppet run on wcqs-beta and then turned it back off. Not sure what's appropriate, the new puppet code (intentionally) can't deploy an un-authenticated wcqs instance. for now turned it back off and adjusted the nginx config back to un-auth'd
[18:54:39] we (wmcs) prefer to have puppet running, I guess it mostly depends on how long you imagine it's going to be that way
[18:55:09] it's being decom'd once the prod service is up, but setting up the prod service is what is making the beta site no longer work
[18:57:45] lunch/errand, back in ~45-60m
[18:59:07] * ebernhardson finds it oddly reassuring that moving more_like traffic from codfw->eqiad reduced cluster cpu by 10% in codfw and added 10% in eqiad.
[19:49:39] and back
[19:58:14] now my turn for lunch :)
[20:48:13] back
[21:42:47] Is this the complete list of plugins that have to be updated before we can move to ES 6.8.20? https://phabricator.wikimedia.org/T271777
[21:47:48] inflatador: hmm, there is probably another ticket, i'll look. For the source of truth i would query one of the clusters, such as https://search.svc.eqiad.wmnet:9243/_cat/plugins | sort
[21:49:46] inflatador: or i guess, we ship the plugins as a debian package, so https://github.com/wikimedia/operations-software-elasticsearch-plugins is another good source of info
[21:52:32] I suppose the plugins ticket is https://phabricator.wikimedia.org/T294499 but now i wonder if we are using "plugins" to mean the same thing :)
[21:57:21] ebernhardson: thanks, I know we will have to do the debian pkg thing too
[22:24:19] Still looking at the ES7 upgrade epic (https://phabricator.wikimedia.org/T263142), I'm seeing plugins for elasticsearch, and a mediawiki plugin related to ElasticSearch (CirrusSearch). I'm focusing on compatibility-testing plugins for ES itself as opposed to the MW plugins, let me know if I need to do something else
[22:26:13] inflatador: that should be correct, i think the confusion is cirrus is an extension (to mediawiki) and you're looking at the elasticsearch plugins. Checking through all the plugins in the plugin_urls.lst sounds right
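A minimal sketch of the query suggested above, hitting the live cluster's _cat API; the endpoint is the one given in the discussion, while the awk variant that collapses per-node rows down to distinct plugin names is an addition:

    # list installed plugins, one row per node+plugin
    curl -s https://search.svc.eqiad.wmnet:9243/_cat/plugins | sort
    # or just the distinct plugin names (second column of the _cat output)
    curl -s https://search.svc.eqiad.wmnet:9243/_cat/plugins | awk '{print $2}' | sort -u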
[22:26:35] inflatador: not to say calling cirrus a plugin would be wrong, plugin and extension are about the same :) But each context has a unique name i suppose
[22:27:36] The repo is called "mediawiki-extensions-CirrusSearch" so that's a fair point
[23:45:10] ebernhardson: re: your earlier comment about puppet classes/manifests. Where can I find them? There are only a few "site.pp"s in the operations/puppet repo
[23:58:59] inflatador: you want manifests/site.pp from the root of the repository, the ones deeper in are unrelated to this purpose
[23:59:09] inflatador: it's the one that is mostly filled with node declarations
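A minimal sketch of that lookup, assuming a local checkout of the puppet repo and the role(...) declaration style used inside site.pp's node blocks; the role name is taken from the earlier prometheus discussion:

    # clone the repo (the github mirror linked earlier)
    git clone https://github.com/wikimedia/puppet.git && cd puppet
    # show the node declarations around a given role
    grep -B 5 'role(prometheus::pop)' manifests/site.pp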