[07:57:54] o/ dcausse: do you have time to continue working on the ES sink? Otherwise I could pick up and build a release.
[08:56:53] pfischer: yes please, thanks for the offer! :)
[08:57:44] I mean "no, I don't have much time" so please go ahead :)
[08:58:05] dcausse: Sure, no worries. 🙂
[08:59:59] I pinged Luca, so hopefully we can move https://phabricator.wikimedia.org/T351503. Balthazar already offered his support in setting up the topic once we get the green light from ops
[09:15:17] nice! :)
[09:15:29] sadly the topic already exists :/
[10:36:46] errand + lunch
[14:45:51] quick errand
[15:39:22] hello :)
[15:39:54] while browsing the production logs for MediaWiki, I found out that rolling 1.42.0-wmf.7 causes an insane amount of logging originating from CirrusSearch
[15:40:01] e.g. Pool key 'CirrusSearch-Search:_elasticsearch_enwiki' (CirrusSearch-Search): ⧼poolcounter-connection-error
[15:40:17] I have filed it as an unbreak-now at https://phabricator.wikimedia.org/T352444 because I don't know the impact
[15:40:39] I don't know why `⧼poolcounter-connection-error⧽` uses those special angle brackets
[15:40:50] and maybe it is a placeholder for the actual underlying error
[15:41:12] dcausse ^
[15:41:18] hashar: looking
[15:41:49] that's all over the place...
[15:41:49] the screenshot shows the messages appeared with group 1 wikis, and of course the rate of messages doubled this morning as I promoted the rest of the wikis
[15:41:58] so I guess it is log spam originating from some code in CirrusSearch
[15:46:13] and poolcounter shows a sharp drop since I rolled the train this morning ( https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&from=now-24h&to=now )
[15:46:22] but then I don't know what is using poolcounter
[15:47:20] hashar: the angle brackets probably indicate a missing i18n message
[15:47:28] The rest I can't help with
[15:48:13] Which does make sense
[15:48:15] https://github.com/search?q=repo%3Awikimedia%2Fmediawiki%20poolcounter-connection-error&type=code
[15:48:24] don't see anything new in cirrus, wondering if it's the poolcounter failing to load its config and falling back to very low values
[15:48:33] Has no en.json result
[15:48:39] yeah, I am just surprised it is not using the good old < and >
[15:49:01] hashar: no idea on the specific type of brackets
[15:49:11] That should definitely be filed as a task though
[15:49:20] Because an actual error is being hidden
[15:49:29] dcausse: it looks like we have lost poolcounter, yeah
[15:50:05] so that might not be just Search? And Search is overloading the logs because Search is the largest user of poolcounter?
[15:51:55] that is my guess, yes
[15:52:35] or something else entirely
[15:54:38] https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&from=now-2d&to=now&var-dc=codfw%20prometheus%2Fops
[15:54:56] that is the 2-day view of locks held/released and requests processed
[15:55:04] which goes down when promoting group 1
[15:55:13] and falls to almost nothing once promoting all wikis
[16:02:28] \o
[16:04:25] Amir1: I see you doing PoolCounter stuff
[16:04:27] See here
[16:06:00] You know I'm already pinged on this?
[16:06:19] so we ended up continuing in the _security channel
[16:06:44] and most probably the issue is in mediawiki/core, and the patch is being rolled back https://gerrit.wikimedia.org/r/q/4344b2fb80727daa44eb461e316421ac803d8df1
[16:06:45] :)
[16:07:13] Amir1: no, I can't see the conversation anywhere else
[16:07:37] then please assume I'm aware ^_^
[16:11:22] o/
[16:24:02] surprised that searches were still working with the poolcounter broken...
[16:24:10] oh! that seems risky
[16:24:39] seems like we fall back to "no poolcounter" in this case?
[16:24:58] that's scary 😅
[16:25:14] yes :/
[16:34:29] okay the patch is deployed dcausse
[16:34:37] Amir1: thanks! :)
[16:35:19] this is because of https://gerrit.wikimedia.org/g/mediawiki/core/+/e49f585abaad23fa0b2d78c3d1495fbe647a61c6/includes/poolcounter/PoolCounterWork.php#154 I think
[16:36:07] wondering if we should prefer failing the work in that case (for search at least)
[16:37:11] this problem remained unnoticed for 24+ hours
[16:37:15] hard to say, my main data point is that when we first deployed the codfw cluster i ran a load test that didn't use any connection limiting and the whole cluster fell over
[16:37:31] but that was several versions ago, smaller cluster, who knows
[16:40:47] the issue got solved by reverting the faulty patch in mediawiki/core
[16:40:49] yes, not sure what's appropriate: take the risk and let the system respond, or fail everything so that it gets fixed quickly...
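The fail-open behaviour being debated here (PoolCounterWork.php runs the work anyway when the PoolCounter service is unreachable) is easiest to see as code. A minimal sketch of the trade-off, in hypothetical Java rather than MediaWiki's actual PHP — PoolCounterClient, GuardedWork, and every other name here are invented for illustration:

```java
import java.util.function.Supplier;

/** Invented stand-in for a remote lock/pool service client. */
interface PoolCounterClient {
    /** @return true if a slot was acquired, false if the pool is full */
    boolean acquire(String key) throws PoolCounterUnavailableException;
    void release(String key);
}

class PoolCounterUnavailableException extends Exception {}

class GuardedWork {
    private final PoolCounterClient client;
    private final boolean failOpen; // true mirrors the behaviour observed above

    GuardedWork(PoolCounterClient client, boolean failOpen) {
        this.client = client;
        this.failOpen = failOpen;
    }

    <T> T execute(String key, Supplier<T> work, Supplier<T> onRefused) {
        boolean acquired;
        try {
            acquired = client.acquire(key);
        } catch (PoolCounterUnavailableException serviceDown) {
            // Service unreachable. Fail open = run the work unthrottled,
            // which keeps the feature up but hides the outage; fail closed
            // = refuse, which surfaces the problem immediately.
            return failOpen ? work.get() : onRefused.get();
        }
        if (!acquired) {
            return onRefused.get(); // pool genuinely full: always refuse
        }
        try {
            return work.get();
        } finally {
            client.release(key);
        }
    }
}
```

Fail open keeps search serving when the lock service dies, at the cost of the outage staying invisible (24+ hours in this case); fail closed would have made it obvious immediately, but would have taken search down along with the lock service.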
[16:41:20] hashar: \o/, thanks for taking care of this!
[16:41:50] Amir1 pushed all the buttons! :]
[16:42:31] I like pushing shiny buttons
[16:45:04] ebernhardson: I do have a setup working with an ES bulk processor that has access to the sink’s init context/metrics, but I can’t get the artefact into archiva (401).
[16:45:37] pfischer: hmm, i haven't seen a 401 from archiva, that's odd
[16:45:54] pfischer: do you have an archiva password?
[16:47:04] ebernhardson: I do have credentials (LDAP), I can log in at the UI https://archiva.wikimedia.org/
[16:47:09] i suppose i have this configured, it's probably what prevents the 401: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Archiva#Deploy_to_Archiva
[16:47:37] and we might need to check your ldap groups, this says you need archiva-deployers
[16:48:11] Same here, I’ve got my encrypted password associated with archive.snapshots… yes, is that something I can do myself (check)?
[16:48:33] i'm not sure, i suspect ryan or brian can do the ldap bits. I'm trying to see if i can even query ldap
[16:50:03] According to the lookup (via mwmaint) I’m only in ou=people
[16:50:36] yea, seeing the same, I suspect we need https://wikitech.wikimedia.org/wiki/SRE/LDAP#Add_a_user_to_a_group
[16:51:03] not clear if that requires doing a full access request or if it can just be done, i would vote for just doing it (by an sre with sudo)
[16:51:18] inflatador: ^
[16:51:50] ebernhardson ACK, will take a look once I'm out of mtgs
[16:51:55] thanks
[16:53:51] ebernhardson: in the meantime: which metrics would you be interested in? noops total | per batch | batch size | failures (per batch and update type?)
[16:54:39] pfischer: i think the main thing would be the noop rate broken out by source, plus a rate of successful and failed writes
[16:54:47] i don't know if per-batch is too meaningful, i think it can just be overall
[16:55:10] maybe batch size would be interesting, to get more insight into what it's doing
[16:55:38] Yeah, that’s what I thought as well. Alright, I’ll set them up.
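A rough sketch of how those counters could hang off the sink's init context. The MetricGroup calls (counter(), addGroup(key, value)) are Flink's real metrics API, but the class, metric names, and per-source layout below are assumptions, not the actual patch:

```java
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.MetricGroup;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative only: tracks the metrics discussed above for the bulk
 * results an Elasticsearch sink writer gets back. The MetricGroup would
 * come from the writer's init context at open time.
 */
class BulkUpdateMetrics {
    private final MetricGroup group;
    private final Counter successful;
    private final Counter failed;
    // lazily created: one noop counter per originating event source
    private final Map<String, Counter> noopsBySource = new ConcurrentHashMap<>();

    BulkUpdateMetrics(MetricGroup sinkMetricGroup) {
        this.group = sinkMetricGroup;
        this.successful = sinkMetricGroup.counter("updatesSuccessful");
        this.failed = sinkMetricGroup.counter("updatesFailed");
    }

    /** Called once per item in a bulk response. */
    void recordItem(String source, String result, boolean errored) {
        if (errored) {
            failed.inc();
        } else if ("noop".equals(result)) {
            // Elasticsearch reports "created"/"updated"/"noop" per bulk item
            noopsBySource
                .computeIfAbsent(source,
                    s -> group.addGroup("source", s).counter("noops"))
                .inc();
        } else {
            successful.inc();
        }
    }
}
```

Using addGroup("source", s) rather than mangling the source into the metric name should keep the breakdown as a proper label once the metrics reach prometheus/grafana.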
[16:56:10] the version of that in Cirrus ends up here, but it isn't broken out, it's only top level: https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&viewPanel=43
[16:56:23] it's just created/noop/updated per second
[16:56:30] which probably maps directly to the elasticsearch statuses
[16:58:29] pfischer: oh, it looks like we also record per-index stats, but only for specified indexes. The cirrus impl for reference: https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/DataSender.php#L408
[16:59:13] or maybe that's all indices, mhm
[16:59:54] Trey314159: might be 1-2m late to our mtg
[17:00:11] inflatador: no worries
[17:00:44] huh, i guess i didn't remember that, but indeed in grafana we have per-index stats on noop/created/etc. I'm not sure if we ever used that
[17:01:35] I look at it once in a while but I'm not sure it's particularly useful
[17:07:00] oops, I missed the restbase meeting, ebernhardson did you attend?
[17:07:16] dcausse: yup, went over the restbase / related articles question
[17:07:25] cool
[17:07:56] dcausse: the summary is they need more data from the apps teams about perf and migration concerns; it's plausible this will move to the action api with a generator to provide summaries, but not 100% certain yet
[17:08:52] ok
[17:11:13] AFK, back in 60’
[17:24:55] i don't know if it's strictly required, but it seems it would be annoying if eventbus keeps trying to recreate the topic by sending new events, so i've added a patch to the deploy window later today to disable the event bus bridge on testwiki
[17:35:40] lunch, back in ~1h
[18:04:10] dinner
[18:30:04] back
[19:10:30] CR to get rid of the broken wdqs ldf check, if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/979149/
[19:10:59] I'm pretty sure I know how to fix it, but let's get rid of the bad one in the short term
[19:11:48] looking
[19:12:09] ebernhardson no worries, already got a +1
[19:12:30] kk
[19:15:19] If you have ideas on the best place to put that ldf endpoint check, LMK, digging thru our manifests now
[19:21:34] not sure, i suppose in theory what i would like to see is something that hits the full pipeline the same way a user would, from outside our network
[19:21:55] no clue if that's a general thing though :)
[19:25:38] Yeah, was wondering that myself... current blackbox checks for query.wikidata use the public URL, but actually poll from our own prometheus servers
[19:26:08] Not completely "blackbox," but good enough I guess
[19:28:58] I guess we can use a flag in modules/profile/manifests/query_service/common.pp like we do for NFS dumps... then flip on the flag for the current ldf endpoint host
[19:33:02] inflatador: there is already an enable_ldf flag, which purports to make it into the nginx template. Can that flag be reused?
[19:33:59] hmm, no, i don't think that's specific enough
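For reference, the outside-in check ebernhardson describes could be as small as this — a sketch using Java's built-in HttpClient, where the LDF URL, triple-pattern parameter, and Accept header are guesses at the endpoint's interface rather than a tested probe:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class LdfProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        // Query the public endpoint the same way a user would; the URL and
        // the subject parameter are assumptions about the LDF interface.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://query.wikidata.org/bigdata/ldf"
                        + "?subject=http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ42"))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "text/turtle")
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // healthy = the full pipeline answered with a non-empty RDF body
        boolean healthy = response.statusCode() == 200 && !response.body().isBlank();
        System.out.println(healthy ? "OK" : "CRITICAL: HTTP " + response.statusCode());
        System.exit(healthy ? 0 : 2);
    }
}
```

The point is only that the request enters through the same public edge a user's would, instead of polling the backend from inside the network as the current blackbox checks do.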
[20:18:21] huh, i can't seem to add notes to dagruns in airflow anymore, it gets a 500: psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer: "None"
[22:07:20] pfischer back
[22:08:29] pfischer ebernhardson can you confirm you need the "archiva-deployers" group? Working on it now
[22:10:49] nm, confirmed
[22:10:55] inflatador: that should be the one, it matches the wikitech page on archiva, and `ldapsearch -x cn=archiva-deployers` shows a set of users that looks about right
[22:12:59] ebernhardson ACK, added pfischer. Hit me up if y'all are having issues
[22:15:12] inflatador: should that change be visible when performing ldapsearch -x uid=pfischer?
[22:16:23] I still get a 401 from archiva, and the ldapsearch only shows ou=people. Does it take some time?
[22:17:54] pfischer: `ldapsearch -x uid` won't show your groups. I see you when I run `ldapsearch -x cn=archiva-deployers`, but I do think it might take time
[22:31:57] ebernhardson: are there any restrictions on which maven artefact coordinates I’m allowed to deploy, for example, only org.wikimedia.* but not org.apache.flink.*?
[23:03:08] Hm, I still get a 401; running with -X I see that it picks up the right username (configured in settings.xml)
[23:06:13] maybe there are other groups needed as well?
[23:06:13] pfischer: hmm, not sure
[23:06:30] i think i've only released org.wikimedia artifacts before, checking archiva config
[23:08:39] i don't see anything relevant there, and it's not actually clear how all those repositories are defined
[23:09:05] * ebernhardson should know, since he added a new repo there years ago :P
[23:15:25] hmm, i suspect there must be another level of auth. back then it was the single archiva-deploy user, and iirc there was more access available in the web ui. the wikitech page confirms there should be an administrative web ui, but that we could see the config the web ui mutates in /var/lib/archiva/conf/archiva.xml
[23:16:37] i think that would be archiva1002.wikimedia.org, but i can't peek at that server
[23:30:21] i suppose from the other direction though, you can see in the browser that the wikimedia release repository has stuff from many different namespaces. Seems most likely something else is the problem