[08:06:42] tanny411: (you're probably already aware) but Erik has deployed your dag, it's currently backfilling
[08:13:42] dcausse, gehel: Yes, I saw, thanks :D. I am closing the ticket T273854. Wondering how to verify and when to close the ticket for the 90 days deletion (T283258). Should we wait ~78 days to verify the deletions, or close it for now and reopen if required?
[08:13:42] T273854: Automate regular WDQS query parsing and data-extraction - https://phabricator.wikimedia.org/T273854
[08:13:43] T283258: Provide a job regularly deleting wdqs processed query after 90 days - https://phabricator.wikimedia.org/T283258
[08:15:00] tanny411: I'd consider them done, but we generally let Guillaume close them once they are in the "Needs Reporting" column in https://phabricator.wikimedia.org/project/view/1227/
[08:15:42] if the cleanup does not actually work we'll open a new ticket, I think
[08:16:18] Ah, I was also going to ask what "Needs Reporting" means. Okay, I will let him handle that then :)
[09:51:12] lunch
[11:08:27] break
[13:22:28] zpapierski: when you get a chance: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/695295
[13:22:33] and https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/693129
[13:23:06] considering I'm currently browsing through phpcs errors, I'd welcome a distraction
[13:23:14] thanks!
[13:23:37] https://www.irccloud.com/pastebin/TG0cbNhg/
[13:23:56] does it mean this fixes both, or that both will not be fixed even after the patch :)D ?
[13:24:07] s/:)D/:D
[13:24:45] still not fixed, lemme check what was actually fixed
[13:24:47] here
[13:24:58] oh wow, I was a reviewer of this one for a long time, somehow missed that
[13:26:13] it fixes the parallelism and adds a way to only extract/debug the kafka offsets (which are not yet working)
[13:26:26] ok
[13:26:42] parallelism is forced to 1 to avoid creating a 12-partition csv file with only one entry
[13:33:21] a few small comments on this one https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/693129 - moving to the other
[13:34:36] thanks
[13:36:58] regarding this https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/695295 - how did you catch the issue?
[13:38:16] running the pipeline, weird error when checkpointing, like an NPE inside flink classes
[13:38:35] showing that internal states were not properly initialized
[13:38:45] can we actually test for that?
[13:40:32] hm.. don't know.. looking
[13:41:46] well, we removed the wrapper so there's nothing to test actually, other than testing that the HDFS file sink works
[13:43:38] other than a very difficult integration test
[13:43:42] ok, then
[13:44:03] done
[13:44:18] thanks!
[13:44:50] no worries, sorry for not doing this sooner
[13:45:03] np
[13:45:27] I'll try to improve; until I do, shout out
[13:48:25] I will!
[14:05:06] I messed up the reload on wcqs when restarting blazegraph to add a new federation endpoint
[14:05:43] looking at the reload script, it seems it's not a big deal, the download part was commented out
[14:06:47] we should probably release the rdf project and clean this up before next week's run
[14:08:43] ok, during which phase?
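For reference, the 90-day retention job tracked in T283258 at the top of the log could be a very small Airflow DAG. A minimal sketch, assuming Airflow 1.10-style imports; the DAG id, schedule, and HDFS path are hypothetical, not the deployed dag:

```python
# Minimal retention-DAG sketch for T283258. The dag_id, schedule, and
# HDFS path below are assumptions for illustration only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
        dag_id='drop_old_wdqs_processed_queries',  # hypothetical name
        start_date=datetime(2021, 6, 1),
        schedule_interval='@daily',
        catchup=False,
        default_args={'retries': 1, 'retry_delay': timedelta(minutes=30)},
) as dag:
    # Each day, remove the partition that has just crossed the 90-day
    # boundary; the date=YYYY-MM-DD partition layout is also an assumption.
    drop_partition = BashOperator(
        task_id='drop_90_day_old_partition',
        bash_command=(
            'hdfs dfs -rm -r -f '
            '/wmf/data/discovery/wdqs/processed_queries/'  # hypothetical path
            'date={{ macros.ds_add(ds, -90) }}'
        ),
    )
```

With `catchup=False` the dag only trims the current boundary partition, which matches the "wait ~78 days to verify" question above: nothing observable happens until data actually ages past 90 days.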
[14:09:39] ah, it shouldn't have been commented out, I did that last week
[14:09:47] it was importing and the curl failed
[14:09:56] I thought that curl would retry, but no
[14:10:30] but since it was reimporting the same dump, I guess it's no big deal that it failed
[14:10:53] it's a big deal that it is the same dump, but that's on me; I'm actually glad that you brought it up
[14:11:43] this machine not being deployed by scap like the other nodes is the issue
[14:11:49] nope
[14:12:23] it's all manual, I'm not sure why we decided it's ok that way, but it's been a huge PITA for me
[14:13:20] anyway, I'll take care of it - so the changes you did aren't in a release?
[14:14:42] I think the changes that are not deployed yet are the ones related to the blank nodes
[14:15:12] they're merged but they have to go through the full release process -> deploy-prepare.sh thing
[14:15:46] on WCQS? are you sure? Haven't we tested them recently?
[14:16:22] git status on wcqs shows modified: wcqs-data-reload.sh
[14:17:22] the change that's waiting for a release + deploy-prepare.sh is: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/698452
[14:17:50] true - I applied it manually
[14:18:02] ah, hence the confusion
[14:18:26] the change I deployed is unrelated and is in the deploy repo directly https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/699746
[14:18:43] I didn't want to do a deploy back then, because it was during an outage - I should've done that later
[14:20:01] in any case - the changes for the whitelist are there, and no deploys happened in the meantime
[14:21:09] so there's probably some code for rdf not yet in deploy
[14:21:24] (outside of the skolemization one)
[14:22:43] hm, I see that query-service-parent-0.3.74 was released
[14:22:48] hmm, weird - there are tags for 0.3.74
[14:22:52] yeah, same thing
[14:23:09] but it's not in the release repo - I thought the process did both at the same time?
[14:23:32] someone has to run deploy-prepare.sh, I think
[14:24:03] I think that part was missed
[14:24:08] so the version was released but deploy-prepare.sh wasn't executed
[14:27:14] I don't know why; anyway, it seems that 0.3.74 has all the changes, so I'll create a deploy for it
[14:27:25] yes
[14:28:20] ryankemper: I think we missed a couple wdqs deploys, looks like the deploy-prepare.sh part was missed
[14:29:01] the release part is very slow (45mins IIRC)
[14:29:48] https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/700924
[14:29:53] (that script is a godsend)
[14:30:47] hmm, am I blind or was the streaming updater producer jar omitted?
[14:31:08] the producer is new
[14:31:27] ah, ok
[14:31:27] it's now part of the distrib as we will deploy it from here
[14:31:34] I understand
[14:34:08] I'll just go ahead and merge that release, ok?
[14:35:18] zpapierski: yes, this repo is self-merge
[14:35:32] assuming you deploy right after
[14:35:38] I will
[14:35:51] ah, the wdqs one will take some time
[14:35:55] you need to V+2 and merge on this one
[14:35:57] (which I don't have rn)
[14:36:39] I can do it, np
[14:37:45] I don't need to merge yet
[14:38:01] but it would be helpful - so I'll deal with WCQS and you with WDQS?
[14:38:15] (I know I'm selecting the easy one :D)
[14:39:05] zpapierski: wdqs is the easy one, it's all automated
[14:39:14] true
[14:39:30] ok, it's ready then
[14:39:38] I'm off to the wcqs deployment
[14:39:40] always fun
[14:46:57] meh... I forgot that only the main blazegraph is restarted...
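The "version was tagged but deploy-prepare.sh was never run" gap above is mechanically detectable. A rough sketch of such a check, assuming both repos are checked out locally; the paths, the tag prefix, and the helper names are illustrative assumptions:

```python
# Rough sketch: warn when the newest query-service-parent tag in the rdf
# repo has no matching commit in the deploy repo. Checkout paths and the
# tag prefix are assumptions for illustration.
import subprocess

def latest_tag(repo, prefix='query-service-parent-'):
    """Return the newest tag starting with prefix, by version sort."""
    out = subprocess.run(
        ['git', '-C', repo, 'tag', '--list', prefix + '*', '--sort=-v:refname'],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return out[0] if out else None

def mentioned_in_deploy(repo, version):
    """True if any commit message in the deploy repo mentions the version."""
    out = subprocess.run(
        ['git', '-C', repo, 'log', '--oneline', '--grep', version],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(out.strip())

rdf_tag = latest_tag('/srv/rdf')                     # hypothetical checkout
version = rdf_tag.rsplit('-', 1)[-1]                 # e.g. '0.3.74'
if not mentioned_in_deploy('/srv/deploy', version):  # hypothetical checkout
    print(f'{rdf_tag} released but not in deploy repo; run deploy-prepare.sh')
```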
[14:47:01] ok, I've deployed changes to WCQS and relaunched the update process - since it was started today, we shouldn't conflict with the next one
[14:47:25] ah yes, I'm guessing you have your "restart everywhere" script somewhere :)
[14:47:40] (me too, probably)
[14:48:05] yep, I do
[14:48:21] though the hostlist is probably out of date
[14:48:31] it's been a long time, I don't know if I have an up to date list of wdqs machines :)
[14:48:55] that should be easy to remedy, we have groups defined in puppet
[14:49:47] sure
[14:51:59] you have it? I compiled mine
[14:53:50] yes, I think so
[14:56:52] should be 19 nodes total
[14:57:08] I concur with that statement
[14:59:29] wow, there seems to be a mass deletion happening on wikidata
[14:59:52] what's "mass" in numbers?
[15:00:15] the number of triples is going down, it's quite rare
[15:01:19] some bot gone haywire ;) ?
[15:04:40] addshore: are you aware of anything special happening? https://grafana-rw.wikimedia.org/d/000000170/wikidata-edits shows a negative size change as well
[15:04:57] O_o
[15:05:55] might just be legit edits, but it's the first time I've seen this
[15:09:11] yeah, looks extremely unusual
[15:13:17] we are looking at it a bit
[15:13:19] thanks for the ping
[16:00:55] \o
[16:03:59] Question from Seve regarding image recommendations:
[16:03:59] >In the upcoming Growth pilot for Image Suggestions, there's a need to have search indexed with Image Matching Algorithm (IMA) results so that Growth has a way to filter suggestions by topic. @mt @Carly Bogen (she/her) Can you all confirm whether this is available today or not? If not, I think we'll have to schedule a follow up on the feasibility of this within Q1
[16:04:28] my understanding is weighted tags should handle this, but wanted to confirm
[16:08:36] mpham: hard to say. I would hope weighted tags covers some of that use case, but it's a boolean yes/no kind of thing. I suspect IMA isn't exactly a boolean, so we have some impedance mismatch; how much it matters depends
[16:10:50] not sure if this documentation they provided helps with any more details: https://www.mediawiki.org/wiki/Wikimedia_Search_Platform/Decision_Records/Recommendation_Flags_in_Search
[16:12:21] what would the unsupported use case be if there is some impedance mismatch?
[16:13:04] mpham: the document is ok, the mismatch I
[16:13:26] the mismatch I'm concerned about is that I don't think "image should be recommended" is really a boolean. It depends on context that we can't take into account
[16:14:15] so mostly I suspect they aren't going to actually like the results, unless so few images are recommendable that we only pick one or two at a time
[16:14:41] but who knows, gotta find out :)
[16:16:26] ebernhardson: so if the IMA and/or topics are returning boolean values for weighted tags, we're good; if not, then it seems like we're not really sure what happens?
[16:18:03] mpham: in traditional search matching, if it doesn't match it doesn't get returned, so they get a small result set. My worry is that search will return any document with the appropriate words in it and the recommended flag, not because it's recommended for this particular search query / context / etc.
[16:19:10] is this different from what they do with "add link" tasks?
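To make the "boolean yes/no kind of thing" concrete before the discussion continues below: the filter Growth would get from weighted tags is essentially an exact term match on the `weighted_tags` field, with no notion of the user's query or context. A sketch using the elasticsearch Python client; the endpoint, index name, and the exact tag strings are assumptions about how the IMA and ORES tags would be written:

```python
# Sketch of the boolean weighted-tags filter being discussed. The endpoint,
# index name, and tag strings are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])  # hypothetical endpoint

query = {
    'query': {
        'bool': {
            'filter': [
                # "an image is recommended for this page" - yes/no, global,
                # independent of whoever is asking or why
                {'term': {'weighted_tags': 'recommendation.image/exists'}},
                # "page is about this ORES topic"
                {'term': {'weighted_tags':
                          'classification.ores.articletopic/Culture.Visual-arts'}},
            ]
        }
    }
}
result = es.search(index='enwiki_content', body=query)
print(result['hits']['total'])
```

Every page carrying both tags matches equally; that is exactly the impedance mismatch above, since "should this image be recommended" likely depends on context the filter cannot see.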
[16:22:44] dcausse: in principle it's the same, I suppose I'm thinking of them quite differently though, in that I suspect users will expect that the image was recommended for *their* query, rather than a global boolean flag
[16:24:32] I thought here that the topic is the topic of the page as detected by ores
[16:25:24] Trying to think why link recommendation seems different... perhaps I'm just not understanding the tasks involved, but a page needing new links seems like a property of the page, while an image being recommendable seems like it's conditioned on the search query, not a global flag
[16:25:52] dcausse: hmm, is it not going to use free-form text? In that case it would be different
[16:25:59] I understand it as: give me pages of topic XYZ for which an image is required
[16:26:11] but I might totally be wrong
[16:26:42] if it's free form then yes, we don't know how to do this
[16:26:53] If there is no arbitrary search query, and instead wide-area filtering (ores topics/etc), then it will work fine
[16:28:25] ebernhardson: (completely unrelated) we had to clean the airflow logs again; it's not clear to me if it's still a consequence of the airflow bug you found a couple weeks ago or simply that retention/loglevels need to be adjusted
[16:30:38] saw this that was added in 1.9: https://www.astronomer.io/guides/logging
[16:31:03] dcausse: I do believe topic here is the topic of the page as detected by ORES
[16:32:47] dcausse: hmm, I haven't dug in too deep (the airflow logging directory structure is inconvenient), but back of the napkin math says it's expected-ish. We get ~25M of log files per dag per day. 25M * 90 days gives ~2.2GB of logs per dag
[16:33:16] or maybe our retention is 60 days, but same problem. We only have ~7GB available for logs on disk
[16:33:38] the scheduler is producing 250M per day
[16:33:56] maybe we need a way to turn down the volume and ship them to logstash (last I looked firewalls were in the way, but maybe not anymore)
[16:34:43] would make sense indeed
[16:35:54] I suppose the other direction is that nothing here is compressed; it ought to be possible to compress logs older than 1 day or some such
[16:36:22] a simple gzip turns a 250M directory into 13M
[16:36:24] mpham: if it's ores then it looks very similar to add link; note that the ores topic models are available everywhere (see https://ores-support-checklist.toolforge.org/)
[16:37:05] I wonder as well where it gets the task logs when you request them from the UI
[16:37:31] those come directly from disk, would have to verify that supports compressed output
[16:37:43] ok
[16:38:31] other options I see online: people register bash scripts with the airflow scheduler to clean itself up :)
[16:38:38] (basically just delete logs)
[16:39:12] I guess it's better than a cronjob :)
[16:39:45] mpham: sorry, I meant: the ores topic models are *not* available everywhere
[16:40:58] dcausse: thanks. I'm trying to confirm some details on their side now
[16:45:15] surprised to find so little information on compressing airflow logs. I see people recommending shipping them to S3 (or equiv) instead of keeping them on disk, but they *still* don't compress them. Seems crazy to me that people pay per GB to store data and don't try to compress it 20x
[16:53:55] the other thing I've seen, which might be hard to tune, is that some logs are not particularly useful, like this from spark_submit_hook.py: Identified spark driver id: application_1623774792907_28373
[16:55:35] going offline
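The compress-after-a-day, delete-after-retention idea from the discussion above fits in a few lines of Python. A sketch, where the log root and the retention numbers are assumptions:

```python
# Sketch of the compress-then-delete idea discussed above. The log root
# and the 1-day / 60-day cutoffs are assumptions for illustration.
import gzip
import os
import shutil
import time

LOG_ROOT = '/var/log/airflow'   # hypothetical location
COMPRESS_AFTER = 1 * 86400      # gzip logs older than a day
DELETE_AFTER = 60 * 86400       # drop logs past the retention window

now = time.time()
for dirpath, _, filenames in os.walk(LOG_ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        age = now - os.path.getmtime(path)
        if age > DELETE_AFTER:
            os.remove(path)
        elif age > COMPRESS_AFTER and name.endswith('.log'):
            # ~20x smaller, per the 250M -> 13M observation above
            with open(path, 'rb') as src, gzip.open(path + '.gz', 'wb') as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)
```

As noted above, the webserver serves task logs straight from disk, so whether it can render the `.gz` files would need checking before compressing anything it still reads.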
[16:55:45] lol, yea the spark logs are some of the worst. Almost always useless unless the task fails completely, in which case it's the only hint :)
[16:55:48] take care!
[17:12:49] ebernhardson, dcausse: looks like the image rec use case is the same as AddLink: Screen Shot 2021-06-22 at 12.11.40 https://usercontent.irccloud-cdn.com/file/KUb20GkR/image.png
[18:44:05] hi, I just wanted to make sure the instructions at https://wikitech.wikimedia.org/wiki/Switch_Datacenter#ElasticSearch are still up to date
[18:44:36] that we still need to hardcode the more_like queries to eqiad, and then 24h after switching, remove said hardcoding so they go to codfw
[18:45:11] legoktm: hmm, I'll have to look. The problem has to do with an empty cache in codfw
[18:46:03] thanks
[18:46:43] if it still does require manual intervention then I'll file a task asking for some way to automate it (not necessarily for this switchover, but for the future), whether it be having a cookbook to warm up the cache, replicate it, etc.
[18:47:45] legoktm: hmm, we are using MainWANObjectCache. In theory that suggests codfw should get the same cache, but I'm not wholly familiar with how that's implemented
[18:52:58] * legoktm hops channels
[18:55:27] ebernhardson: so the WAN cache does not replicate to codfw
[18:55:39] so it'll need some kind of cache warming
[18:58:18] legoktm: yup, I just found that in the docs as well. Since there is no replication, indeed we need some system in place. A generic warming setup would be to duplicate requests that match a particular pattern to the other dc and send the output to /dev/null
[19:10:22] ok, so I filed https://phabricator.wikimedia.org/T285347 "Improve automation of CirrusSearch caches during database switchover"
[19:11:29] and then based on what Krinkle said, I filed https://phabricator.wikimedia.org/T285346 "CirrusSearch WAN caching should use getWithSet() instead of manual get()/set()"
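A warmer along the lines described above (replay more_like traffic at the passive DC and discard the output) could look something like this. A sketch only: the endpoint, Host header trick, and sample titles are assumptions, not the actual switchover tooling:

```python
# Sketch of replaying morelike searches against the passive datacenter to
# warm CirrusSearch's WAN cache before a switchover. The endpoint, Host
# header, and sample titles are assumptions for illustration.
import requests

CODFW_ENDPOINT = 'https://mw-api.codfw.example'  # hypothetical internal entry point

def warm(titles):
    for title in titles:
        resp = requests.get(
            CODFW_ENDPOINT + '/w/api.php',
            params={
                'action': 'query',
                'list': 'search',
                'srsearch': 'morelike:' + title,
                'format': 'json',
            },
            headers={'Host': 'en.wikipedia.org'},  # route to the right wiki
            timeout=10,
        )
        # the body is discarded; the point is only to populate the cache
        resp.raise_for_status()

warm(['Photosynthesis', 'Elasticsearch'])  # in practice: sampled live queries
```

In practice the titles would come from sampled production more_like traffic rather than a hand-picked list, since the cache only helps for queries that actually recur.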