[07:14:50] o/
[07:23:43] o/
[07:24:22] just reset a sensor that was failing with "ExternalTaskSensor.execute cannot be called outside TaskInstance!"
[07:24:29] first time I see this issue
[07:24:42] possibly related https://github.com/apache/airflow/issues/41470
[08:11:38] Good morning!
[08:12:04] It looks like those categories endpoints were not happy to be left alone for a week!
[08:12:21] dcausse: do you already know what it is?
[08:12:48] gehel: no
[08:13:00] I don't think this is an emergency, so maybe we can wait for inflatador to be around later today.
[08:13:11] sure
[08:13:24] Or maybe brouberol wants to learn more about WDQS?
[08:44:35] I'm catching up a bit, I can try to have a look today, but I don't know much/anything about WDQS
[08:48:15] brouberol: happy to pair on this whenever you want
[08:48:51] quickly looking at https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&var-graph_type=%289103%7C9194%29 I see some data being added, so it could well be some issue with the query that monitors the lag
[09:22:04] I created T385972 and T385971 to capture follow-up work in the MLR spike
[09:22:04] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[09:22:05] T385971: Investigate abandoned queries and identify eventual model improvements - https://phabricator.wikimedia.org/T385971
[09:22:59] Trey314159 FYI: this is something we discussed at the offsite (Thursday morning session). Happy to touch base later at sync!
[09:24:20] gmodena: thanks! you might want to link T375554 from T385971, I think
[09:24:21] T375554: Classify fulltext search abandonment: English, French, Spanish - https://phabricator.wikimedia.org/T375554
[09:26:01] dcausse thanks for the pointers!
[09:26:30] my phabricator search-foo failed me again :)
[09:37:37] to be fair, it took me some time to find that one again :)
[12:06:42] lunch
[14:15:29] Oh no, what happened with categories?
[14:17:11] o/
[14:19:15] ah, I see, the categories lag alerts
[14:20:54] I think those are false positives, but will verify
[14:27:13] o/
[14:36:07] gehel: do you know how users like project_${project_id}_bot_${some_hash} get added to gitlab project members? (e.g. https://gitlab.wikimedia.org/repos/search-platform/opensearch-learning-to-rank-base/-/project_members)
[14:36:30] trying to understand why I'm getting a failure at https://gitlab.wikimedia.org/repos/search-platform/opensearch-analysis-hebrew/-/jobs/441406#L3
[14:37:02] I have mostly no idea. Could this be a user automatically created for each build job?
[14:37:41] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/query_service/files/monitor/prometheus-blazegraph-exporter.py#218 this is the query we run to determine categories lag, right?
[14:38:55] inflatador: yes, I think so
[14:47:41] OK, I've been getting what looks like an empty result, but I might be formatting it wrong? LMK what you think dcausse https://etherpad.wikimedia.org/p/categories-lag
[14:53:19] inflatador: weird, you should get at least the number of triples
[14:54:00] dcausse I think the problem is with my curl cmd, let me try again
[14:54:06] inflatador: you should query http://localhost/bigdata/namespace/categories/sparql, not blazegraph directly
[14:54:28] to let nginx do the namespace mapping
[14:55:34] I get 2025-02-01T20:10:13Z on wdqs2020 doing so
[14:56:18] so yes, something's off in the database; it does not seem to be the query nor a monitoring issue :/
[14:58:20] Hmm, that I can't explain.
I did set up the main hosts with categories on Friday, but didn't touch any of the full graph hosts
[15:02:03] I'm sure I can reload, but I wonder if I messed something up on Friday, if this is a coincidence, or what
[15:02:35] inflatador: something certainly broke, but unsure when
[15:06:30] Feb 10 07:54:28 wdqs1020 loadCategoriesDaily.sh[787513]: 2025-02-10T07:54:28+00:00 categories daily load done
[15:06:42] on wdqs1020
[15:07:14] hm.. perhaps that lag check sparql query is new
[15:09:47] inflatador: ah, should be "max" instead of "min" in the sparql query
[15:10:05] it reports 2025-02-10T05:00:01Z with max
[15:15:41] or perhaps not... using min was already what was used in the icinga check...
[15:17:23] possible that some wikis failed to import...
[15:17:46] zh.wikipedia.org last import is at 2025-02-02T00:22:12Z
[15:20:54] "processing zhwiki" -> totalElapsed=1ms, commitTime=1739001210750, mutationCount=0
[15:20:57] 0 mutations
[15:22:06] "curl https://dumps.wikimedia.org/other/categoriesrdf/daily/20250210/zhwiki-20250210-daily.sparql.gz | zcat" is empty
[15:22:18] mediawiki exported nothing...
[15:22:21] fun...
[15:25:56] dcausse ah, nice. I'm looking at those scripts for the categories migration anyway, sounds like an opportunity for sanity checking
[15:28:02] inflatador: sure, why not, but here I think the problem is on the mediawiki side of things
[15:31:06] need to figure out which machine is running these dumps
[15:35:43] seems to be snapshot1016
[15:37:21] Wikimedia\Rdbms\DBQueryError from line 1230 of /srv/mediawiki/php-1.44.0-wmf.15/includes/libs/rdbms/database/Database.php: Error 1176: Key 'rc_new_name_timestamp' doesn't exist in table 'recentchanges'
[15:46:53] could be related to dumps 1.0 using analytics replicas?
[16:02:11] dcausse: we're in https://meet.google.com/eki-rafx-cxi
[16:02:18] oops
[16:57:08] I would like to pick up T385972 next, mostly as a way to onboard on cirrus.
I spent some time looking at the A/B test patches today and got familiar with cirrus's config.
[16:57:08] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[16:58:31] gmodena: sure, perhaps as a first step you could activate the MLR models that got A/B tested last December?
[16:59:14] dcausse yep. Sounds good.
[16:59:58] my plan of attack was to enable the mlr-2024 experiment (if I got the tag right) and a/b test it against newly built models (mlr-2025?)
[16:59:59] gmodena: hopefully they're still referenced in the mw-config (note that we don't want to enable all the new models, except jawiki for now)
[17:00:24] in terms of regressions, we should be able to track perf from grafana's search dashboard, right?
[17:00:34] gmodena: sounds perfect
[17:00:39] dcausse ack on jawiki
[17:00:58] gmodena: yes, if they perform badly, hopefully it's visible on the dashboard
[17:01:46] gmodena: the sole minor difficulty might be that zh (and ko?) do not have mlr enabled yet, so for these two it's slightly more complicated than just switching the model name
[17:02:44] but everything should happen in the mw-config
[17:03:58] dcausse got it. Thanks for the heads up! I'll time-box some time to figure out the config tomorrow morning, but will send a bat signal if I get stuck. Ok?
[17:05:35] gmodena: sure, anytime!
[17:07:04] dcausse thanks!
[17:17:23] small CR to enable metrics collection for wdqs-categories if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1118162
[17:41:40] errand
[18:09:34] I think I lost write access to https://wikitech.wikimedia.org/wiki/Search and below pages due to the SUL migration ( https://wikitech.wikimedia.org/wiki/News/2024_Migrating_Wikitech_Account_to_SUL ). Asking in #cloud...
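[Editor's note: the min-vs-max lag discussion earlier in the log can be illustrated with a short sketch. This is a minimal Python illustration, not the actual prometheus-blazegraph-exporter code; the function name, result shape, and field names are assumptions. It shows why "min" makes a single stale wiki (zhwiki, stuck at 2025-02-02) drive the reported lag, while "max" would hide it.]

```python
from datetime import datetime, timezone

def categories_lag_seconds(bindings, now=None):
    """Given SPARQL JSON result bindings carrying one ISO-8601
    'dateModified' timestamp per wiki, return the lag of the
    most out-of-date wiki (min over timestamps = max lag)."""
    now = now or datetime.now(timezone.utc)
    stamps = [
        datetime.fromisoformat(b["dateModified"]["value"].replace("Z", "+00:00"))
        for b in bindings
    ]
    oldest = min(stamps)  # min: a single stale wiki trips the alert
    return (now - oldest).total_seconds()

# One wiki stuck at 2025-02-02 (the zhwiki case) and one current wiki:
bindings = [
    {"dateModified": {"value": "2025-02-02T00:22:12Z"}},
    {"dateModified": {"value": "2025-02-10T05:00:01Z"}},
]
now = datetime(2025, 2, 10, 15, 0, tzinfo=timezone.utc)
print(int(categories_lag_seconds(bindings, now)))  # → 743868 (~8.6 days)
```

Switching the aggregate to max (most recently imported wiki) would have reported only ~10 hours here, which is why the min in the icinga-era check was kept.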
[18:12:20] inflatador: there's no per-user ACL; perhaps try to log out/log in, or, if this page is protected, you might need more than X edits from your account to edit it
[18:13:03] I don't think it's protected tho
[18:18:16] dcausse ACK, it was some session goofiness... I just needed to remove all the cookies and re-login
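[Editor's note: the empty-dump symptom from earlier in the log ("curl ... zhwiki-20250210-daily.sparql.gz | zcat" producing nothing, hence mutationCount=0 on import) is easy to check programmatically. A minimal sketch, assuming the daily dumps are plain gzip files; the function name is hypothetical and this is not part of the loadCategoriesDaily.sh tooling.]

```python
import gzip
import io

def dump_is_empty(gz_bytes):
    """Return True if a gzipped .sparql.gz dump decompresses to
    nothing, i.e. mediawiki exported an empty file and the importer
    would record 0 mutations."""
    with gzip.open(io.BytesIO(gz_bytes)) as f:
        return f.read().strip() == b""

# Simulate an empty export next to a normal one:
empty = gzip.compress(b"")
nonempty = gzip.compress(b"INSERT DATA { ... } ;\n")
print(dump_is_empty(empty), dump_is_empty(nonempty))  # → True False
```

A check like this in the daily load script would turn the silent "0 mutations" into an explicit alert before the lag monitor fires a week later.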