[10:55:19] lunch
[11:10:35] lunch 2
[13:55:18] o/
[14:10:26] gehel: if you have a couple minutes: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/887300
[14:10:32] looking
[14:11:02] inflatador: this is implementing the workaround on the ticket you found yesterday ^
[14:11:04] switch upgrade happening in CODFW now, you can follow along in sre or operations if interested
[14:11:20] inflatador: thanks for the heads up!
[14:11:33] dcausse 👀
[14:12:09] tested it and it salvaged the savepoints that refused to work yesterday
[14:12:30] making new ones so that we can resume where we were
[14:13:04] Oh cool, I guess we can pick up where we left off once this is merged
[14:13:47] I tested with a snapshot and created new savepoints that we'll be able to use even with the old version I think
[14:15:21] inflatador: and this is the corresponding patch in the deploy repo to make that option usable: https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/887318
[14:21:32] dcausse ACK, left a comment for you
[14:21:41] thanks! looking
[14:21:55] entirely possible I missed something
[14:23:39] np let's discuss this during our meeting
[14:48:39] inflatador / ryankemper: about the upcoming DC switchover. Do we plan to move all the WDQS traffic to codfw only? Do we now have enough capacity?
[15:00:50] dcausse will be 1-2m late
[15:00:55] np
[15:03:37] OK, I'm here
[16:03:38] \o
[16:06:43] o/
[16:23:26] realized while porting mjolnir, we had the rules/tests about specific pools based on resource usage but they never applied to mjolnir since it was using the hook directly and not the SparkSubmitOperator (it now uses the operator)
[16:24:18] but i realize i'm not sure what's appropriate :) We have the rule that jobs with mem >= 450 and < 800G should not be in the default pool, but it seems we didn't actually have anything in that range since we only have the default and sequential pools.
[16:26:15] that's a nogo zone :)
[16:27:05] we can say whatever we want I guess? e.g. 2 jobs max?
[16:27:41] yea we can say whatever, we could define a specific pool for mjolnir and give it 2 jobs perhaps. Although we do put the hyperparam job into the sequential pool as it tends to scale large for enwiki
[16:28:29] the mjolnir tasks, particularly the ones that read in the feature matrix, do an auto-sizing where they read the matrix dimensions and allocate some number of bytes per location in the matrix (20-30)
[16:34:12] the MjolnirOperator will disappear?
[16:35:24] yes, replaced with an AutoSizeSparkSubmitOperator that extends from SparkSubmitOperator. We still need to run code at execution time to do the auto-sizing, but it's more limited and generic now
[16:35:45] somewhat generic, the auto-sizing expects to read a json file containing metadata about a matrix
[16:36:14] otherwise we would have to do some wonkiness to re-create the environment handling
[16:36:16] @team: Value Refresh workshop signup sheet, please have a look and register!: https://docs.google.com/spreadsheets/d/1YYPeHOG3uaLRrzYU6HFF_VVwRIhSqne1-trxOSvNHbA/edit#gid=1085172580
[16:37:15] i've got that all mostly written, and have mjolnir imported to gitlab along with a release published. Goal today is to get an airflow dev instance running and actually run the mjolnir dag in yarn
[16:37:38] ok
[16:55:21] ebernhardson for the airflow v2 stuff, LMK if you're able to use https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-search as the instance-specific scap repo
[16:57:15] inflatador: sure, want me to set up the contents?
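A minimal sketch of the auto-sizing idea discussed above at 16:28-16:35, assuming a JSON metadata file with `rows`/`cols` keys; the function name, metadata layout, and constants are hypothetical, not the actual mjolnir code:

```python
import json
import math

# Heuristic from the discussion above: each location in the feature matrix
# costs roughly 20-30 bytes; 25 is a midpoint, not the actual mjolnir value.
BYTES_PER_CELL = 25

def estimate_executor_memory(metadata_path: str, overhead: float = 1.5) -> str:
    """Size Spark executor memory from a matrix-metadata JSON file.

    The ``rows``/``cols`` layout of the metadata file and this function's
    name are hypothetical; they only illustrate the auto-sizing idea.
    """
    with open(metadata_path) as f:
        meta = json.load(f)
    raw_bytes = meta["rows"] * meta["cols"] * BYTES_PER_CELL
    gib = max(1, math.ceil(raw_bytes * overhead / 2**30))  # round up to whole GiB
    return f"{gib}g"

# An operator like the AutoSizeSparkSubmitOperator mentioned above would run
# something like this at execution time and feed the result into the Spark
# conf (e.g. spark.executor.memory) before submitting the job.
```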
[16:59:00] ebernhardson if it's just what's documented at https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow#Creating_a_new_Airflow_Instance I'll do it. If more/other steps LMK
[16:59:21] inflatador: it should just be that. I would probably copy one of the existing repositories and change references as necessary.
[17:00:11] inflatador: in terms of usage, i think this patch is the one that makes scap recognize it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/883678/1/hieradata/role/common/deployment_server/kubernetes.yaml
[17:00:12] ACK, I'll take care of that. Working out now, but will do it when I get back in ~40
[17:00:21] kk
[17:45:12] back
[18:17:00] ebernhardson: if you have a couple mins: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/887375/
[18:21:22] sure, sec
[18:21:35] airflow Q: I would like to re-run an hourly job for all the hours since ~2022-10-31, can I use airflow for this or is it better to schedule a run manually (relaxing the partition selector so that it does not run hourly)? I guess the latter is better.
[18:22:02] not to pile on, but the new airflow scap repo is ready(?) https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-search I'm heading to lunch but we should be able to move things along at the SRE pairing in ~1h
[18:22:29] dcausse: hmm, that's a lot of hourly jobs. You can use the airflow cli to re-run them in a single command instead of a bunch of clicking in the UI. But if you can run a single command that reads all those dates, it might be better.
[18:23:12] yes I think so too, will do that monthly+daily instead
[18:27:06] dcausse: looks reasonable, if i understand this will generate updates for events where the api responded with a newer revision id than was in the event being processed?
[18:27:50] essentially creating another event for the latest revision id
[18:28:39] the "newer_revision_seen" is triggered when the flink app sees a revision_id that's newer than the one seen in its state, the reconciliation job (this job) simply ignored those previously
[18:29:13] ahh, for some reason i was thinking that EntityState came from the api, but makes sense it comes from the flink state
[18:29:17] but we found that page-undelete events (https://phabricator.wikimedia.org/T329064) were not emitted from MW
[18:30:02] so that would mean further edits to the un-deleted page were not processed?
[18:30:09] yes exactly
[18:31:46] that does seem troubling, this seems reasonable. I've +2'd that
[18:33:17] what it'll do is salvage entities that have had an edit after being restored, but not the ones that were just undeleted without subsequent edits
[18:33:56] thanks!
[19:26:32] back
[19:50:04] o/ Any thoughts on what's up if I see this message when trying to install cirrus & elastic and create indexes?
[19:50:07] Validation Failed: 1: type is missing;2: type is missing;3: type is missing etc....
[19:50:34] full stacktrace can be found at the bottom of https://github.com/wmde/wikibase-release-pipeline/pull/398#issuecomment-1421352816
[19:50:40] ebernhardson: you might know ^
[19:51:04] addshore: looks like you are using an elastic 6.x version of cirrus with elastic 7
[19:51:20] ooooo
[19:51:49] addshore: the "type" is something that was removed in elastic 7, we had some transition code that i thought made it into the last MW release but have to check. Which version are you using?
[19:52:01] so my ES version is 6.8.23, so i guess im using old ES and new cirrus?
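To illustrate the mismatch being diagnosed here: Elasticsearch 6.x keys mappings under a document type, while 7.x removed types entirely, so a typeless 7.x-style request sent to a 6.x cluster fails validation with "type is missing". The field and type names below are made up for illustration, not the actual CirrusSearch schema:

```python
# Elasticsearch 6.x: mappings are keyed by a document type ("page" and
# "title" are hypothetical names, not the real CirrusSearch mapping).
es6_body = {
    "mappings": {
        "page": {
            "properties": {"title": {"type": "text"}},
        },
    },
}

# Elasticsearch 7.x: document types were removed, so the same mapping is
# typeless. Sending a typeless 7.x-style body like this to a 6.x cluster
# produces errors like "Validation Failed: 1: type is missing".
es7_body = {
    "mappings": {
        "properties": {"title": {"type": "text"}},
    },
}
```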
[19:52:10] * addshore reads about what is compatible with which thing
[19:52:19] addshore: ahh, yea that could be as well.
[19:52:34] MediaWiki 1.39+ requires Elasticsearch 7.10.2 (6.8.23+ is possible using a compatibility layer)
[19:52:42] yes, so either I need to upgrade ES, or use this compat layer
[19:52:43] thanks!
[19:52:54] n[
[19:52:56] np
[19:53:38] what version is used in wmf prod currently? I see I need 7.10.2 or above, should I just go with 7.17.8?
[19:53:57] we use 7.10.2; 7.10.3+ is released under a non-OSI-approved license
[19:54:10] got it, I'll be using 7.10.2 for now then too!
[20:34:44] does es 6 -> 7 require a reindex?
[20:34:58] (at the same time as mw 1.38 -> 1.39)
[20:39:10] addshore: i don't know if it strictly requires it, but it's probably good to do
[20:39:40] internally we did have to do a full reindex prior to the switch
[20:52:02] okay!
[20:53:22] ebernhardson ryankemper got a patch up for the VM, according to https://wikitech.wikimedia.org/wiki/Ganeti#Update_the_DHCP_config_with_the_MAC_of_the_new_VM this is the next step in provisioning
[20:53:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/887408/
[20:54:03] inflatador: looks good minus one tiny nit (1 extra newline)
[20:54:40] ryankemper ACK, will fix
[20:54:42] * ryankemper is going afk for 30 mins, tech just showed up to look at our leaking water heater
[22:12:12] i see an-airflow1005 is up, although it's not accepting my ssh connection yet. maybe it still needs puppet to set it up, or i didn't set the right groups for access
[22:12:24] well, it accepts the connection but asks me for a password
[22:14:13] ebernhardson yeah, it's running Puppet
[22:15:51] OK, it's known to Cumin but not to the known-hosts automation yet
[22:16:04] * inflatador shrugs
[22:16:46] can still wait for a bit, these things aren't instant :)
[22:17:22] Good VMs come to those who wait ;P
[23:33:25] * ebernhardson shouldn't have tried to do a dev instance on stat1005... it was idle when i started, but now its cpu is fully occupied :(
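As a reference point for the reindex question at 20:34, at the raw API level a 6-to-7 migration is typically "create a new typeless index, then copy documents with _reindex". A minimal sketch with the 7.x Python client, assuming hypothetical index names and a local cluster; a CirrusSearch install would normally drive this through the extension's maintenance scripts rather than raw API calls:

```python
from elasticsearch import Elasticsearch  # elasticsearch-py 7.x client

es = Elasticsearch("http://localhost:9200")

# Create the target index with a 7.x-style (typeless) mapping. The index
# names and the single "title" field are made up for illustration.
es.indices.create(
    index="wiki_content_v7",
    body={"mappings": {"properties": {"title": {"type": "text"}}}},
)

# Copy all documents from the old index into the new one.
es.reindex(
    body={
        "source": {"index": "wiki_content_v6"},
        "dest": {"index": "wiki_content_v7"},
    },
    wait_for_completion=True,
)
```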