[07:18:31] hello folks
[07:40:58] the more I think about Cassandra for the ml-cache nodes the more I think that it will be a good solution for us
[07:41:57] the main question mark is latency
[07:42:15] even if I think that a properly designed keyspace (well partitioned etc..) in Cassandra is fast enough
[07:56:14] maybe a little less performant than Redis
[07:56:26] but way more flexible (maintenance, availability, etc..)
[07:56:41] we should involve Data Engineering for sure in the design
[08:19:49] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey) >>! In T301878#7741542, @Ottomata wrote: > I see! So you'd kinda be using ChangeProp as the Knative eventing stand in for now :) Exactly yes, after the...
[10:29:51] good morning :)
[11:34:07] \o
[11:35:36] elukey: giving the VM creation for staging etcd's another try in a moment
[11:39:16] ack!
[11:39:23] I was in a meeting sorry, going afk for lunch now :)
[11:39:26] hello aiko!
[11:44:22] Buon appetito :)
[12:38:24] elukey: ping when you're back. This VM creation is taking forever :-/
[12:50:40] unping.
[13:26:07] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (Ottomata) > Does it make sense? Ya that makes sense! So the switchover would just be a changeprop deployment that would change the consumer's behavior when it r...
[14:12:50] klausman: o/
[14:12:59] yeah it takes a long time :(
[14:13:02] did it work at the end?
[14:14:22] Still waiting
[14:14:35] Moritz is doing other maintenance that slows everything in codfw_A down
[14:14:47] And my attempts for the other VMs are failing
[14:15:27] makes sense yes
[14:16:51] what kind of errors are you getting?
[14:17:19] Ganeti doesn't see the added DNS entries
[14:32:24] Morning all
[14:34:00] morning!
[14:35:06] \o
[14:47:17] (CR) Elukey: [V: +2 C: +2] Updates frozen-requirements after wheel cleanup. [services/ores/deploy] - https://gerrit.wikimedia.org/r/762901 (https://phabricator.wikimedia.org/T300195) (owner: Halfak)
[14:47:47] ok I think that we can try to deploy ORES
[14:47:55] any thoughts?
[14:52:56] Lift-Wing, Machine-Learning-Team (Active Tasks): Support (or not) the ORES augmented feature output in liftwing - https://phabricator.wikimedia.org/T301766 (achou) Hi, I did some research on how ORES does this, in [[ https://github.com/wikimedia/ores/blob/master/ores/scoring_context.py#L67 | ores/scori...
[14:56:50] elukey: o/ let's do it!
[14:58:18] aiko: sure!
[14:58:32] so from deploy1002, the deployment node, I tried
[14:58:33] httpbb /srv/deployment/httpbb-tests/ores/test_ores.yaml --hosts=ores1001.eqiad.wmnet --http_port=8081
[14:58:47] that works, but two requests fail, the hiwiki ones
[14:58:56] that is ok since the model is not deployed yet
[14:59:13] so we'll be able to check ores1001 after the canary deployment
[14:59:17] and see if it is all good
[15:00:41] the docs to deploy to prod are https://wikitech.wikimedia.org/wiki/ORES/Deployment#Deploy_to_production
[15:00:46] aiko: --^
[15:01:08] I can drive via screen sharing if you want
[15:01:18] and let you do the checks
[15:01:26] what do you think?
[15:01:29] yess please!
[15:01:31] :D
[15:03:45] +1 for enthusiasm :)
[15:04:24] we'll see later :D
[15:05:01] You mean a sort of "you'll like it until you see it"?
[15:09:12] usually an ORES deployment is an experience
[15:09:26] enthusiasm tends to decrease after it :D
[15:09:33] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Re-evaluate ip pools for ml-serve-{eqiad,codfw} - https://phabricator.wikimedia.org/T302701 (akosiaris) >>! In T302701#7740625, @elukey wrote: > On ml-serve-eqiad (half way through loading ORES pods): > > ` > root@deploy1002:~# kubectl get svc -...
[15:13:42] elukey: I have a review for the etcd host setup. Should I let you review it or seek someone else while you're elbows-deep in ORES guts
[15:13:53] Wow, that sounds way grosser than it did in my head.
[15:21:20] klausman: I can check it in ~30 mins (hopefully :D)
[15:25:10] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) last commit on ores1001: ` elukey@o...
[15:35:49] ORES deployment to the canary started
[15:40:43] TIA
[15:58:47] so we are having an issue with the new model for hiwiki damaging
[15:58:52] the rest looks to be working
[16:02:59] nope all working, scratch that
[16:11:39] deployment completed
[16:12:37] \o/
[16:13:41] https://ores.wikimedia.org/v3/scores/hiwiki/777/goodfaith
[16:13:45] https://ores.wikimedia.org/v3/scores/hiwiki/777/damaging
[16:13:47] all good
[16:13:54] nothing horrible in the logs
[16:19:11] we are watching https://logstash.wikimedia.org/app/dashboards#/view/ba190230-deb8-11e8-99b8-7fba019e77c2?_g=h@6c3eb7f&_a=h@f2110be
[16:19:20] and some weird errors popped up
[16:19:33] from the overall metrics it seems that the cluster is adjusting after the deploy
[16:19:38] so we'll give it some minutes
[16:19:47] I think that we'll need to tune the scap config to be more gentle
[16:20:18] for example we have
[16:20:19] Feature extraction error for model 1074689111 and revision goodfaith due to: Timed out after 15 seconds.
[16:20:24] that is cryptic
[16:21:31] ahhhhhhh
[16:21:42] aiko: I am double stupid
[16:21:57] we are seeing the logs because I added them :D
[16:22:37] these are all the "scores errored" that are due to serious problems
[16:23:00] (yes we do have scores errored that are apparently "expected", like the ones caused by non-existent revisions etc..)
[16:23:13] so there is no regression
[16:23:17] it is just more data to check :)
[16:23:57] for example
[16:23:59] Feature extraction error for model 1334902099 and revision itemquality due to: JSONDecodeError: Failed to process datasource.wikibase.revision.entity_doc: Expecting value: line 1 column 1 (char 0)
[16:24:09] this is something that failed silently before, now we see it
[16:24:32] (as Aiko pointed out there is a bug in logging :P)
[16:25:00] elukey: that's good!
[16:25:20] aiko: this was the logging patch https://github.com/wikimedia/ores/pull/355/files
[16:25:25] if you want you can fix it :)
[16:25:43] now I see the problem, we didn't catch it at review time :(
[16:25:58] so for the moment let's call the deployment a success
[16:27:33] \o/
[16:30:11] Well done!
[16:30:21] o/
[16:30:27] nice work on the ores deployment everyone!
[16:31:05] \o/
[16:31:59] accraze: o/
[16:32:05] these failures are interesting
[16:32:06] Feature extraction error for model 1074690851 and revision articletopic due to: Timed out after 15 seconds.
[16:32:15] 15 seconds?
[16:32:45] weird! i thought the timeout was longer?
[16:33:34] but also, 15 seconds for feature extraction :O :O
[16:34:06] lololol
[16:34:35] yeh i mean that should be cached if it actually does take 15 secs
[16:34:55] my guess is that the extractor got stuck or something
[16:36:22] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) Deployment completed! The new hiwik...
[16:39:16] elukey: remind me, should the installer for VMs wait for input on partition setup?
[16:39:57] 'cause it does :-S
[16:39:59] klausman: in theory no, d-i should get through without stopping
[16:40:10] did you run puppet on install servers first?
:D
[16:40:12] # cat /proc/cmdline
[16:40:14] BOOT_IMAGE=debian-installer/amd64/linux initrd=debian-installer/amd64/initrd.gz vga=normal auto-install/enable=true preseed/url=http://apt.wikimedia.org/autoinstall/preseed.cfg DEBCONF_DEBUG=5 netcfg/choose_interface=auto netcfg/get_hostname=unassigned netcfg/get_domain=unassigned netcfg/dhcp_timeout=60 --- console=ttyS0,115200n8 raid0.default_layout=2 BOOTIF=01-aa-00-00-75-35-eb
[16:40:21] No mention of the partman recipe
[16:40:53] klausman: yeah I think that the install* servers are not updated with your netboot change
[16:40:57] Was I maybe too quick when booting the VM after submitting the changes?
[16:41:12] But I ran `sudo cumin 'install2*' 'run-puppet-agent'`
[16:41:27] Is there more needed for it to pick up changes?
[16:41:58] preseed/url=http://apt.wikimedia.org/autoinstall/preseed.cfg
[16:42:14] Good point
[16:42:34] I usually run puppet on install* and apt*
[16:42:44] let's do it and retry
[16:42:44] I forgot apt
[16:42:55] I have it bookmarked otherwise I forget too :D
[16:43:17] is apt2001 enough?
[16:43:38] I'd suggest running puppet on all install and apt nodes just to be sure
[16:44:54] Ok, try #2
[16:51:20] there we go
[16:51:32] alright, now starting the setup of the other two VMs as well
[16:57:23] super
[17:08:42] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/767221 when you have some time
[17:27:36] folks I just got notified that tomorrow I'll not have any electricity from 17:00 to midnight my time
[17:27:49] so I may need to miss the team meeting :(
[17:28:24] merci
[17:28:47] Do some meditation, detach from doom scrolling etc.
[17:35:06] yeah :D
[17:35:14] going afk now, ttl!
[17:35:22] have a good rest of the day folks
[17:35:23] :)
[17:35:41] \o
[17:49:19] bye Luca :)
[20:42:28] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Ciell) This is great, thank you so much eluk...