[08:15:49] morning folks!
[08:20:59] hello! How are you feeling??
[08:22:00] (back in a bit)
[08:56:13] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Cleanup NLLB200 docker image - https://phabricator.wikimedia.org/T324464 (elukey) a:elukey→None
[08:56:33] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Cleanup NLLB200 docker image - https://phabricator.wikimedia.org/T324464 (elukey) p:High→Medium Next steps: * Create a `blubber.yaml` file to generate the Dockerfile. This will bring us 100% compatible with production's standar...
[08:57:38] Machine-Learning-Team: Fix alternate names for ML ks8 TLS certificates - https://phabricator.wikimedia.org/T306613 (elukey) Open→Declined With the k8s 1.23 upgrade we'll use PKI, so cergen's certificates will be deprecated.
[09:03:04] Machine-Learning-Team: Remove hack from ML's blubber files - https://phabricator.wikimedia.org/T324658 (elukey) We discussed this task during the team meeting, and we are going to split the work in this way: * @isarantopoulos will check if we can fix the new revscoring single Docker image directly, so that...
[09:13:44] hey Luca, I am much better today!
[09:13:57] started feeling more like myself again
[09:23:54] nice :)
[09:27:07] klausman: o/ https://phabricator.wikimedia.org/T325132
[09:27:10] good morning :)
[09:27:16] I'll take eqiad if you are ok
[10:06:14] artificial-intelligence, Code-Review-Workgroup, Developer-Advocacy: AI which suggests best reviewers for a patch ("Patch wrangler") - https://phabricator.wikimedia.org/T155851 (Aklapper)
[10:24:36] Lift-Wing, Machine-Learning-Team, Research (FY2022-23-Research-October-December): Create a language agnostic model to predict reverts on Wikipedia - https://phabricator.wikimedia.org/T314385 (Sheilakaruku) Hello great team, My name is Sheila, I'm from Nairobi, Kenya. I was recently selected to work...
[10:33:57] folks I am running a long test with benthos on editquality-goodfaith in eqiad to see if I can repro the connection issue
[10:34:07] please lemme know if it interfere with anything you are doing, in case I'll stop
[10:35:08] damaging sorry not goofaith
[10:35:13] *goodfaith
[10:35:26] * elukey apt-get install spell-check-before-write-in-IRC
[10:48:09] morning :)
[10:48:46] isaranto: welcome back!
[10:49:03] good to be back !
[10:50:11] elukey: will do codfw,
[10:50:28] (sorry about being late, phone ran out of battery and thus no alarm. Had plenty of good sleep, tho :))
[10:50:52] np! Already done staging-codfw btw
[10:50:57] roger
[10:55:00] I'll start with the caching machines. Basically do them in the order in the ticket
[11:11:00] updated what I have done so far
[11:11:20] the ores pool counters should be as easy as reboot one of them at the time, but I'd need to double check
[11:19:31] (CR) Ilias Sarantopoulos: blubber: create universal revscoring image (4 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[11:19:45] very interesting, I can't reproduce anymore the issue for the broken envoy conns
[11:19:47] (PS12) Ilias Sarantopoulos: blubber: create universal revscoring image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586)
[11:19:52] that is good, but I am still a little puzzled
[11:23:04] elukey: if the reboot-single cookbook fails (because Cassandra was slow to start), is there any cleanup to do after the original problem is fixed?
[11:24:15] klausman: in theory no, there is probably a bit of downtime left to expire but we don't really care
[11:24:54] ack
[11:25:14] one thing is odd: cass-a seems mostly fine, but one connection to port 9042 is still refused
[11:25:55] hang on, that might have fixed itself just as I sent the message
[11:26:43] yep, all green now
[11:39:09] * elukey lunch!
[11:39:19] I have restarted 3 ores nodes in eqiad, all good so far
[11:39:29] will do the rest after lunch
[11:41:59] ack. I'm currently doing etcd2* and then the non-ores machines should all be done
[11:42:11] (PS13) Ilias Sarantopoulos: blubber: create universal revscoring image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586)
[11:49:37] (CR) Ilias Sarantopoulos: blubber: create universal revscoring image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[11:59:57] and etcd's donw
[11:59:59] done*
[12:20:22] <- lunch
[12:22:31] Machine-Learning-Team: Remove hack from ML's blubber files - https://phabricator.wikimedia.org/T324658 (isarantopoulos) this seems to work! ` builder: command: ["python3.7", "-m", "nltk.downloader", "omw", "sentiwordnet", "stopwords", "wordnet"] ` I built the revscoring image and tested it. the `NL...
[13:16:23] (PS14) Ilias Sarantopoulos: blubber: create universal revscoring image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586)
[13:17:19] (CR) Ilias Sarantopoulos: blubber: create universal revscoring image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[13:20:04] I made all the changes and the patch is ready for review again. Also the new pipeline I added in CI works
[13:30:50] you know what I love? My browser SEGVing mid-code review
[13:54:37] (CR) Klausman: blubber: create universal revscoring image (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[14:12:12] (CR) Elukey: blubber: create universal revscoring image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[14:13:15] isaranto: o/ there are 5 open comments in the code review afaics, mostly nits but let's resolve them before another round of reviews
[14:13:49] * elukey keeps rebooting ores nodes
[14:14:36] (PS15) Ilias Sarantopoulos: blubber: create universal revscoring image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586)
[14:17:06] (CR) Ilias Sarantopoulos: blubber: create universal revscoring image (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[14:20:54] elukey: I did resolve them, I just didn't hit "resolve" so that the author can verify it. if the team works differently and the author of the patch should resolve them let me know
[14:26:23] isaranto: ah okok I got fooled by the gerrit UI when clicking on the remaining comments. Since sometimes the comments are related to old PSes I usually add a comment saying "this should be fixed now etc.." but not really mandatory :)
[14:26:27] lemme re-check
[14:27:18] (CR) Elukey: blubber: create universal revscoring image (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[14:37:12] elukey: cool, I'll do the same then! not adding anything doesnt really help
[14:39:40] (CR) Elukey: blubber: create universal revscoring image (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[14:39:54] isaranto: added some comments --^
[14:41:22] also thanks a lot for the FIXME hack removal <3
[14:42:30] isaranto: the other question that i have is if we modified already the integration_config repo (from Releng)
[14:42:43] otherwise if we merge in theory no docker image will be built
[14:44:00] in theory these new bits
[14:44:01] https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/865670/15/.pipeline/config.yaml
[14:44:14] elukey: thanks for the comments. regarding the config repo I dont remember where I wrote it 🤔 but is is merged and it works (already part of ci)
[14:44:25] (CR) Klausman: [C: +1] blubber: create universal revscoring image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[14:44:29] https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-revscoring/6/console
[14:45:49] klausman: ahem you closed my comment, there is an ongoing problem :D
[14:46:41] isaranto: yeah but I think this is only testing the build image process with the new blubber file
[14:46:47] it will not publish the image to the registry
[14:47:04] oops
[14:47:10] sorry 'bout that
[14:48:09] isaranto: I think we need something like https://gerrit.wikimedia.org/r/c/integration/config/+/822052
[14:48:19] I thought your comment re "the above would raise...." was pointed at my proposed get() alternative (that indeed would break that way)
[14:48:46] ahhh okok
[14:49:06] and yes, we are a pedantic lot :)
[14:49:27] I know I know Ilias is probably swearing in various languages at the moment
[14:49:59] and I support the swearing against us, completely understandable
[14:50:02] I'd do the same :D
[14:51:31] * elukey feels a terrible colleague
[14:53:07] I may have fallen down a rabbit hole, buuuut
[14:53:22] I now have a command line that will give you all targets a Makefile knows of
[14:53:26] make -pRrq : |awk -v RS= -F: '/(^|\n)# Files(\n|$$)/,/(^|\n)# Finished Make data base/ {if ($$1 !~ "^[#.]") {print $$1}}' | sort | egrep -v -e '^[^[:alnum:]]' -e '^$@$$'|awk -F: '{print $1}'
[14:53:44] simple, yes?
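The one-liner above can be approximated in Python. This is a minimal sketch, not what was run in the channel: it assumes GNU Make is on `PATH`, and the helper names `parse_make_targets` and `list_make_targets` are hypothetical. As in the pipeline, `make -pRrq :` prints Make's internal database without running recipes, and only the blank-line-separated entries between `# Files` and `# Finished Make data base` are inspected. The `-e '^$@$$'` filter, which appears to hide the recipe's own target, is left out.

```python
import re
import subprocess


def parse_make_targets(database: str) -> list[str]:
    """Extract target names from the text printed by `make -pRrq :`."""
    # Keep only the text between the "# Files" header and the closing banner.
    start = re.search(r"^# Files$", database, re.MULTILINE)
    if start is None:
        return []
    end = re.search(r"^# Finished Make data base", database, re.MULTILINE)
    stop = end.start() if end is not None else len(database)
    section = database[start.end():stop]

    targets = set()
    # Entries are separated by blank lines (what `awk -v RS=` relies on).
    for paragraph in re.split(r"\n{2,}", section):
        if not paragraph.strip():
            continue
        # First field when splitting on ":" or a newline (awk's $1 here).
        name = re.split(r"[:\n]", paragraph, maxsplit=1)[0]
        # Drop comments ("# Not a target"), special targets (".PHONY") and
        # anything else that does not start with an alphanumeric character.
        if name and name[0].isalnum():
            targets.add(name)
    return sorted(targets)


def list_make_targets(directory: str = ".") -> list[str]:
    """Run GNU Make in `directory` and return the targets it knows about."""
    # -p prints the database, -R and -r drop built-in variables and rules,
    # -q only asks whether the target is up to date; ":" is the placeholder
    # target from the one-liner. -q can exit non-zero, so check=False.
    result = subprocess.run(
        ["make", "-pRrq", ":"],
        cwd=directory,
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_make_targets(result.stdout)
```

Calling `list_make_targets("path/to/project")` would return a sorted list of target names. Parsing is kept in its own function so it can be exercised on captured output without invoking Make.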
[14:53:46] wow
[14:54:17] It's absolutely baffling that GNU Make has no onboard functionality to do this in a simple way
[14:55:02] even just the first bit is hacky as it uses `:` as a target that can't exist, just to avoid building anything
[14:56:05] At this point it might be easier to take the BNF for Makefiles (if there is one...) and build a parser specifically for the purpose
[14:56:46] definitely not falled down a rabbit hole
[14:57:18] (CR) Ilias Sarantopoulos: blubber: create universal revscoring image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[15:02:15] (CR) Elukey: blubber: create universal revscoring image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[15:02:35] elukey: this was merged so it would work https://gerrit.wikimedia.org/r/c/integration/config/+/866570
[15:03:24] isaranto: perfect than! I didn't see it in my local repo, missed it, good job
[15:05:15] *then
[15:05:34] elukey: meeting?
[15:05:48] yeah sorry I got logged out and I didn't see the reminder :(
[15:06:00] happens to me all the time as well :-/
[15:58:46] * elukey taking a break
[16:20:02] (PS16) Ilias Sarantopoulos: blubber: create universal revscoring image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586)
[16:27:05] (CR) Ilias Sarantopoulos: blubber: create universal revscoring image (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[16:27:29] (CR) Ilias Sarantopoulos: blubber: create universal revscoring image (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/865670 (https://phabricator.wikimedia.org/T323586) (owner: Ilias Sarantopoulos)
[16:35:23] Machine-Learning-Team: Create a pre-commit hook for inference-services repo - https://phabricator.wikimedia.org/T325198 (isarantopoulos)
[16:38:13] Lift-Wing, Machine-Learning-Team: Fix mwapi host header issue for outlink model server - https://phabricator.wikimedia.org/T325199 (achou)
[16:42:12] klausman: I have started to do some ores200X reboots as well
[16:42:36] roger. if you want help with any of that, lmk
[16:43:22] yeah feel free to start from 2009 and reboot some if you have time
[16:43:58] I presume the cumin reboot-single cookbook is fine for the purpose?
[16:44:11] (with --depool, probably)
[16:44:18] yes yes with the --depool option is safe
[16:44:23] ack
[16:45:17] in theory a clean depool for the ores nodes is not 100% possible IIUC, all the nodes have a celery worker that picks up tasks from the queue and executes them
[16:45:42] so when we stop a node the celery worker may be in the middle of a job run
[16:46:13] but we cannot really do much
[16:46:45] ack
[16:46:45] the orespoolcounters should be redis nodes used by ores as locking/similar mechanism
[16:46:56] I'll reboot them one-by-one with the same cookbook
[16:48:08] ahh no wait they are instances of https://www.mediawiki.org/wiki/PoolCounter
[16:48:51] elukey: wait... ores2009? It's not in the bug
[16:49:12] only 1001-1009 and 2001-2008
[16:50:16] yes yes but let's do it anyway, it has a super long uptime, maybe Moritz forgot about it
[16:50:25] alrighty
[16:54:24] (PS1) AikoChou: outlink: fix mwapi session host headers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/868131 (https://phabricator.wikimedia.org/T325199)
[16:54:35] Lift-Wing, Machine-Learning-Team: No healthy upstream and upstream connect error in Lift Wing - https://phabricator.wikimedia.org/T322196 (elukey) The issue seems not reproducible anymore. We put in place some fences (see above code reviews) to force Envoy to retry on certain TCP connection problems, tha...
[16:56:06] (CR) AikoChou: [C: +2] Create a test folder and add lua scripts for wrk [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/866360 (https://phabricator.wikimedia.org/T323613) (owner: AikoChou)
[17:14:19] Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (isarantopoulos) I didn't see any events while describing the pod and the metrics also report lower memory usage than the limit https://grafana.wikimedia.org/d/-D2KNUEGk/kuberne...
[17:20:57] Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (isarantopoulos) The plots below better explain the results of the tests. AS already mentioned they require further investigation but at the moment it seems that MP out of the b...
[17:21:21] logging off, cya tomorrow folks!
[17:21:42] klausman: completed 2004, are you ok for 2005?
[17:22:26] ah yes I see that you already started it :)
[17:23:27] updated the task as well, after 2005 all reboots are completed
[17:24:15] all metrics are ok for Ores, perfect
[17:24:26] all right going to log off for today, have a nice rest of the day folks!
[17:28:28] aaagh, I just started the cookbook for 2004
[17:28:36] oh well, it'll be extra fresh
[17:34:23] ok, that one's back and done as well.
[17:45:29] Lift-Wing, Machine-Learning-Team: Deploy MultilingualRevertRiskModel to production - https://phabricator.wikimedia.org/T325218 (achou)
[17:45:43] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Test MultilingualRevertRiskModel inference service on ml-sandbox - https://phabricator.wikimedia.org/T323613 (achou)
[17:47:14] (PS4) AikoChou: revertrisk: upgrade to multilingual revertrisk model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/861434 (https://phabricator.wikimedia.org/T325218)
[18:04:42] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Deploy MultilingualRevertRiskModel to production - https://phabricator.wikimedia.org/T325218 (achou) The model has been uploaded to Thanos Swift: ` aikochou@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/experimental/revertr...
[18:15:12] (Abandoned) AikoChou: revertrisk: update knowledge_integrity and set publish image tag [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/867575 (https://phabricator.wikimedia.org/T321594) (owner: AikoChou)
[18:32:00] Lift-Wing, Machine-Learning-Team: Deploy revert-risk-model to production - https://phabricator.wikimedia.org/T321594 (achou)
[18:32:29] Lift-Wing, Machine-Learning-Team: Connect Outlink topic model to eventgate - https://phabricator.wikimedia.org/T315994 (achou)
[18:33:23] Lift-Wing, Machine-Learning-Team: Test MultilingualRevertRiskModel inference service on ml-sandbox - https://phabricator.wikimedia.org/T323613 (achou)
[18:34:05] Lift-Wing, Machine-Learning-Team: Test batch prediction for revert-risk model - https://phabricator.wikimedia.org/T323023 (achou)
[20:13:34] (CR) DannyS712: [C: +2] Replace deprecated MWHttpRequest::factory [extensions/ORES] - https://gerrit.wikimedia.org/r/866788 (https://phabricator.wikimedia.org/T324918) (owner: Umherirrender)
[20:33:22] (Merged) jenkins-bot: Replace deprecated MWHttpRequest::factory [extensions/ORES] - https://gerrit.wikimedia.org/r/866788 (https://phabricator.wikimedia.org/T324918) (owner: Umherirrender)
[23:09:38] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (kevinbazira) The conclusion on the backtesting results is that most of the languages look fine besides: - ganwiki has a low precision (0.67) and very...
[23:14:34] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (kevinbazira)