[05:11:11] (CR) Kevin Bazira: [C: +2] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/783843 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[05:17:00] (Merged) jenkins-bot: editquality: fix incorrect values in augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/783843 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[05:28:48] (CR) Kevin Bazira: articlequality: add the ORES augmented feature output (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778248 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[05:30:38] (CR) Kevin Bazira: [C: +2] draftquality: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778225 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[05:31:13] (PS4) Kevin Bazira: draftquality: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778225 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[05:40:29] (CR) Kevin Bazira: [C: +2] topic: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778250 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[05:41:05] (CR) Kevin Bazira: [V: +2 C: +2] draftquality: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778225 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[05:41:45] (CR) Kevin Bazira: [V: +2 C: +2] topic: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778250 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[05:41:52] (PS5) Kevin Bazira: topic: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778250 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[05:42:01] (CR) Kevin Bazira: [V: +2 C: +2] topic: add the ORES augmented feature output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/778250 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou)
[06:48:58] hello folks
[07:01:53] I am deploying the isvc pods to ml-serve-codfw, after that the cluster re-init will be completed
[07:02:29] super happy about it
[07:03:03] woohoo ..
[07:42:45] the codfw cluster should be done!
[08:28:30] \o/
[08:30:53] (PS1) Elukey: Update cp35 wheels to their cp37 version [research/ores/wheels] - https://gerrit.wikimedia.org/r/784631 (https://phabricator.wikimedia.org/T303801)
[08:32:51] (PS2) Elukey: Update cp35 wheels to their cp37 version [research/ores/wheels] - https://gerrit.wikimedia.org/r/784631 (https://phabricator.wikimedia.org/T303801)
[08:43:00] this is an attempt to move to python 37 --^
[08:43:23] I am trying to access deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud but I get failures
[09:07:28] so I got access, but for some reason the wheels submodule gets into a messy state
[09:08:16] I'll try to clean up everything and run puppet
[09:10:43] kevinbazira: you can deploy your changes to eqiad/codfw (forgot to ping you earlier :)
[09:11:43] elukey thanks for the merge. starting deployment now ..
[09:29:51] both eqiad and codfw deployments have been completed successfully.
[09:29:51] checking pods now ...
[09:35:06] 3/4 new pods are up and running:
[09:35:07] NAME READY STATUS RESTARTS AGE
[09:35:07] svwiki-damaging-predictor-default-6d8k2-deployment-54d645dsdnpf 3/3 Running 0 11m
[09:35:07] svwiki-goodfaith-predictor-default-pcr5b-deployment-556db5872zk 3/3 Running 0 10m
[09:35:07] tawiki-reverted-predictor-default-s7zqx-deployment-d7f58794cktx 3/3 Running 0 4m19s
[09:35:09] translatewiki-reverted-predictor-default-bdq6t-deployment-2jdbf 1/3 CrashLoopBackOff 5 4m18s
[09:35:44] the translatewiki pod is running into a CrashLoopBackOff issue.
[09:35:44] this could be caused by the model upload. investigating now ...
[09:37:46] kevinbazira: there are multiple ways to check what's wrong
[09:38:02] with the kube_env credentials, you can check the pods logs
[09:41:22] for example:
[09:41:23] kubectl logs translatewiki-reverted-predictor-default-bdq6t-deployment-2jdbf -n revscoring-editquality-reverted
[09:41:43] the cli will ask what container you want to inspect, like "storage-initializer"
[09:42:12] in this case I see no issues in the logs
[09:43:29] with `kubectl describe pod translatewiki-reverted-predictor-default-bdq6t-deployment-2jdbf -n revscoring-editquality-reverted` it seems that the readiness probes are failing
[09:50:47] ahhh there you go, I missed the logs for kserve-container
[09:50:47] File "/opt/lib/python/site-packages/editquality/feature_lists/translatewiki.py", line 4, in <module>
[09:50:50] import langdetect
[09:50:53] kevinbazira: --^
[09:50:55] ModuleNotFoundError: No module named 'langdetect'
[09:51:00] is there anything special for translatewiki?
[09:51:15] maybe we need to update the docker images
[09:52:31] Ummm ... I am not aware of the 'langdetect' module.
[09:52:45] let me dig into it ...
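A minimal sketch of the triage step above, assuming the plain-text `kubectl get pods` listing pasted at 09:35 is the input. The names `PodRow`, `parse_pods` and `unready_pods` are illustrative only and are not part of any Lift Wing tooling; the sketch just flags rows whose READY count is not n/n or whose STATUS is not `Running`, so the crashing pod stands out.

```python
from typing import List, NamedTuple


class PodRow(NamedTuple):
    name: str
    ready: str      # e.g. "1/3"
    status: str     # e.g. "CrashLoopBackOff"
    restarts: int


def parse_pods(listing: str) -> List[PodRow]:
    """Parse the default table layout: NAME READY STATUS RESTARTS AGE."""
    rows = []
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, ready, status, restarts = fields[:4]
        rows.append(PodRow(name, ready, status, int(restarts)))
    return rows


def unready_pods(rows: List[PodRow]) -> List[PodRow]:
    """Return pods that are not fully ready or not in the Running state."""
    bad = []
    for row in rows:
        ready_now, wanted = row.ready.split("/")
        if ready_now != wanted or row.status != "Running":
            bad.append(row)
    return bad


# Listing copied from the log above (timestamps removed).
LISTING = """
NAME READY STATUS RESTARTS AGE
svwiki-damaging-predictor-default-6d8k2-deployment-54d645dsdnpf 3/3 Running 0 11m
svwiki-goodfaith-predictor-default-pcr5b-deployment-556db5872zk 3/3 Running 0 10m
tawiki-reverted-predictor-default-s7zqx-deployment-d7f58794cktx 3/3 Running 0 4m19s
translatewiki-reverted-predictor-default-bdq6t-deployment-2jdbf 1/3 CrashLoopBackOff 5 4m18s
"""

if __name__ == "__main__":
    for pod in unready_pods(parse_pods(LISTING)):
        print(pod.name, pod.ready, pod.status, pod.restarts)
```

Once the failing pod is known, passing `-c kserve-container` to `kubectl logs` selects the container whose logs held the traceback here, instead of waiting for the CLI to ask.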
[09:57:11] yep, translate wiki uses langdetect here https://github.com/wikimedia/editquality/blob/1a4ba8333b7aaa9d5ac67e312b8077827e54bf46/editquality/feature_lists/translatewiki.py#L4
[10:02:50] let me add langdetect to the editquality image
[10:11:45] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks): Fix translatewiki kserve container CrashLoopBackOff issue - https://phabricator.wikimedia.org/T306501 (kevinbazira)
[10:17:50] (PS1) Elukey: Update scap settings for the Python 3.7 migration [services/ores/deploy] - https://gerrit.wikimedia.org/r/784649 (https://phabricator.wikimedia.org/T303801)
[10:23:58] going afk for lunch!
[10:24:57] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks): Fix translatewiki kserve container CrashLoopBackOff issue - https://phabricator.wikimedia.org/T306501 (kevinbazira)
[10:34:44] (PS1) Kevin Bazira: editquality: fix translatewiki CrashLoopBackOff issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/784653 (https://phabricator.wikimedia.org/T306501)
[10:37:13] Machine-Learning-Team, ORES, artificial-intelligence: Research Project Idea: Use AI to suggest improvements to patches uploaded to gerrit - https://phabricator.wikimedia.org/T195235 (hashar) Open→Declined
[10:57:01] (PS2) Kevin Bazira: editquality: fix translatewiki CrashLoopBackOff issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/784653 (https://phabricator.wikimedia.org/T306501)
[11:08:59] elukey I've pushed a patch. please help review when you get a minute: https://gerrit.wikimedia.org/r/784653
[11:34:17] kevinbazira: lgtm, one qs - is the bump of the base image expected?
[11:34:40] (if so it may be useful to add the note in the commit msg so people can have a confirmation)
[11:37:30] (CR) Elukey: [C: +1] editquality: fix translatewiki CrashLoopBackOff issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/784653 (https://phabricator.wikimedia.org/T306501) (owner: Kevin Bazira)
[11:37:36] (PS1) Kevin Bazira: editquality: fix translatewiki CrashLoopBackOff issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/784659 (https://phabricator.wikimedia.org/T306501)
[11:42:08] (CR) Elukey: [C: +1] editquality: fix translatewiki CrashLoopBackOff issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/784659 (https://phabricator.wikimedia.org/T306501) (owner: Kevin Bazira)
[11:44:50] elukey I've added the note about the latest buster: https://gerrit.wikimedia.org/r/784659
[11:48:39] (CR) Kevin Bazira: [C: +2] editquality: fix translatewiki CrashLoopBackOff issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/784659 (https://phabricator.wikimedia.org/T306501) (owner: Kevin Bazira)
[11:56:56] (Merged) jenkins-bot: editquality: fix translatewiki CrashLoopBackOff issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/784659 (https://phabricator.wikimedia.org/T306501) (owner: Kevin Bazira)
[11:59:12] (Abandoned) Kevin Bazira: editquality: fix translatewiki CrashLoopBackOff issue [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/784653 (https://phabricator.wikimedia.org/T306501) (owner: Kevin Bazira)
[12:28:45] kevinbazira: merged, you can deploy :)
[12:29:13] great. thanks for the merge. deploying now ...
[12:38:33] deployed successfully but the CrashLoopBackOff issue still exists in the translatewiki pod
[12:38:34] I think the new image had not yet been generated before the last patch.
[12:38:34] 2022-04-20-051759-publish was the latest before the last patch; now I see 2022-04-20-115717-publish as the latest editquality image.
[12:38:34] pushing a new patch for this.
[12:42:03] (PS1) Elukey: Update cp35 wheels to their cp37 version [research/ores/wheels] (python37) - https://gerrit.wikimedia.org/r/784664 (https://phabricator.wikimedia.org/T303801)
[12:42:25] (Abandoned) Elukey: Update cp35 wheels to their cp37 version [research/ores/wheels] - https://gerrit.wikimedia.org/r/784631 (https://phabricator.wikimedia.org/T303801) (owner: Elukey)
[12:42:53] (CR) Elukey: [C: +2] Update cp35 wheels to their cp37 version [research/ores/wheels] (python37) - https://gerrit.wikimedia.org/r/784664 (https://phabricator.wikimedia.org/T303801) (owner: Elukey)
[12:44:40] (PS2) Elukey: Update scap settings for the Python 3.7 migration [services/ores/deploy] - https://gerrit.wikimedia.org/r/784649 (https://phabricator.wikimedia.org/T303801)
[13:07:19] (PS3) Elukey: Update scap settings for the Python 3.7 migration [services/ores/deploy] - https://gerrit.wikimedia.org/r/784649 (https://phabricator.wikimedia.org/T303801)
[13:24:05] I am trying with https://gerrit.wikimedia.org/r/c/mediawiki/services/ores/deploy/+/784649 to move us to python37 with the fewest changes
[13:24:28] I created a python37 branch on the ORES wheels gerrit repo, and I merged a change in there with new wheels
[13:24:37] it worked nicely, uwsgi didn't complain
[13:24:40] but celery did
[13:24:57] File "/srv/deployment/ores/deploy-cache/revs/f60c6d2fc9e2b55be2f1eb48394710df94be524e/venv/lib/python3.7/site-packages/celery/backends/redis.py", line 21
[13:25:00] from . import async, base
[13:25:03] ^
[13:25:05] SyntaxError: invalid syntax
[13:25:41] celery 4.1.1 seems not compatible with python37
[13:26:43] yes we need celery 5.2 sigh
[13:26:49] https://pypi.org/project/celery/
[13:27:04] this is a big change for us
[13:27:11] but maybe it is something that can work
[13:35:22] Machine-Learning-Team, Patch-For-Review: Upgrade ORES to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T303801 (elukey) I am testing new wheels, built on Buster and python37, with https://gerrit.wikimedia.org/r/c/mediawiki/services/ores/deploy/+/784649. The idea is to have a python37 bra...
[13:47:33] and also celery 5.x has a different set of parameters
[13:47:44] but for us it should be a minimal change
[13:48:07] the big question that I have in mind is if celery 5.x clients can work on Redis alongside 4.x ones
[13:48:40] if yes, we will be able to reimage/upgrade one ores node at a time
[13:48:53] otherwise we'll need to depool a DC, reimage all the nodes, repool
[13:49:33] both options are viable, nothing major
[13:50:06] need to go get groceries, bbl!
[14:52:08] o/ i'm not too familiar with all the context here, but DataHub has some support for keeping track of ML models and metadata in the data catalog: https://datahubproject.io/docs/rfc/active/1812-ml_models/
[14:52:10] cc btullis also
[15:16:50] ottomata: interesting thanks!
[15:23:20] Machine-Learning-Team: Add 4 new Kubernetes worker nodes to ml-serve-eqiad - https://phabricator.wikimedia.org/T306545 (elukey)
[15:24:00] kevinbazira: you can deploy if you want!
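A small sketch of why the celery 4.1.1 import shown at 13:25 fails under Python 3.7: `async` became a reserved keyword in that release, so a module named `async` can no longer be imported by name. The helper `compiles` and the constant name are illustrative; only the source line itself comes from the traceback above.

```python
import keyword
import sys

# The line from celery/backends/redis.py quoted in the traceback above.
CELERY_4_LINE = "from . import async, base"


def compiles(source: str) -> bool:
    """Return True if `source` parses on the running interpreter."""
    try:
        # compile() only parses; the relative import is never executed.
        compile(source, "<sketch>", "exec")
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    print("python:", sys.version_info[:2])
    print("async is a keyword:", keyword.iskeyword("async"))
    print("celery 4.1.1 line parses:", compiles(CELERY_4_LINE))
```

Because the failure happens at parse time, no configuration change can work around it; the only fix is a celery release that no longer uses `async` as an identifier, which is why the wheels need bumping.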
[16:01:52] (PS1) Elukey: Bump celery to 5.2.6 and add its new dependencies [research/ores/wheels] (python37) - https://gerrit.wikimedia.org/r/784727 (https://phabricator.wikimedia.org/T303801)
[16:02:30] (PS2) Elukey: Bump celery to 5.2.6 and add its new dependencies [research/ores/wheels] (python37) - https://gerrit.wikimedia.org/r/784727 (https://phabricator.wikimedia.org/T303801)
[16:05:07] kevinbazira, aiko o/
[16:05:29] we are in the team meeting, but if you can't join now don't worry (the meeting was moved an hour earlier)
[16:06:57] ohhhh what is the meeting link?
[16:07:55] same one!
[16:08:35] I can join but I don't see it on my calendar
[16:09:03] ahh interesting you are not in it, something weird happened
[16:09:31] aiko: can you check now?
[16:10:01] yep see it now
[16:10:04] super
[16:13:25] (CR) Elukey: [C: +2] Bump celery to 5.2.6 and add its new dependencies [research/ores/wheels] (python37) - https://gerrit.wikimedia.org/r/784727 (https://phabricator.wikimedia.org/T303801) (owner: Elukey)
[16:16:20] (PS4) Elukey: Update scap settings for the Python 3.7 migration [services/ores/deploy] - https://gerrit.wikimedia.org/r/784649 (https://phabricator.wikimedia.org/T303801)
[16:56:43] * elukey afk!
[17:06:01] elukey, it's finally fixed - the translatewiki pod is now up and running. thanks for your help today.
[17:06:01] NAME READY STATUS RESTARTS AGE
[17:06:01] translatewiki-reverted-predictor-default-bzbms-deployment-jxfsq 3/3 Running 0 80s
[17:10:11] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks): Fix translatewiki kserve container CrashLoopBackOff issue - https://phabricator.wikimedia.org/T306501 (kevinbazira) Open→Resolved The CrashLoopBackOff issue is finally fixed - the translatewiki...
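The root cause in this log was a module missing from the image (`langdetect`), found only after the pod crash-looped. A hedged sketch of a check that could run inside an image before deployment follows; `missing_modules` and the `REQUIRED` list are hypothetical, and `langdetect` is the only entry taken from the log. It handles top-level module names only.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list for illustration; only langdetect comes from the log above.
REQUIRED = ["langdetect"]

if __name__ == "__main__":
    absent = missing_modules(REQUIRED)
    if absent:
        raise SystemExit("missing modules: " + ", ".join(absent))
    print("all required modules found")
```

`find_spec` locates a module without importing it, so the check stays cheap even for heavy dependencies.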