[00:17:47] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2006.codfw.wmnet with OS b...
[00:49:12] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2006.codfw.wmnet with OS buste...
[00:53:24] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS b...
[01:12:12] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buste...
[01:13:07] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS b...
[01:13:18] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buste...
[01:22:18] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS b...
[01:40:27] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buste...
[01:57:43] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2008.codfw.wmnet with OS b...
[02:29:52] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2008.codfw.wmnet with OS buste...
[02:40:31] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul)
[02:42:03] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul) @elukey all yours, leaving the task open since I don't have the Packing Slip to receive the servers in Coupa
[06:56:02] good morning :)
[06:56:20] Wow it looks like we have new nodes in codfw :)
[07:19:21] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/759151/ should be sufficient, in theory, to fix draftquality
[07:19:30] will try to deploy it in a bit
[07:45:44] transformer up and running :)
[08:23:13] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Istio gateways on ml-serve clusters spam syslog with warnings - https://phabricator.wikimedia.org/T300707 (elukey)
[08:33:41] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Istio gateways on ml-serve clusters spam syslog with warnings - https://phabricator.wikimedia.org/T300707 (elukey) The issue seems to be https://github.com/istio/istio/pull/32387 Commit: https://github.com/istio/istio/commit/7bb5275ecdbef28b6b1c...
[10:13:55] deployed the new draftquality image as well
[10:18:43] IIUC we need to deploy the new version of the editquality transformer right?
[10:18:46] filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/759214/
[10:19:43] and then we need to add the transformer to the topic model
[10:20:03] ah well it is not yet in deployment-charts, ok :)
[10:20:41] should we make a list of models that we'd need for production?
[11:04:40] Hi everyone
[11:04:40] as Andy recommended to me, I am currently trying to run the editquality reverted_detection_demo.ipynb notebook to try to understand a bit how the wikimedia ML team and project work.
[11:04:40] https://github.com/wikimedia/editquality/blob/master/ipython/reverted_detection_demo.ipynb
[11:04:40] Since the notebook is not runnable anymore (the demo *json.bz2 files are missing, or replaced by *tsv.bz2 files, it's not clear because there is a mix between json and tsv files in the comments and code), I worked on solving it by reading the content of the tsv files instead of the json ones (see end of Part 2).
[11:04:40] Should I create an issue or task somewhere to track my PRs (one in editquality and one in revscoring)?
[11:08:30] SiMaig: Hi! Not sure what you discussed with Andy (he has a lot more context than me), but a phabricator task may be a good start to track what you are doing
[11:55:15] Machine-Learning-Team: Update editquality demo jupyter notebook - https://phabricator.wikimedia.org/T300730 (Simonmaignan)
[11:56:09] Machine-Learning-Team: Update editquality demo jupyter notebook - https://phabricator.wikimedia.org/T300730 (Simonmaignan) Open→In progress Working locally on a new function to read observations from tsv files.
[11:57:25] Task created. I'm not sure I respected all requirements for a Phab task. Don't hesitate to give feedback and modify the task description if needed
[11:58:22] Machine-Learning-Team: Update editquality demo jupyter notebook - https://phabricator.wikimedia.org/T300730 (Simonmaignan) a:Simonmaignan
[11:58:45] Is there a specific instruction to link Github PRs to a Phab Task?
[12:46:51] SiMaig: You can mention the bug in the github PR with something like Bug: T300730, or simply copy the link to the task
[12:46:54] thanks for the work!
[14:52:32] Morning all!
[15:05:46] morning!
[15:24:47] move all the ml-serve-ctrl* nodes to overlay fs (instead of device mapper)
[15:24:49] *moved
[15:25:10] this is the starting point to migrate all the nodes to overlay fs, it will be shared with service ops
[15:31:30] Machine-Learning-Team, serviceops: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey)
[15:59:51] o/
[16:01:02] Just a reminder, no team meeting today because there is the department meeting
[16:16:16] ack!
[16:17:46] Aren't the team meetings on Thursdays (bi-weekly), as mentioned on the mediawiki team page: https://www.mediawiki.org/wiki/Machine_Learning
[16:17:50] ?
[16:18:09] Or are you talking about another team meeting?
[16:21:47] we meet on Monday and Wednesday as a team, the Thursday bi-weekly thing is the twitch live stream that chrisalbon does for the broader community interested in what we do etc..
[16:22:17] accraze: o/
[16:22:41] when you have a moment we can review/merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/759214
[16:23:43] elukey: +1'd !
[16:24:50] Thanks @elukey. By team, do you mean the Wikimedia ML employees, or can everyone contributing also take part?
[16:24:51] Machine-Learning-Team, serviceops: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey)
[16:25:20] SiMaig: only the WMF team yeah
[16:26:05] Simaig that is the livestream which I am just about to restart after a month of chaos in my life. I was referring to the team meeting.
Typically the team meeting is not public just to keep the conversation focused (and frankly nobody else really cares about me talking about changes to the HR policy around PTOs or something like that) but volunteers have attended in the past
[16:26:35] thanks for the clarification
[16:27:55] we are always available in here (IRC) to answer doubts and questions of course :)
[16:28:07] true
[16:28:34] We are here every day, often watching elukey bash his head against his desk in frustration about some k8s things
[16:28:48] I noticed that and am glad and grateful for your support ;)
[16:28:56] 😅
[16:33:58] accraze: mmm
[16:33:59] OCI runtime create failed: container_linux.go:344: starting container process caused "exec: \"python3\\\"\": executable file not found in $PATH": unknown
[16:34:30] the editquality transformers are not coming up
[16:36:13] ohhhh
[16:37:04] lemme see what's going on, that image might need to use `python3.7`
[16:38:25] weird, python3 should work anyway
[16:40:08] i mean the python version defined in the blubberfile should be python3.7
[16:40:28] working on a quick patch
[16:41:23] ah I see, the editquality's blubber file has version: python3.7
[16:45:49] I have a question/remark about the editquality demo notebook I'm working on (https://phabricator.wikimedia.org/T300730 and its associated branch https://github.com/Simonmaignan/editquality/blob/T300730-update-editquality-demo-jupyter-notebook/ipython/reverted_detection_demo.ipynb).
[16:45:49] I am now able to run the whole notebook using the training and testing data contained in the demo datasets tsv files uploaded on the repo (https://github.com/Simonmaignan/editquality/tree/T300730-update-editquality-demo-jupyter-notebook/datasets/demo).
[16:45:49] However, the model performs really poorly (especially the false negatives) -> See notebook output at the end of Part3.
[16:45:49] Is it normal, could something be wrong with my configuration or could something be wrong with the model training?
[16:45:50] PS: if you try to run the modified notebook, you will also need this change in revscoring: https://github.com/Simonmaignan/revscoring/tree/T300730-update-editquality-demo-jupyter-notebook
[16:49:52] elukey: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/759272
[16:54:01] SiMaig: That's an interesting question and honestly part of a larger conversation we've been having about some of our older models. Maybe add your thoughts/analysis to the phab task as well so we can include it in the discussion about what future steps to take with the legacy models.
[16:57:35] Yes definitely what accraze said
[16:58:05] Some of the longstanding models perform poorly AND are old enough that they certainly must have serious model drift issues
[16:58:59] accraze: the change LGTM but I don't get why python3 vs python3.7 fixes the problem
[16:59:30] i think it's due to how the python binary is named in the buster image
[17:01:01] (double checking though...)
[17:01:14] (same)
[17:01:48] so I ran the docker image of the transformer, installed python3
[17:01:53] and everything works
[17:02:26] ahh no interesting, if I use python3 as entrypoint I can repro
[17:02:40] but same with python3.7
[17:03:05] this with docker-registry.wikimedia.org/buster:20220123
[17:03:15] going to try with the editquality one
[17:05:27] i think it might be related to when the image is being built, the version of python installing deps was different than the runtime version
[17:13:07] accraze: I can't really repro with our image :(
[17:13:26] Are there coding rules or a coding style to respect inside the wikimedia repositories? If yes where can I find it? (didn't find it under mediawiki or wikitech)
[17:14:49] we don't have strict rules, also depends on the team/language combination etc..
[17:14:55] (some are more strict than others)
[17:15:09] we usually add linters to enforce what we care about, at least PEP-wise
[17:15:27] anything specific that you'd like to know?
I mean, python etc..
[17:15:32] I can try to look up something
[17:16:20] for python, we've been using the black formatter during our testing pipeline: https://github.com/psf/black
[17:22:17] accraze: if the new image is ready we can try to deploy it
[17:22:45] Nothing particular yet.
[17:22:45] I will use the black linter mentioned by @accraze
[17:24:31] elukey: yeah new image is ready `2022-02-02-170500-publish`
[17:24:45] im still not able to repro either tho
[17:25:02] but the image seems to be ok from a quick manual test
[17:26:25] accraze: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/759282/
[17:26:56] +1'd!
[17:37:35] accraze: still not working, same error
[17:37:45] hmmmm.....
[17:38:29] ahhhhh I found it
[17:39:23] accraze: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/759286
[17:39:42] didn't see it before
[17:41:30] omg
[17:41:47] lol well at least it was only a typo
[17:41:59] +2'd!
[17:49:19] cool the publish pipeline is running, new working image should be ready soon
[17:58:20] elukey: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/759290
[17:58:26] ^ this should do it
[18:19:36] accraze: progress!
[18:19:37] FileNotFoundError: [Errno 2] No such file or directory: '/mnt/models/model.bin'
[18:19:45] aha!
[18:19:46] I think that we are missing the STORAGE_URI
[18:19:54] and then it should work
[18:20:03] ^^ similar to draftquality issues from yesterday
[18:31:26] accraze: mmm one qs though - do we need the model on the editquality's transformer?
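The OCI error quoted at 16:33:59 names the executable as `python3\"`, that is, python3 followed by a literal double quote, which suggests the container command carried a stray quote character (consistent with the "only a typo" remark at 17:41:47). The sketch below only illustrates that failure mode and is not the actual blubber or chart configuration: it uses Python's standard `shutil.which` to look up a command name with and without a trailing quote. The helper `resolve` and the choice of the running interpreter's directory as the search path are assumptions made for the example.

```python
import os
import shutil
import sys


def resolve(command, search_path):
    """Return the full path of `command` found on `search_path`, or None."""
    return shutil.which(command, path=search_path)


# Use the directory of the running interpreter as a known-good search path
# (an assumption for the example; a container would use its own $PATH).
interp_dir = os.path.dirname(sys.executable)
interp_name = os.path.basename(sys.executable)

# The clean name is found on the search path.
good = resolve(interp_name, interp_dir)

# The same name with a stray trailing quote, as in the error message,
# is a different file name and is not found.
bad = resolve(interp_name + '"', interp_dir)

print("clean name resolves to:", good)
print("name with stray quote resolves to:", bad)
```

A lookup like this would also explain why switching between python3 and python3.7 made no difference: if the quote is part of the name, neither variant exists on the path.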
[18:32:41] yeah :( the revscoring model contains a list of features that need to be extracted
[18:32:43] https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/revscoring/editquality/transformer/editquality_transformer/editquality_transformer.py#47
[18:32:54] ack then :(
[18:34:13] accraze: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/759294
[18:36:58] +1'd
[18:37:36] i think the only way we can avoid loading the revscoring model during the transformer is to store the features somewhere like feast etc.
[18:40:51] Machine-Learning-Team: Update editquality demo jupyter notebook - https://phabricator.wikimedia.org/T300730 (Simonmaignan) In progress→Resolved I opened 2 PRs that are supposed to solve this issue: - [[ https://github.com/wikimedia/editquality/pull/237 | editquality PR ]] that update the Jupyter...
[18:44:50] Machine-Learning-Team: Update editquality demo jupyter notebook - https://phabricator.wikimedia.org/T300730 (Simonmaignan) I am now able to run entirely the Jupyter demo notebook with the 2 PRs mentioned above. However, I noticed that, even when the model is trained and tested without error, its performance is...
[18:47:19] Machine-Learning-Team: Update editquality demo jupyter notebook - https://phabricator.wikimedia.org/T300730 (Simonmaignan) Resolved→In progress Sorry, I guess I wasn't supposed to change the task status to resolved before the PRs are merged.
[18:53:03] accraze: it is all good experience that we are making with transformers etc.. but at this point it seems not very worth it for revscoring-based models
[18:53:32] agreed.... especially now that the model needs to be loaded twice :/
[18:54:15] (doesn't work, I'll restart tomorrow morning, need to go :)
[18:54:18] * elukey afk!
[18:54:31] see ya elukey!
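The point made at 18:32:41 and 18:37:36, that the transformer loads the model only to learn which features to extract, can be sketched as follows. This is a hypothetical illustration, not how inference-services works: `StandInModel`, `export_feature_names` and `load_feature_names` are invented names, the feature names are made up, and a plain JSON file stands in for whatever store (Feast was the suggestion) would hold the list. Real feature definitions may well be richer than plain strings, so an actual version would likely need more than this.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import List


@dataclass
class StandInModel:
    """Hypothetical stand-in for a model object that carries its feature list."""
    features: List[str]


def export_feature_names(model: StandInModel, path: str) -> None:
    """Write the model's feature names to a JSON file, once, at publish time."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model.features, fh)


def load_feature_names(path: str) -> List[str]:
    """Read the feature names back without touching the model binary."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Made-up feature names, for illustration only.
model = StandInModel(features=["chars_added", "is_anonymous_user"])

with tempfile.TemporaryDirectory() as tmp:
    sidecar = os.path.join(tmp, "features.json")
    export_feature_names(model, sidecar)
    feature_names = load_feature_names(sidecar)

print(feature_names)
```

With something along these lines only the predictor would need the model binary, which is the double loading the conversation is unhappy about.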
[22:57:58] Lift-Wing, Machine-Learning-Team (Active Tasks): ML Sandbox Transformer Configuration - https://phabricator.wikimedia.org/T299972 (ACraze) I think I've got the networking issue solved. The top-level isvc was unable to route to the transformer, because my cluster-local-gateway did not have the ports confi...
[23:00:31] Lift-Wing, Machine-Learning-Team (Active Tasks): ML Sandbox Transformer Configuration - https://phabricator.wikimedia.org/T299972 (ACraze) @kevinbazira - can you try hitting enwiki-articlequality on the ml-sandbox to confirm the transformer routing works for you too? I have a test script in my home_dir i...