[06:53:12] o/ [07:15:54] morning o/ [07:17:27] I am using tracemalloc to find the memory leak, so far the highest consumer (that keeps increasing) in my local tests is mwparserfromhell :( [07:17:39] more specifically, https://github.com/earwig/mwparserfromhell/blob/main/src/mwparserfromhell/parser/__init__.py#L84 [07:18:12] (tracemalloc is awesome, I simply used https://docs.python.org/3/library/tracemalloc.html#display-the-top-10) [07:18:52] aiko: we don't use mwparserfromhell directly IIRC, so your theory about a leak in the model may be the right one :( [07:19:39] I'll leave my test running for a bit to confirm [07:19:45] isaranto: --^ [07:22:59] Mornin' [07:23:56] The question is: if this is a leak in the model, why did it never happen on ORES? Or are we using a forked/change version? [07:26:39] it could also be in revscoring now that I think about it [07:27:31] klausman: not sure, but IIRC uwsgi process in ORES have a limited timespan (namely after X requests they are killed) [07:27:36] morning :) [07:30:00] it is a slow leak [07:30:15] but maybe some rev-ids trigger the allocation of more memory [07:32:35] Hmmm. If the model/revscoring loads the diff, that would explain it, since different revs have differen diff sizes [07:33:33] we do some preprocessing with revscoring too, so it may be there [07:33:41] trying to get a trace of the calls [07:33:59] so far I only see something like [07:34:00] /opt/lib/python/site-packages/mwparserfromhell/parser/__init__.py:84: size=34.5 MiB, count=435284, average=83 B [07:44:26] 10Machine-Learning-Team, 10Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (10kevinbazira) @Isaac, thank you for the pointer. I have tested the recommendation-api with `related_articles` switched off and it ran without the errors. Below are the s... [07:58:11] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) I created the following test environment (locally): * Python script that reads from mediawiki.revision-create in Event Stream and calls Docker locally. * Instrume... [07:59:50] * elukey commuting [08:03:51] Good catch. I never found anything suspicious in the model [08:04:49] (03PS1) 10Kevin Bazira: switch off related_articles endpoint for the LiftWing instance [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/958971 (https://phabricator.wikimedia.org/T339890) [08:05:13] elukey: how did u obtain memory usage? Iiuc it is done with tracemalloc? [08:05:51] We can use memory_profiler and it will print line by line memory usage. I just couldn't make it work with async. So with this it will be ok [08:19:03] isaranto: exactly tracemalloc, but it indicates only the line, I am trying to get the trace [08:19:16] after that we should be able to pin point where the call happens [08:21:26] isaranto: https://docs.python.org/3/library/tracemalloc.html#display-the-top-10 this is very handy, I initialized it in the __init__ and ran snapshot in predict() [08:21:41] after a bit mwparserfromhell goes up [08:21:46] and it slowly leaks [08:37:56] https://www.irccloud.com/pastebin/oh9XV2mx/ [08:40:36] does it work for you? 
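The tracemalloc recipe described above (start tracing when the model object is built, take a snapshot inside predict() and print the top allocations, as in the "Display the top 10" example from the Python docs) boils down to roughly the following. This is a minimal sketch only; the class and method names are illustrative stand-ins, not the actual model.py.

    import tracemalloc


    class DummyModel:
        # Illustrative stand-in for the real model server class, not the actual model.py.

        def __init__(self):
            # Keep 25 frames per allocation so call stacks can be inspected later.
            tracemalloc.start(25)

        def predict(self, request):
            scores = {"rev_id": request.get("rev_id")}  # placeholder for the real scoring work
            snapshot = tracemalloc.take_snapshot()
            # Same idea as the "Display the top 10" example in the tracemalloc docs.
            for stat in snapshot.statistics("lineno")[:10]:
                print(stat)
            return scores


    if __name__ == "__main__":
        DummyModel().predict({"rev_id": 12345})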
because 'lineno' in statistics returns only one line to me [08:40:40] I am using 'traceback' [08:41:13] but now I am filtering out 'tracemalloc' because it pops up at the top of the stack :D [08:41:49] if I use 'traceback' I don't see at the top mwparserfromhell though [08:42:49] a yes, I used this at the end of the predict function when running kserve locally [08:43:37] yep yep but do you get more than one line in the prints? [08:44:16] yes [08:44:33] https://www.irccloud.com/pastebin/MTOqOJyH/ [08:46:59] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10isarantopoulos) The only reference I find in revscoring package is [[ https://github.com/wikimedia/revscoring/blob/f548b5d8cc8d414c53ec2d0be2d4d049c880484f/revscoring/feat... [08:47:02] isaranto: yep yep sorry I didn't explain myself - I am trying to get the traceback of the top one [08:47:26] basically the call stack to figure out the chain of calls [08:47:44] but so far I am failing [08:47:47] ack [08:48:25] IIUC https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/parsed.py#L262 should be called [08:48:44] that is called in turn by Revision's init [08:48:51] I dont know a way to get the traceback automatically but I know how to do it using memory_profiler. You anotate every function you want to profile so you navigate the stack manually (hope what I'm saying makes some sense) [08:48:55] but who instanciates revision etc.. [08:49:09] yesyes could be a test to do [08:52:40] https://github.com/earwig/mwparserfromhell/blob/main/src/mwparserfromhell/utils.py#L73 [08:54:32] sth must be mutable in there in some way and is increasing. I'll try memory_profiler in that one [08:54:44] super [08:55:54] in the last release of mwparserfromhell they added https://github.com/earwig/mwparserfromhell/pull/303 [09:00:05] ah wait but we should use it already in theory, given our requirements.txt [09:00:07] mmm [09:00:53] (03CR) 10Elukey: [C: 03+1] switch off related_articles endpoint for the LiftWing instance [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/958971 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [09:01:14] isaranto: do you want to keep working in here (so we can do our tests etc..) or meet? [09:01:59] lets have a quick meet [09:03:39] https://meet.google.com/sog-caqz-mpc?authuser=0 [09:04:06] (whoever wants can join) [09:29:37] (03CR) 10Kevin Bazira: [C: 03+2] switch off related_articles endpoint for the LiftWing instance [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/958971 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [09:30:09] (03Merged) 10jenkins-bot: switch off related_articles endpoint for the LiftWing instance [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/958971 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [09:36:03] I think we're getting there.. [09:36:09] https://www.irccloud.com/pastebin/78ngChxt/ [09:36:52] 10Machine-Learning-Team, 10Patch-For-Review: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (10elukey) Last changes applied, we should be good to close! [09:37:14] This is as sample of the profiling I get. 
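The line-by-line numbers in the paste above come from memory_profiler's @profile decorator, applied to each function one wants to inspect, as isaranto describes. A minimal sketch, with the parsing function here acting only as a stand-in for the real revscoring code path:

    import mwparserfromhell
    from memory_profiler import profile


    @profile  # prints per-line memory usage every time the function runs
    def parse_revision(wikitext: str):
        # Stand-in for the code path that eventually calls the parser.
        wikicode = mwparserfromhell.parse(wikitext)
        return wikicode.filter_templates()


    if __name__ == "__main__":
        parse_revision("{{Infobox|name=Example}} some article text")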
for example in this request after calling parse method we see an increment of 2.7MB in memory which isn't cleared after next request is scored [09:37:43] wow nice [09:37:49] 10Machine-Learning-Team: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 (10klausman) a:03klausman [09:38:11] now I'm doing the same one level down for the `def parse_anything(value, context=0, skip_style_tags=False):` function (it is an alias for parse) [09:39:37] 10Machine-Learning-Team: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 (10klausman) I've done 1002-1008 today, and everything went smoothly. All done! [09:39:47] Kubelet partition size increase all done! [09:46:31] nice! [09:48:18] Remind me, will we also have to tweak partman, in case we reinstall machines or add workers? [09:49:39] could be an option yes, or we remember to do it when we create new workers [09:49:56] we share the recipe with serviceops so we can't change it, in case we'd need to duplicate etc.. [09:50:11] That's a tough choice between "dealing with partman" and "have to remember random factoid when installing" :)) [09:50:29] But then again, even if we forget, it's not a super disruptive fix. [09:51:14] Hmmm. Maybe add a check to prom that alerts if our /var/lib/kubelet is smaller than 50G or so. [09:51:28] easier to constrain that to just ML machines [09:52:41] not sure, the best is probably the partman recipe [09:52:48] isaranto: proceeding with https://gerrit.wikimedia.org/r/c/operations/dns/+/957689, ok? [09:52:55] even if it is a day earlier [09:53:10] klausman: do you want to rollout the dns change? [09:53:21] can do! [09:54:46] let's do it! [09:55:08] super, so the procedure is simple - merge, jump on a auth-dns server node and run `sudo -i authdns-update` [09:55:36] what are the auth dns server names again? [09:56:02] this bit is left for exercise :) [09:57:36] I'd expect authdnsXXXX, but my known hosts file doesn't have that [09:58:01] I'd check puppet's site.pp [10:00:06] ah, so this changed earlier this year [10:00:46] So any dnsXXXX will do? [10:01:01] yep [10:01:10] ok, will merge and run update [10:01:26] and also !log it in #operations (with the ref to the task) [10:01:32] ack [10:04:08] And done. No errors. [10:04:47] $ dig ores.wikimedia.org [10:04:51] ores.wikimedia.org. 300 IN CNAME dyna.wikimedia.org. [10:05:19] super [10:07:22] klausman: could you also update our ml-team's slack thread? [10:07:26] where we listed the steps [10:09:21] ack [10:12:57] isaranto: found a way to get the stacktrace, seems all starting from [10:13:04] feature_values = list(extractor.extract(rev_id, model_features, cache=cache)) [10:13:07] wait, I have a question if right now we have ores -cname-> dyna and ores-legacy -cname-> dyna. If change 95769 just changes it to ores -cname-> ores-legacy, isn't that basically a noop? [10:15:56] elukey: so it is not from revscoring package? [10:16:21] it is yes, lemme paste one (sometimes it varies a little) [10:16:24] or u mean it starts from there? [10:16:36] exactly yes [10:17:16] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) One trace sample from `tracemalloc`: ` File "/srv/rev/model-server/model.py", line 24 kserve.ModelServer(workers=1).start([model]) File "/opt/lib/python/s... 
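The trace posted in the task comment above is what tracemalloc produces once the snapshot statistics are grouped by 'traceback' instead of 'lineno', with tracemalloc's own frames filtered out as mentioned earlier. A rough, self-contained sketch of that step (in the real test the tracing is started in the model's __init__ and the suspicious code runs in predict()):

    import tracemalloc

    tracemalloc.start(25)   # in the real test this happens in the model's __init__

    # ... run the code under suspicion here ...
    junk = ["x" * 1000 for _ in range(1000)]   # stand-in allocation so the script prints something

    snapshot = tracemalloc.take_snapshot()
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, tracemalloc.__file__),              # hide tracemalloc's own frames
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
    ))
    top_stats = snapshot.statistics("traceback")   # group by full call stack, not just line number
    if top_stats:
        stat = top_stats[0]
        print(f"top allocator: {stat.count} blocks, {stat.size / 1024:.1f} KiB")
        for line in stat.traceback.format():
            print(line)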
[10:17:23] isaranto: --^ [10:17:43] klausman: it is a very good point, I assumed it would have worked but I'd need to check [10:19:18] so dyna points to our LVS external IPs [10:19:57] And I suspect LVS makes the routing decision based on the Host: header? [10:19:58] we end up in the caching layer, that routes the request accordingly [10:20:08] nono LVS is a L4 LB [10:20:37] Ah, but how does the caching layer know which backend svc to use? [10:21:21] it is an ATS setting, we have this in puppet [10:21:24] - type: map [10:21:24] target: http://ores.wikimedia.org [10:21:24] replacement: https://ores.discovery.wmnet [10:21:24] - type: map [10:21:24] target: http://ores-legacy.wikimedia.org [10:21:26] replacement: https://ores-legacy.discovery.wmnet:31443 [10:21:33] so we'll need to file a puppet change [10:21:39] lemme me self -1 me [10:22:57] we need a HTTP redirect [10:23:05] there should be a service for it [10:29:56] Also note that in role/common/cache/text.yaml, the caching settings for ores and ores-kegacy are different (probably intentional, but if we switch to point one at the other we might need to update it) [10:30:28] what is the difference? [10:32:17] ores is set to normal, the other to pass [10:32:35] makes sense yes [10:32:35] I suspect "pass" means to not cache at all [10:32:39] yep [10:32:54] Thing is, I don't know if ATS ever caches POST requests [10:33:02] The the difference might be moot. [10:33:10] s/The// [10:33:34] anyway, I'm off to errands and lunch, ttyl [10:33:34] it is varnish+ATS the combination, but I don't think that they'll allow us to do it [10:33:37] too much variations [10:33:51] but it is fine (this is what we are discussing in the caching task) [10:34:02] good! :) [10:45:30] * elukey lunch! [10:47:43] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10isarantopoulos) I'm testing with a specific rev_id I found that causes a memory increase: 153877972 https://phabricator.wikimedia.org/P52539 As you can see it seems that... [10:47:56] how did you get the stack trace with tracemalloc? [10:48:04] nevermind , after lunch! [11:02:47] * isaranto lunch [13:09:27] isaranto: yep exactly with tracemalloc [13:13:09] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) I left the code running for a bit, I wanted to test disabling caching in revscoring, I ended up with the following (more precise) trace: ` File "/opt/lib/python/... [13:15:22] yes, but how do you obtain that ? with the top 10 line numbers? or sth else? [13:16:28] isaranto: added it to the task [13:16:46] a ok thanks! super cool [13:18:01] what makes me wonder is the following : I have been testing with just one rev_id and the increase in memory seems random. one time it happens and the other not [13:18:12] which is super weird [13:20:07] it is an easter egg in revscoring :D [13:20:24] it is so poetic that we found a major leak just before deprecating ores [13:20:43] the revscoring's _solve function is a mess [13:24:40] 10Machine-Learning-Team, 10Commons: Utilize ChatGPT for categorizing and extracting metadata from files on Commons - https://phabricator.wikimedia.org/T345898 (10calbon) Hi Hoi! Unfortunately we can't provide free access to ChatGPT, however we are working on hosting large language models on WMF's infrastructur... 
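For reference, the switch discussed above would presumably mean pointing the ores.wikimedia.org mapping at the ores-legacy backend rather than relying on the CNAME. A purely illustrative sketch in the same format as the ATS snippet quoted earlier (the real change goes through Traffic review and may look different):

    - type: map
      target: http://ores.wikimedia.org
      replacement: https://ores-legacy.discovery.wmnet:31443
    - type: map
      target: http://ores-legacy.wikimedia.org
      replacement: https://ores-legacy.discovery.wmnet:31443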
[13:28:55] I've tried explicitly clearing inputs and cache (assigning to None) but doesnt do anything. afaiu the issue is with async calls [13:29:11] async calls? [13:29:23] we don't do any with revscoring [13:29:28] (03CR) 10Klausman: fix: revscoring model server inputs (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446) (owner: 10Ilias Sarantopoulos) [13:32:11] not with revscoring within the model server I mean. in a synchronous setting releasing the variables from memory would do the trick (although it is a hack) [13:41:22] (03PS1) 10Elukey: revscoring: consume all data from the extractor [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959221 (https://phabricator.wikimedia.org/T346445) [13:42:11] interesting! [13:42:24] lemme try it locally to see what's going on with memory [13:42:44] it is a shot in the dark, I still need to run more extensive tests, buuuut it may help [13:44:39] until now it is our best bet. I haven't found any memory leaks in any revscoring function until now. even _solve doesnt increase memory usage [13:47:00] trying to run some long tests [13:48:16] I see still issues [13:48:44] https://phabricator.wikimedia.org/P52542 [13:52:37] same issues? (sorry it is difficult to parse the traces) [13:53:03] on my side I see no rapid increase of mwparserfromhell [13:54:06] klausman: the suggestion from Traffic was to switch the ATS config [13:56:18] elukey: from my investigation so far the issue is with the generator as you suggested. I can explain what these memory traces mean in person [13:56:49] elukey: So the target/replacement thing you pasted? SGTM [13:56:49] but they show that although memory usage increases it is not caused by the extractor.py [13:57:02] * the extract function [13:58:22] isaranto: ack, so IIUC we can merge or do you prefer to wait? [13:58:50] no ok now I see mwparserfromhell in my tests at the top [13:58:58] seems slower than before [13:59:07] I tried the suggestion in gerrit and it doesnt seem to help [14:48:38] (03CR) 10Ilias Sarantopoulos: "According to some memory profiling I ran locally trying out this change, it seems that the issue persists" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959221 (https://phabricator.wikimedia.org/T346445) (owner: 10Elukey) [14:51:31] 10Machine-Learning-Team, 10Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) Something really interesting: the following rev-id (https://es.wikipedia.org/w/index.php?diff=153880256) causes a big jump in the size of mem... [14:51:49] isaranto: --^ [14:53:18] great! at least we have a good revid candidate to test [14:54:22] just verified it as well, causes approx 2MB increase [14:55:14] the count of objects doubles, I am wondering if we can dump those [15:12:31] 10Machine-Learning-Team, 10Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10isarantopoulos) It seems that we don't use mwparserfromhell version 0.6.5 that has [[ https://github.com/earwig/mwparserfromhell/pull/303 | this impr... [15:12:48] elukey: --^ found it! [15:13:16] the solution was where you pointed earlier today. thanks for spotting that PR! [15:15:18] * elukey cries in a corner [15:15:30] pepito found it! [15:15:48] good boy pepito! 
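The "one rev_id that reliably bumps memory" verification described above can be scripted. A rough sketch, assuming the model server is running locally in Docker on the usual KServe port and accepts a {"rev_id": ...} payload; the model name, port and container name below are placeholders:

    import subprocess

    import requests

    URL = "http://localhost:8080/v1/models/eswiki-damaging:predict"   # placeholder model name/port
    CONTAINER = "revscoring-model"                                     # placeholder container name

    for i in range(20):
        resp = requests.post(URL, json={"rev_id": 153880256}, timeout=60)
        mem = subprocess.run(
            ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", CONTAINER],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(f"request {i}: HTTP {resp.status_code}, container memory {mem}")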
[15:16:01] 🐶 [15:16:11] opening a PR to make a new release for revscoring [15:17:20] the other thing is that we are in the middle of upgrading kserve to 0.11 so we'll have to revert unless anyone has a nicer idea [15:17:33] yeah let's do it [15:17:49] we fix the leak, and then we upgrade [15:18:01] lemme revert [15:18:12] then you can apply the new revscoring, does it sound good? [15:18:41] it is weird that we use 0.6.4 though [15:19:06] isaranto: revscoring/requirements.txt:mwparserfromhell==0.6.4 [15:19:15] we don't need another release [15:19:20] it is our requirements.txt [15:19:56] 10Machine-Learning-Team, 10Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) We set it in our requirements.txt :( ` revscoring/requirements.txt:mwparserfromhell==0.6.4 ` [15:20:20] yes, lets bump that one [15:20:35] (03Abandoned) 10Elukey: revscoring: consume all data from the extractor [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959221 (https://phabricator.wikimedia.org/T346445) (owner: 10Elukey) [15:20:59] and I'll make the revscoring release later to enorce future use of that version [15:21:02] *enforce [15:21:12] (03PS1) 10Elukey: Revert "Upgrade revscoring images to KServe 0.11" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959023 [15:21:35] ok so reverting the kserve 0.11 upgrade [15:22:04] I can make the new patch and test it afterwards [15:23:47] I recall that we had to state all dependencies due to a slow performance in pip resolving deps [15:23:51] but this is crazy [15:23:53] Nice catch, guys! [15:23:58] maybe we don't need this mess anymore? [15:23:59] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959023 (owner: 10Elukey) [15:25:15] it depends on how you look at it. I like explicit dependency on the final application as you can debug more easily. otherwise you have no idea which version is used on your app from a specific commit [15:25:39] yeah but we can end up with stale deps [15:25:50] I'm not strong about it [15:26:29] CI is taking ages :) [15:26:41] we can allow patches (since all packages use semver) [15:27:00] so going from 0.6.4 to 0.6.5 would have happened automatically. [15:27:20] We're super lucky the new release was 2 weeks ago 😛 [15:29:08] (03CR) 10Elukey: [C: 03+2] Revert "Upgrade revscoring images to KServe 0.11" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959023 (owner: 10Elukey) [15:29:13] definitely [15:29:27] isaranto: let's allow patch versions for mwparserfromhell [15:29:35] ack [15:29:53] (03Merged) 10jenkins-bot: Revert "Upgrade revscoring images to KServe 0.11" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959023 (owner: 10Elukey) [15:31:32] isaranto: change merged, ready for the fix :) [15:31:51] great! I am building the image locally to test [15:35:01] in the meantime https://github.com/wikimedia/revscoring/pull/549 [15:35:48] approved [15:37:14] reviews faster than CI [15:37:43] isaranto: I totally forgot that we now auto-publish revscoring [15:40:53] :) [15:43:31] (03PS1) 10Ilias Sarantopoulos: revscoring: upgrade mwparserfromhell to solve memory leak [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959271 (https://phabricator.wikimedia.org/T346445) [15:43:51] still building the local image... 
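On the earlier "let's allow patch versions for mwparserfromhell" point: one way to express that in requirements.txt is a compatible-release pin instead of an exact one. Illustrative only; the actual patch may simply bump the exact pin to 0.6.5:

    # before: exact pin that silently kept the leaky release around
    mwparserfromhell==0.6.4
    # after: accept any 0.6.x patch release from 0.6.5 onwards
    mwparserfromhell~=0.6.5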
[15:45:44] revoscoring package 2.11.13 is out , we can include it in the kserve upgrade since we should keep the above fix minimal [15:45:50] *revscoring [15:45:53] (03CR) 10Elukey: "Let's also bump revscoring's version :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959271 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [15:46:10] yeah exactly [15:46:19] no sorry, let's do it now [15:46:23] just to be sure [15:46:29] if we bump revscoring we'll have to bump numpy etc [15:46:36] yeah yeah sorry PEBCAK [15:46:45] (03CR) 10Elukey: [C: 03+1] revscoring: upgrade mwparserfromhell to solve memory leak [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959271 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [15:46:54] go go go [15:46:57] we are now using version 2.11.10 https://github.com/wikimedia/revscoring/blob/master/CHANGELOG.md [15:48:38] waiting for my image to build just to be sure (in case there is some other mwparser dependency that would cause 0.6.4 to be installed). just to be sure! [15:49:17] elukey: I'll merge and release [15:49:26] i'll be around as I have another meeting in 10' [15:49:56] isaranto: I can take care of the rollout if you want [15:50:01] it is late, don't worry [15:50:08] once the docker image is up I'll deploy [15:51:56] but I'll be around so I can do it dont worry [15:53:53] (03CR) 10Ilias Sarantopoulos: [C: 03+2] revscoring: upgrade mwparserfromhell to solve memory leak [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959271 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [15:54:46] (03Merged) 10jenkins-bot: revscoring: upgrade mwparserfromhell to solve memory leak [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959271 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [15:57:20] making the patch in dep charts [16:08:21] anything I can help with? [16:11:38] isaranto, klausman - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959309/ [16:11:54] (Ilias take it easy you are in a meeting, we can deploy :) [16:14:12] I think that we can just do damaging and goodfaith for now [16:14:16] and do the rest tomorrow [16:16:39] lol I created a patch as well, feel free to abandon it [16:17:03] folks let's try to coordinate in here :) [16:17:47] isaranto: I also removed damaging's staging settings, since it was using another image [16:17:54] not sure if we want to fix it as well or not [16:18:25] I was using it to deploy kserve 0.11 [16:18:33] feel free to do anything with it :) [16:20:13] rolling out your patch :) [16:21:48] thanks! [16:24:39] elukey: +1'd the DNS rollback [16:25:07] klausman: if you want you can roll it out [16:25:12] will do [16:25:13] sorry for the trouble [16:25:17] Should I merge it or will you? [16:25:21] no worries! [16:25:30] go ahead [16:25:31] :) [16:25:34] ack [16:27:52] I'm here again [16:28:18] want me to deploy anything? [16:28:25] almost done [16:28:45] I am only doing goodfaith/damaging [16:28:49] we can do the rest tomorrow [16:29:24] ack [16:29:29] aaand rolled out [16:30:18] 10Machine-Learning-Team, 10Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) Fix deployed to goodfaith/damaing pod environments. Let's double check tomorrow that the memory metrics are stable and then we can roll out t... 
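A rough sketch of the "double check tomorrow that the memory metrics are stable" step, assuming a reachable Prometheus endpoint that scrapes the Lift Wing containers; the URL, namespace and label values below are placeholders, not the real ones:

    import requests

    PROM = "http://prometheus.example.internal/api/v1/query"   # placeholder Prometheus URL
    QUERY = (
        'container_memory_working_set_bytes'
        '{namespace="revscoring-editquality-damaging", container="kserve-container"}'
    )

    resp = requests.get(PROM, params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        pod = series["metric"].get("pod", "?")
        mib = float(series["value"][1]) / 2**20
        print(f"{pod}: {mib:.1f} MiB")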
[16:31:55] isaranto: all done! [16:32:04] I don't see weird errors in logstash [16:32:49] container metrics for eswiki are good [16:32:53] soooo I'd call it a day :) [16:33:08] great work team! [16:33:17] yep! \o/ [16:33:22] have a nice rest of the day folks! [16:36:12] logging off as well \o/ [16:49:03] \o [17:03:16] niceeee \o/ good job everyone!! [18:30:30] 10Machine-Learning-Team, 10Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (10Isaac) > I have tested the recommendation-api with related_articles switched off and it ran without the errors. Great to hear! Thanks!