[06:53:12] <elukey>	 o/
[07:15:54] <aiko>	 morning o/
[07:17:27] <elukey>	 I am using tracemalloc to find the memory leak, so far the highest consumer (that keeps increasing) in my local tests is mwparserfromhell :(
[07:17:39] <elukey>	 more specifically, https://github.com/earwig/mwparserfromhell/blob/main/src/mwparserfromhell/parser/__init__.py#L84
[07:18:12] <elukey>	 (tracemalloc is awesome, I simply used https://docs.python.org/3/library/tracemalloc.html#display-the-top-10)
[07:18:52] <elukey>	 aiko: we don't use mwparserfromhell directly IIRC, so your theory about a leak in the model may be the right one :(
[07:19:39] <elukey>	 I'll leave my test running for a bit to confirm
[07:19:45] <elukey>	 isaranto: --^
[07:22:59] <klausman>	 Mornin'
[07:23:56] <klausman>	 The question is: if this is a leak in the model, why did it never happen on ORES? Or are we using a forked/change version?
[07:26:39] <elukey>	 it could also be in revscoring now that I think about it
[07:27:31] <elukey>	 klausman: not sure, but IIRC uwsgi process in ORES have a limited timespan (namely after X requests they are killed)
[07:27:36] <elukey>	 morning :)
[07:30:00] <elukey>	 it is a slow leak
[07:30:15] <elukey>	 but maybe some rev-ids trigger the allocation of more memory
[07:32:35] <klausman>	 Hmmm. If the model/revscoring loads the diff, that would explain it, since different revs have differen diff sizes
[07:33:33] <elukey>	 we do some preprocessing with revscoring too, so it may be there
[07:33:41] <elukey>	 trying to get a trace of the calls 
[07:33:59] <elukey>	 so far I only see something like
[07:34:00] <elukey>	 /opt/lib/python/site-packages/mwparserfromhell/parser/__init__.py:84: size=34.5 MiB, count=435284, average=83 B
[07:44:26] <wikibugs>	 10Machine-Learning-Team, 10Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (10kevinbazira) @Isaac, thank you for the pointer. I have tested the recommendation-api with `related_articles` switched off and it ran without the errors.  Below are the s...
[07:58:11] <wikibugs>	 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) I created the following test environment (locally):  * Python script that reads from mediawiki.revision-create in Event Stream and calls Docker locally. * Instrume...
[07:59:50] * elukey commuting
[08:03:51] <isaranto>	 Good catch. I never found anything suspicious in the model 
[08:04:49] <wikibugs>	 (03PS1) 10Kevin Bazira: switch off related_articles endpoint for the LiftWing instance [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/958971 (https://phabricator.wikimedia.org/T339890)
[08:05:13] <isaranto>	 elukey: how did u obtain memory usage? Iiuc it is done with tracemalloc? 
[08:05:51] <isaranto>	 We can use memory_profiler and it will print line by line memory usage. I just couldn't make it work with async. So with this it will be ok
[08:19:03] <elukey>	 isaranto: exactly tracemalloc, but it indicates only the line, I am trying to get the trace
[08:19:16] <elukey>	 after that we should be able to pin point where the call happens
[08:21:26] <elukey>	 isaranto: https://docs.python.org/3/library/tracemalloc.html#display-the-top-10 this is very handy, I initialized it in the __init__ and ran snapshot in predict()
[08:21:41] <elukey>	 after a bit mwparserfromhell goes up
[08:21:46] <elukey>	 and it slowly leaks
[08:37:56] <isaranto>	 https://www.irccloud.com/pastebin/oh9XV2mx/
[08:40:36] <elukey>	 does it work for you? because 'lineno' in statistics returns only one line to me
[08:40:40] <elukey>	 I am using 'traceback'
[08:41:13] <elukey>	 but now I am filtering out 'tracemalloc' because it pops up at the top of the stack :D
[08:41:49] <elukey>	 if I use 'traceback' I don't see at the top mwparserfromhell though
[08:42:49] <isaranto>	 a yes, I used this at the end of the predict function when running kserve locally
[08:43:37] <elukey>	 yep yep but do you get more than one line in the prints?
[08:44:16] <isaranto>	 yes
[08:44:33] <isaranto>	 https://www.irccloud.com/pastebin/MTOqOJyH/
[08:46:59] <wikibugs>	 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10isarantopoulos) The only reference I find in revscoring package is [[ https://github.com/wikimedia/revscoring/blob/f548b5d8cc8d414c53ec2d0be2d4d049c880484f/revscoring/feat...
[08:47:02] <elukey>	 isaranto: yep yep sorry I didn't explain myself - I am trying to get the traceback of the top one
[08:47:26] <elukey>	 basically the call stack to figure out the chain of calls
[08:47:44] <elukey>	 but so far I am failing
[08:47:47] <isaranto>	 ack
[08:48:25] <elukey>	 IIUC https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/parsed.py#L262 should be called
[08:48:44] <elukey>	 that is called in turn by Revision's init
[08:48:51] <isaranto>	 I dont know a way to get the traceback automatically but I know how to do it using memory_profiler. You anotate every function you want to profile so you navigate the stack manually (hope what I'm saying makes some sense)
[08:48:55] <elukey>	 but who instanciates revision etc..
[08:49:09] <elukey>	 yesyes could be a test to do
[08:52:40] <isaranto>	 https://github.com/earwig/mwparserfromhell/blob/main/src/mwparserfromhell/utils.py#L73
[08:54:32] <isaranto>	 sth must be mutable in there in some way and is increasing. I'll try memory_profiler in that one
[08:54:44] <elukey>	 super
[08:55:54] <elukey>	 in the last release of mwparserfromhell they added https://github.com/earwig/mwparserfromhell/pull/303
[09:00:05] <elukey>	 ah wait but we should use it already in theory, given our requirements.txt
[09:00:07] <elukey>	 mmm
[09:00:53] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] switch off related_articles endpoint for the LiftWing instance [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/958971 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira)
[09:01:14] <elukey>	 isaranto: do you want to keep working in here (so we can do our tests etc..) or meet?
[09:01:59] <isaranto>	 lets have a quick meet
[09:03:39] <isaranto>	 https://meet.google.com/sog-caqz-mpc?authuser=0
[09:04:06] <isaranto>	 (whoever wants can join)
[09:29:37] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+2] switch off related_articles endpoint for the LiftWing instance [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/958971 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira)
[09:30:09] <wikibugs>	 (03Merged) 10jenkins-bot: switch off related_articles endpoint for the LiftWing instance [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/958971 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira)
[09:36:03] <isaranto>	 I think we're getting there..
[09:36:09] <isaranto>	 https://www.irccloud.com/pastebin/78ngChxt/
[09:36:52] <wikibugs>	 10Machine-Learning-Team, 10Patch-For-Review: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (10elukey) Last changes applied, we should be good to close!
[09:37:14] <isaranto>	 This is as sample of the profiling I get. for example in this request after calling parse method we see an increment of 2.7MB in memory which isn't cleared after next request is scored
[09:37:43] <elukey>	 wow nice
[09:37:49] <wikibugs>	 10Machine-Learning-Team: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 (10klausman) a:03klausman
[09:38:11] <isaranto>	 now I'm doing the same one level down for the `def parse_anything(value, context=0, skip_style_tags=False):` function (it is an alias for parse)
[09:39:37] <wikibugs>	 10Machine-Learning-Team: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 (10klausman) I've done 1002-1008 today, and everything went smoothly. All done!
[09:39:47] <klausman>	 Kubelet partition size increase all done!
[09:46:31] <elukey>	 nice!
[09:48:18] <klausman>	 Remind me, will we also have to tweak partman, in case we reinstall machines or add workers?
[09:49:39] <elukey>	 could be an option yes, or we remember to do it when we create new workers
[09:49:56] <elukey>	 we share the recipe with serviceops so we can't change it, in case we'd need to duplicate etc..
[09:50:11] <klausman>	 That's a tough choice between "dealing with partman" and "have to remember random factoid when installing" :))
[09:50:29] <klausman>	 But then again, even if we forget, it's not a super disruptive fix.
[09:51:14] <klausman>	 Hmmm. Maybe add a check to prom that alerts if our /var/lib/kubelet is smaller than 50G or so.
[09:51:28] <klausman>	 easier to constrain that to just ML machines
[09:52:41] <elukey>	 not sure, the best is probably the partman recipe 
[09:52:48] <elukey>	 isaranto: proceeding with https://gerrit.wikimedia.org/r/c/operations/dns/+/957689, ok?
[09:52:55] <elukey>	 even if it is a day earlier
[09:53:10] <elukey>	 klausman: do you want to rollout the dns change?
[09:53:21] <klausman>	 can do!
[09:54:46] <isaranto>	 let's do it!
[09:55:08] <elukey>	 super, so the procedure is simple - merge, jump on a auth-dns server node and run `sudo -i authdns-update`
[09:55:36] <klausman>	 what are the auth dns server names again?
[09:56:02] <elukey>	 this bit is left for exercise :)
[09:57:36] <klausman>	 I'd expect authdnsXXXX, but my known hosts file doesn't have that
[09:58:01] <elukey>	 I'd check puppet's site.pp
[10:00:06] <klausman>	 ah, so this changed earlier this year
[10:00:46] <klausman>	 So any dnsXXXX will do?
[10:01:01] <elukey>	 yep
[10:01:10] <klausman>	 ok, will merge and run update
[10:01:26] <elukey>	 and also !log it in #operations (with the ref to the task)
[10:01:32] <klausman>	 ack
[10:04:08] <klausman>	 And done. No errors.
[10:04:47] <klausman>	 $ dig ores.wikimedia.org
[10:04:51] <klausman>	 ores.wikimedia.org. 300 IN CNAME dyna.wikimedia.org.
[10:05:19] <elukey>	 super
[10:07:22] <elukey>	 klausman: could you also update our ml-team's slack thread?
[10:07:26] <elukey>	 where we listed the steps
[10:09:21] <klausman>	 ack
[10:12:57] <elukey>	 isaranto: found a way to get the stacktrace, seems all starting from
[10:13:04] <elukey>	 feature_values = list(extractor.extract(rev_id, model_features, cache=cache))
[10:13:07] <klausman>	 wait, I have a question if right now we have ores -cname-> dyna and ores-legacy -cname-> dyna. If change 95769 just changes it to ores -cname-> ores-legacy, isn't that basically a noop?
[10:15:56] <isaranto>	 elukey: so it is not from revscoring package?
[10:16:21] <elukey>	 it is yes, lemme paste one (sometimes it varies a little)
[10:16:24] <isaranto>	 or u mean it starts from there?
[10:16:36] <elukey>	 exactly yes
[10:17:16] <wikibugs>	 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) One trace sample from `tracemalloc`:  `   File "/srv/rev/model-server/model.py", line 24     kserve.ModelServer(workers=1).start([model])   File "/opt/lib/python/s...
[10:17:23] <elukey>	 isaranto: --^
[10:17:43] <elukey>	 klausman: it is a very good point, I assumed it would have worked but I'd need to check
[10:19:18] <elukey>	 so dyna points to our LVS external IPs
[10:19:57] <klausman>	 And I suspect LVS makes the routing decision based on the Host: header?
[10:19:58] <elukey>	 we end up in the caching layer, that routes the request accordingly
[10:20:08] <elukey>	 nono LVS is a L4 LB
[10:20:37] <klausman>	 Ah, but how does the caching layer know which backend svc to use?
[10:21:21] <elukey>	 it is an ATS setting, we have this in puppet
[10:21:24] <elukey>	     - type: map
[10:21:24] <elukey>	       target: http://ores.wikimedia.org
[10:21:24] <elukey>	       replacement: https://ores.discovery.wmnet
[10:21:24] <elukey>	     - type: map
[10:21:24] <elukey>	       target: http://ores-legacy.wikimedia.org
[10:21:26] <elukey>	       replacement: https://ores-legacy.discovery.wmnet:31443
[10:21:33] <elukey>	 so we'll need to file a puppet change
[10:21:39] <elukey>	 lemme me self -1 me
[10:22:57] <elukey>	 we need a HTTP redirect
[10:23:05] <elukey>	 there should be a service for it
[10:29:56] <klausman>	 Also note that in role/common/cache/text.yaml, the caching settings for ores and ores-kegacy are different (probably intentional, but if we switch to point one at the other we might need to update it)
[10:30:28] <elukey>	 what is the difference?
[10:32:17] <klausman>	 ores is set to normal, the other to pass
[10:32:35] <elukey>	 makes sense yes
[10:32:35] <klausman>	 I suspect "pass" means to not cache at all
[10:32:39] <elukey>	 yep
[10:32:54] <klausman>	 Thing is, I don't know if ATS ever caches POST requests
[10:33:02] <klausman>	 The the difference might be moot.
[10:33:10] <klausman>	 s/The//
[10:33:34] <klausman>	 anyway, I'm off to errands and lunch, ttyl
[10:33:34] <elukey>	 it is varnish+ATS the combination, but I don't think that they'll allow us to do it
[10:33:37] <elukey>	 too much variations
[10:33:51] <elukey>	 but it is fine (this is what we are discussing in the caching task)
[10:34:02] <klausman>	 good! :)
[10:45:30] * elukey lunch!
[10:47:43] <wikibugs>	 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10isarantopoulos) I'm testing with a specific rev_id I found that causes a memory increase:  153877972 https://phabricator.wikimedia.org/P52539  As you can see it seems that...
[10:47:56] <isaranto>	 how did you get the stack trace with tracemalloc?
[10:48:04] <isaranto>	 nevermind , after lunch!
[11:02:47] * isaranto lunch
[13:09:27] <elukey>	 isaranto: yep exactly with tracemalloc
[13:13:09] <wikibugs>	 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) I left the code running for a bit, I wanted to test disabling caching in revscoring, I ended up with the following (more precise) trace:  `  File "/opt/lib/python/...
[13:15:22] <isaranto>	 yes, but how do you obtain that ? with the top 10 line numbers? or sth else?
[13:16:28] <elukey>	 isaranto: added it to the task
[13:16:46] <isaranto>	 a ok thanks! super cool
[13:18:01] <isaranto>	 what makes me wonder is the following : I have been testing with just one rev_id and the increase in memory seems random. one time it happens and the other not
[13:18:12] <isaranto>	 which is super weird
[13:20:07] <elukey>	 it is an easter egg in revscoring :D
[13:20:24] <elukey>	 it is so poetic that we found a major leak just before deprecating ores
[13:20:43] <elukey>	 the revscoring's _solve function is a mess
[13:24:40] <wikibugs>	 10Machine-Learning-Team, 10Commons: Utilize ChatGPT for categorizing and extracting metadata from files on Commons - https://phabricator.wikimedia.org/T345898 (10calbon) Hi Hoi! Unfortunately we can't provide free access to ChatGPT, however we are working on hosting large language models on WMF's infrastructur...
[13:28:55] <isaranto>	 I've tried explicitly clearing inputs and cache (assigning to None) but doesnt do anything. afaiu the issue is with async calls
[13:29:11] <elukey>	 async calls?
[13:29:23] <elukey>	 we don't do any with revscoring 
[13:29:28] <wikibugs>	 (03CR) 10Klausman: fix: revscoring model server inputs (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446) (owner: 10Ilias Sarantopoulos)
[13:32:11] <isaranto>	 not with revscoring within the model server I mean. in a synchronous setting releasing the variables from memory would do the trick (although it is a hack)
[13:41:22] <wikibugs>	 (03PS1) 10Elukey: revscoring: consume all data from the extractor [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959221 (https://phabricator.wikimedia.org/T346445)
[13:42:11] <isaranto>	 interesting!
[13:42:24] <isaranto>	 lemme try it locally to see what's going on with memory
[13:42:44] <elukey>	 it is a shot in the dark, I still need to run more extensive tests, buuuut it may help
[13:44:39] <isaranto>	 until now it is our best bet. I haven't found any memory leaks in any revscoring function until now. even _solve doesnt increase memory usage
[13:47:00] <elukey>	 trying to run some long tests
[13:48:16] <isaranto>	 I see still issues
[13:48:44] <isaranto>	 https://phabricator.wikimedia.org/P52542
[13:52:37] <elukey>	 same issues? (sorry it is difficult to parse the traces)
[13:53:03] <elukey>	 on my side I see no rapid increase of mwparserfromhell
[13:54:06] <elukey>	 klausman: the suggestion from Traffic was to switch the ATS config 
[13:56:18] <isaranto>	 elukey: from my investigation so far the issue is with the generator as you suggested. I can explain what these memory traces mean in person
[13:56:49] <klausman>	 elukey: So the target/replacement thing you pasted? SGTM
[13:56:49] <isaranto>	 but they show that although memory usage increases it is not caused by the extractor.py
[13:57:02] <isaranto>	 * the extract function
[13:58:22] <elukey>	 isaranto: ack, so IIUC we can merge or do you prefer to wait?
[13:58:50] <elukey>	 no ok now I see mwparserfromhell in my tests at the top
[13:58:58] <elukey>	 seems slower than before
[13:59:07] <isaranto>	 I tried the suggestion in gerrit and it doesnt seem to help
[14:48:38] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "According to some memory profiling I ran locally trying out this change, it seems that the issue persists" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959221 (https://phabricator.wikimedia.org/T346445) (owner: 10Elukey)
[14:51:31] <wikibugs>	 10Machine-Learning-Team, 10Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) Something really interesting: the following rev-id (https://es.wikipedia.org/w/index.php?diff=153880256) causes a big jump in the size of mem...
[14:51:49] <elukey>	 isaranto: --^
[14:53:18] <isaranto>	 great! at least we have a good revid candidate to test
[14:54:22] <isaranto>	 just verified it as well, causes approx 2MB increase
[14:55:14] <elukey>	 the count of objects doubles, I am wondering if we can dump those
[15:12:31] <wikibugs>	 10Machine-Learning-Team, 10Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10isarantopoulos) It seems that we don't use mwparserfromhell version 0.6.5 that has [[ https://github.com/earwig/mwparserfromhell/pull/303 | this impr...
[15:12:48] <isaranto>	 elukey: --^ found it!
[15:13:16] <isaranto>	 the solution was where you pointed earlier today. thanks for spotting that PR!
[15:15:18] * elukey cries in a corner
[15:15:30] <isaranto>	 pepito found it!
[15:15:48] <elukey>	 good boy pepito!
[15:16:01] <isaranto>	 🐶
[15:16:11] <isaranto>	 opening a PR to make a new release for revscoring
[15:17:20] <isaranto>	 the other thing is that we are in the middle of upgrading kserve to 0.11 so we'll have to revert unless anyone has a nicer idea
[15:17:33] <elukey>	 yeah let's do it 
[15:17:49] <elukey>	 we fix the leak, and then we upgrade
[15:18:01] <elukey>	 lemme revert
[15:18:12] <elukey>	 then you can apply the new revscoring, does it sound good?
[15:18:41] <elukey>	 it is weird that we use 0.6.4 though
[15:19:06] <elukey>	 isaranto: revscoring/requirements.txt:mwparserfromhell==0.6.4
[15:19:15] <elukey>	 we don't need another release
[15:19:20] <elukey>	 it is our requirements.txt
[15:19:56] <wikibugs>	 10Machine-Learning-Team, 10Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) We set it in our requirements.txt :(  ` revscoring/requirements.txt:mwparserfromhell==0.6.4 `
[15:20:20] <isaranto>	 yes, lets bump that one
[15:20:35] <wikibugs>	 (03Abandoned) 10Elukey: revscoring: consume all data from the extractor [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959221 (https://phabricator.wikimedia.org/T346445) (owner: 10Elukey)
[15:20:59] <isaranto>	 and I'll make the revscoring release later to enorce future use of that version
[15:21:02] <isaranto>	 *enforce
[15:21:12] <wikibugs>	 (03PS1) 10Elukey: Revert "Upgrade revscoring images to KServe 0.11" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959023
[15:21:35] <elukey>	 ok so reverting the kserve 0.11 upgrade
[15:22:04] <isaranto>	 I can make the new patch and test it afterwards
[15:23:47] <elukey>	 I recall that we had to state all dependencies due to a slow performance in pip resolving deps
[15:23:51] <elukey>	 but this is crazy
[15:23:53] <klausman>	 Nice catch, guys!
[15:23:58] <elukey>	 maybe we don't need this mess anymore?
[15:23:59] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959023 (owner: 10Elukey)
[15:25:15] <isaranto>	 it depends on how you look at it. I like explicit dependency on the final application as you can debug more easily. otherwise you have no idea which version is used on your app from a specific commit
[15:25:39] <elukey>	 yeah but we can end up with stale deps
[15:25:50] <isaranto>	 I'm not strong about it 
[15:26:29] <elukey>	 CI is taking ages :)
[15:26:41] <isaranto>	 we can allow patches (since all packages use semver)
[15:27:00] <isaranto>	 so going from 0.6.4 to 0.6.5 would have happened automatically.
[15:27:20] <isaranto>	 We're super lucky the new release was 2 weeks ago 😛
[15:29:08] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "Upgrade revscoring images to KServe 0.11" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959023 (owner: 10Elukey)
[15:29:13] <elukey>	 definitely
[15:29:27] <elukey>	 isaranto: let's allow patch versions for mwparserfromhell
[15:29:35] <isaranto>	 ack
[15:29:53] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Upgrade revscoring images to KServe 0.11" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959023 (owner: 10Elukey)
[15:31:32] <elukey>	 isaranto: change merged, ready for the fix :)
[15:31:51] <isaranto>	 great! I am building the image locally to test
[15:35:01] <isaranto>	 in the meantime https://github.com/wikimedia/revscoring/pull/549
[15:35:48] <elukey>	 approved
[15:37:14] <isaranto>	 reviews faster than CI
[15:37:43] <elukey>	 isaranto: I totally forgot that we now auto-publish revscoring
[15:40:53] <isaranto>	 :)
[15:43:31] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: revscoring: upgrade mwparserfromhell to solve memory leak [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959271 (https://phabricator.wikimedia.org/T346445)
[15:43:51] <isaranto>	 still building the local image...
[15:45:44] <isaranto>	 revoscoring package 2.11.13 is out , we can include it in the kserve upgrade since we should keep the above fix minimal
[15:45:50] <isaranto>	 *revscoring
[15:45:53] <wikibugs>	 (03CR) 10Elukey: "Let's also bump revscoring's version :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959271 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos)
[15:46:10] <elukey>	 yeah exactly
[15:46:19] <elukey>	 no sorry, let's do it now
[15:46:23] <elukey>	 just to be sure
[15:46:29] <isaranto>	 if we bump revscoring we'll have to bump numpy etc
[15:46:36] <elukey>	 yeah yeah sorry PEBCAK
[15:46:45] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] revscoring: upgrade mwparserfromhell to solve memory leak [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959271 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos)
[15:46:54] <elukey>	 go go go
[15:46:57] <isaranto>	 we are now using version 2.11.10 https://github.com/wikimedia/revscoring/blob/master/CHANGELOG.md
[15:48:38] <isaranto>	 waiting for my image to build just to be sure (in case there is some other mwparser dependency that would cause 0.6.4 to be installed). just to be sure!
[15:49:17] <isaranto>	 elukey: I'll merge and release
[15:49:26] <isaranto>	 i'll be around as I have another meeting in 10'
[15:49:56] <elukey>	 isaranto: I can take care of the rollout if you want
[15:50:01] <elukey>	 it is late, don't worry
[15:50:08] <elukey>	 once the docker image is up I'll deploy
[15:51:56] <isaranto>	 but I'll be around so I can do it dont worry
[15:53:53] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] revscoring: upgrade mwparserfromhell to solve memory leak [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959271 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos)
[15:54:46] <wikibugs>	 (03Merged) 10jenkins-bot: revscoring: upgrade mwparserfromhell to solve memory leak [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/959271 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos)
[15:57:20] <isaranto>	 making the patch in dep charts
[16:08:21] <klausman>	 anything I can help with?
[16:11:38] <elukey>	 isaranto, klausman - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959309/
[16:11:54] <elukey>	 (Ilias take it easy you are in a meeting, we can deploy :)
[16:14:12] <elukey>	 I think that we can just do damaging and goodfaith for now
[16:14:16] <elukey>	 and do the rest tomorrow
[16:16:39] <isaranto>	 lol I created a patch as well, feel free to abandon it
[16:17:03] <elukey>	 folks let's try to coordinate in here :)
[16:17:47] <elukey>	 isaranto: I also removed damaging's staging settings, since it was using another image
[16:17:54] <elukey>	 not sure if we want to fix it as well or not
[16:18:25] <isaranto>	 I was using it to deploy kserve 0.11
[16:18:33] <isaranto>	 feel free to do anything with it :)
[16:20:13] <elukey>	 rolling out your patch :)
[16:21:48] <isaranto>	 thanks!
[16:24:39] <klausman>	 elukey: +1'd the DNS rollback
[16:25:07] <elukey>	 klausman: if you want you can roll it out
[16:25:12] <klausman>	 will do
[16:25:13] <elukey>	 sorry for the trouble
[16:25:17] <klausman>	 Should I merge it or will you?
[16:25:21] <klausman>	 no worries!
[16:25:30] <elukey>	 go ahead
[16:25:31] <elukey>	 :)
[16:25:34] <klausman>	 ack
[16:27:52] <isaranto>	 I'm here again
[16:28:18] <isaranto>	 want me to deploy anything?
[16:28:25] <elukey>	 almost done
[16:28:45] <elukey>	 I am only doing goodfaith/damaging
[16:28:49] <elukey>	 we can do the rest tomorrow
[16:29:24] <isaranto>	 ack
[16:29:29] <elukey>	 aaand rolled out
[16:30:18] <wikibugs>	 10Machine-Learning-Team, 10Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) Fix deployed to goodfaith/damaing pod environments. Let's double check tomorrow that the memory metrics are stable and then we can roll out t...
[16:31:55] <elukey>	 isaranto: all done!
[16:32:04] <elukey>	 I don't see weird errors in logstash
[16:32:49] <elukey>	 container metrics for eswiki are good
[16:32:53] <elukey>	 soooo I'd call it a day :)
[16:33:08] <isaranto>	 great work team!
[16:33:17] <elukey>	 yep! \o/
[16:33:22] <elukey>	 have a nice rest of the day folks!
[16:36:12] <isaranto>	 logging off as well \o/
[16:49:03] <klausman>	 \o
[17:03:16] <aiko>	 niceeee \o/ good job everyone!!
[18:30:30] <wikibugs>	 10Machine-Learning-Team, 10Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (10Isaac) > I have tested the recommendation-api with related_articles switched off and it ran without the errors. Great to hear! Thanks!