[08:26:29] Machine-Learning-Team, Analytics-Radar, Data-Engineering-Icebox, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (elukey) I followed what outlined in https://github.com/RadeonOpenCompute/ROCm/issues/1125#issuecomment-925362329 and created the two fake packages, i...
[09:08:19] Machine-Learning-Team, Spike: [Spike] Run models and frameworks on AMD GPU and identify challenges - https://phabricator.wikimedia.org/T334583 (isarantopoulos)
[10:01:29] progress: I was able to install the rocm 5.4 packages on dse-k8s-worker, but of course now tensorflow doesn't work :D
[10:03:05] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 (kevinbazira)
[10:03:46] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 (kevinbazira) The conclusion on the backtesting results is that most of the languages look fine besides: - shnwiki has a low precision (0.50) and recal...
[10:30:24] * elukey lunch!
[12:22:44] Machine-Learning-Team, MediaWiki-extensions-ORES: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (isarantopoulos) I've thought how we could tackle this and there are 3 strategies I can think of: 1. Extend ORESServices objects to allow them to use Lif...
[13:03:18] I added what we discussed about the ORES extension in the ticket. There is something we need to figure out about the thresholds. I'll think a bit more if we can do something else (other than the two options I add)
[13:28:55] ack!
[13:29:07] tensorflow working on dse-k8s-worker1001 \o/
[13:29:12] with rocm 5.4
[13:33:45] 🎉
[13:33:47] nice Luca
[14:05:33] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[14:05:45] ouch
[14:06:59] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 1012 bytes in 1.076 second response time https://wikitech.wikimedia.org/wiki/ORES
[14:07:09] this is weird, didn't see anything horrible in the metrics
[14:13:55] Machine-Learning-Team, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (MSantos)
[14:32:12] (PS1) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES for fetching scores [extensions/ORES] - https://gerrit.wikimedia.org/r/908563 (https://phabricator.wikimedia.org/T332953)
[14:55:25] back from meetings
[14:55:38] so the icinga check that fired about ores runs on the icinga nodes, and it is
[14:55:41] $pluginpath/check_http -f follow -H "$host" -I "$host" -A "${user_agent}" -S -u "http://${urlhost}/v3/scores/fakewiki/${timestamp}/"
[14:55:58] and timestamp=$(/bin/date +%s)
[14:56:26] now no idea what happened, but ores didn't reply to fake wiki's request from the icinga nodes
[14:56:36] I don't see any culprit from the graphs
[14:57:16] the link is something like https://ores.wikimedia.org/v3/scores/fakewiki/1681397818
[14:57:51] I'd say that it may have been a temporary hiccup (no other alert fired), let's see if it rehappens
[15:02:22] what what is fakewiki?
[15:02:45] only reference I found here -> https://wikitech.wikimedia.org/wiki/ORES/Deployment
[15:03:02] unless it is just an example/placeholder so nevermind
[15:04:49] yeah I think it is some sort of a placeholder
[15:41:46] Machine-Learning-Team, Analytics-Radar, Data-Engineering-Icebox, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (elukey) Updated the docs, I was able to run tensorflow on dse-k8s-worker1001 successfully. The remaining issue is to add the proper users to the `ren...
[15:43:50] Machine-Learning-Team, Platform Team Workboards (Platform Engineering Reliability): Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576 (elukey) This is complete, I don't think that there are more streams to migrate over. The only nit to fix is that t...
[15:47:15] Machine-Learning-Team, Analytics-Radar, Data-Engineering-Icebox, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (elukey) a:elukey
[15:47:22] Machine-Learning-Team: Review and test the AMD GPU kubernetes plugin - https://phabricator.wikimedia.org/T333009 (elukey) a:elukey
[16:24:16] going afk folks, have a nice rest of the day!
[19:11:57] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854 (prabhat) @elukey Thanks for the clarification. Apart from batch calls to ORES that happen only few times a year, we also have realtime APIs where we call ORES. Basically,...
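
Editor's note on the Icinga check quoted at 14:55:41: for anyone who wants to reproduce the probe by hand, here is a rough Python sketch of what that check appears to do, namely request /v3/scores/fakewiki/<current unix timestamp>/ and give up after 10 seconds. It is not the actual plugin. The function names, the default User-Agent string, and the choice to report every HTTP error status as CRITICAL are assumptions made for illustration, and the URL uses the https form of the link pasted at 14:57:16.

```python
import time
import urllib.error
import urllib.request


def build_check_url(urlhost, timestamp=None):
    """Build the probe URL; the timestamp mirrors timestamp=$(/bin/date +%s)."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"https://{urlhost}/v3/scores/fakewiki/{timestamp}/"


def probe(urlhost, user_agent="example-probe/0.1", timeout=10.0):
    """Return (state, detail); the user_agent default is a hypothetical placeholder."""
    url = build_check_url(urlhost)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    started = time.monotonic()
    try:
        # urlopen follows redirects by default, roughly like "-f follow".
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            elapsed = time.monotonic() - started
            return "OK", f"HTTP {response.status} - {len(body)} bytes in {elapsed:.3f}s"
    except urllib.error.HTTPError as err:
        # Simplification (assumption): any HTTP error status is reported as CRITICAL.
        return "CRITICAL", f"HTTP {err.code}"
    except OSError as err:
        # Covers socket timeouts and connection failures (URLError is an OSError).
        return "CRITICAL", f"{type(err).__name__}: {err}"


# Example (performs a real network request, so it is left commented out):
# print(probe("ores.wikimedia.org"))
```

Running the probe a few times in a row from the same network location as the monitoring hosts would be one way to tell a one-off hiccup from a recurring timeout.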