[09:39:05] elukey: Are you able to see the results on this dashboard? https://logstash.wikimedia.org/app/dashboards#/view/ORES?_g=h@e78830b&_a=h@69a558e
[09:40:01] because I only see "no results found"
[09:42:42] aiko: ah weird!
[09:43:21] https://logstash.wikimedia.org/app/dashboards#/view/ba190230-deb8-11e8-99b8-7fba019e77c2
[09:43:24] does this work?
[09:44:02] (changed it in the task too)
[09:49:53] elukey: yes, the ORES error log dashboard works. only the ORES dashboard doesn't work for me.
[09:52:07] elukey: another question.. when I ssh to ores1001.eqiad.wmnet, it asks for a password. I entered my wikimedia developer password but it doesn't work. Do you know which one I should put in?
[09:56:14] aiko: that should be the password of your ssh key
[09:56:35] if you have one, if not then ssh is probably not configured correctly
[10:01:56] no I don't have a password for my ssh key
[10:03:33] but there is no issue when I ssh to a stat machine like stat1007.eqiad.wmnet
[10:03:51] mmm can you copy your ssh config somewhere so I can check?
[10:03:54] cat .ssh/config
[10:04:06] https://phabricator.wikimedia.org/paste/
[10:04:18] also, ssh -vvv ores1001.eqiad.wmnet
[10:04:25] (the output I mean)
[10:05:11] ok!
[10:05:17] wait a sec
[10:09:05] https://phabricator.wikimedia.org/P29309
[10:09:13] https://phabricator.wikimedia.org/P29310
[10:09:54] ah ok now I get it
[10:10:43] try to add the following configs to your .ssh/config file
[10:10:45] Host *.eqiad.wmnet ProxyCommand ssh -a -W %h:%p bast3005.wikimedia.org
[10:10:46] Host *.codfw.wmnet ProxyCommand ssh -a -W %h:%p bast3005.wikimedia.org
[10:10:52] ah snap wrong formatting
[10:11:10] https://phabricator.wikimedia.org/P29311
[10:11:12] aiko: --^
[10:14:57] any modern ssh client should support the ProxyJump directives just fine
[10:15:19] elukey: I did it. but it still asks for a password
[10:20:02] elukey: looking at data.yaml I don't see 'aikochou' in any groups that would permit access to ores1001 (ores-admin or ops)
[10:20:49] taavi: I was about to check that, indeed
[10:21:11] I was convinced that Aiko was allowed, but I think it is only to k8s nodes
[10:22:56] aiko: so your config is fine, we need to create a task to modify the ORES admins :)
[10:23:12] I'll file it after lunch, somebody else will have to check ores1001 today
[10:23:14] no problem :)
[10:23:18] thanks taavi :)
[10:24:02] I see. thanks taavi and Luca :)
[10:29:35] * elukey lunch
[10:32:24] Machine-Learning-Team, Add-Link, Growth-Team, incubator.wikimedia.org: Integrate the model training and the deployment of "Add a link" to new Wikipedias exiting the Incubator - https://phabricator.wikimedia.org/T308146 (Trizek-WMF)
[12:41:00] elukey: of course I notice your patches only after writing half of the BGP change myself :-P
[13:07:19] klausman: ah snap sorry my bad :(
[13:07:30] I wanted to help, I should have synced first
[13:07:34] apologies
[13:08:58] no worries :)
[13:09:34] The upside is that a) you did a few things I would have missed and b) review was easier since I had all the requisite pages open (Netbox etc)
[13:16:44] ack :)
[13:16:54] for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/802097 do you want to sync with Kevin to deploy it?
[13:17:58] yeah, I totes forgot to hit return in that chat window :D
[13:21:24] Morning all!
[13:21:51] \o
[13:22:09] We deploying ORES?
[13:24:00] morning :)
[13:24:11] yep we are, we can do it in here or on a meet
[13:24:16] probably ok as well in here
[13:24:28] let's wait for Aiko and see what she prefers
[13:24:31] o/
[13:24:38] there you go :
[13:24:39] :)
[13:24:51] I prefer google meet
[13:25:23] sure
[13:25:33] I'm already on the VC from the calendar entry :)
[14:28:10] deployment completed, we see the new error message now
[14:28:10] https://ores.wikimedia.org/v3/scores/wikidatawiki/1334902099/itemquality
[14:28:25] so yeah as a follow up we probably need to tweak a little canary/etc...
[14:28:29] to be more meaningful
[14:29:30] good job aiko :)
[14:32:08] ORES, Machine-Learning-Team (Active Tasks), Patch-For-Review: revscoring feature extraction error for wikitext pages in Wikidata - https://phabricator.wikimedia.org/T302851 (elukey) Open→Resolved All deployed, new log message is popping up! https://ores.wikimedia.org/v3/scores/wikidatawiki/1...
[14:32:16] closed a couple of tasks
[14:33:15] now it is wayyy easier from the logstash dashboard to figure out what errors ores is returning
[14:33:58] \o/
[14:34:02] and it seems that we have bots that use the itemquality model incorrectly, trying to score revisions that are not meant to be scored
[14:34:48] back in a bit
[14:59:55] yeah so weird they keep sending requests
[15:10:40] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey) a:kevinbazira→elukey
[15:57:59] klausman: if you have a min, I staged a change in the private puppet repo for the ml-staging token changes
[15:58:03] since now we use the ml-serve umbrella
[15:58:08] lemme know if it is ok for you
[15:58:23] (just to avoid leaving things in a half broken state before my long weekend)
[15:58:34] sgtm
[16:01:45] committed, I am now running puppet :)
[16:02:16] I'll do some quick checks, after those you can go ahead with BGP etc.. if you want to tomorrow/friday
[16:03:33] ok done
[16:04:51] elukey@deploy1002:~$ ls /etc/helmfile-defaults/private/ml-serve_services/cfssl-issuer/
[16:04:54] ml-serve-codfw.yaml ml-serve-eqiad.yaml ml-staging-codfw.yaml
[16:05:08] and we also have ml-staging-codfw specific helmfile configs now etc..
[16:06:16] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (elukey) We moved the ml-staging configs under the ml-serve umbrella, since the staging cluster will be used to test serving things mostly....
[16:08:43] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Create the ml-serve-staging k8s cluster - https://phabricator.wikimedia.org/T302195 (elukey) Next steps: - get a final review of https://gerrit.wikimedia.org/r/c/operations/homer/public/+/802072 - merge and proceed with htt...
[16:09:53] klausman: we'd also probably need to add another LVS endpoint like inference-staging.svc.codfw.wmnet
[16:15:20] going afk folks, talk with you next week :)
[16:26:32] Bye elukey!
[18:00:35] Machine-Learning-Team, ORES, SRE: Stress test ORES on kubernetes (above 4.5k scores/second) - https://phabricator.wikimedia.org/T214054 (Krinkle)