[05:12:22] <wikibugs>	 Machine-Learning-Team: Deploy new fawiki articlequality model to ORES and LiftWing - https://phabricator.wikimedia.org/T319373 (kevinbazira) The new fawiki articlequality model has been uploaded successfully to production storage ([[ https://wikitech.wikimedia.org/wiki/Thanos | Thanos Swift ]]).  Here is its...
[07:36:10] <elukey>	 good morning :)
[07:39:17] <ilias>	 Hey all! Have a great week :)
[07:40:37] <elukey>	 hellooooo
[07:41:13] <elukey>	 ilias: when you have a moment, can you try to ssh to say stat1004.eqiad.wmnet to double check that everything works?
[07:58:45] <ilias>	 I can reach the host but it requires a passed..
[07:58:49] <ilias>	 *password
[08:05:24] <elukey>	 and you have something like https://wikitech.wikimedia.org/wiki/SRE/Production_access#Setting_up_your_SSH_config in your ssh config?
[08:05:38] <elukey>	 if so, can you post somewhere the result of ssh -vvv stat1004.eqiad.wmnet ?
[08:05:54] <elukey>	 https://phabricator.wikimedia.org/paste/ is handy
[08:07:34] <elukey>	 (I am wondering if the right ssh key is picked up, it shouldn't ask for a password)
[08:10:17] <ilias>	 It seems like the right key is picked up… Here is the debug result https://phabricator.wikimedia.org/P38186
[08:10:36] <ilias>	 same happened for other host I tried on Friday eve
[08:13:38] <elukey>	 mmm on the bastion I see 
[08:13:39] <elukey>	 Nov  7 08:08:17 bast6001 sshd[28880]: error: PAM: Authentication failure for isaranto
[08:13:48] <elukey>	 can you share the ssh config?
[08:16:25] <elukey>	 (it happens all the time, the ssh config is really sneaky)
[08:19:27] <ilias>	 https://phabricator.wikimedia.org/P38187
[08:19:57] <ilias>	 I managed to ssh to bast6001.wikimedia.org
[08:31:17] <elukey>	 ah!
[08:32:11] <elukey>	 ilias: just as a test, what about ores1001.eqiad.wmnet?
[08:33:25] <ilias>	 Same. Finds the prod key and requires a passwd
[08:33:32] <elukey>	 okok 
[08:37:44] <elukey>	 mmmm my config is different from the one listed on wikitech, somebody updated it
[08:39:53] <elukey>	 ilias: I re-checked the debug logging and I noticed this
[08:39:54] <elukey>	 add_identity_file: ignoring duplicate key ~/.ssh/prod
[08:40:59] <elukey>	 but you can ssh to bast6001, so the right key should be picked up
[08:42:17] <elukey>	 ilias: if you try to add the IdentityFile setting to https://phabricator.wikimedia.org/P38187$8, does it change anything?
[08:42:27] <elukey>	 like IdentityFile ~/.ssh/prod
[08:45:51] <ilias>	 :D I’m in! https://en.wikipedia.org/wiki/Facepalm (don’t know how to do facepalm with chars)
[08:46:19] <elukey>	 niceeeee
[08:46:22] <elukey>	 \o/
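[editor's note] The fix above was adding an explicit IdentityFile line to the ssh config. A minimal sketch of how one might check which key files ssh would offer to a given host, assuming an OpenSSH client new enough to support `ssh -G` (which prints the effective configuration). The helper names `identity_files` and `effective_identity_files` are hypothetical and not part of the conversation.

```python
import subprocess


def identity_files(ssh_g_output: str) -> list[str]:
    """Collect every 'identityfile' value from the text printed by `ssh -G <host>`."""
    files = []
    for line in ssh_g_output.splitlines():
        # Each line of `ssh -G` output is "<keyword> <value>".
        key, _, value = line.strip().partition(" ")
        if key.lower() == "identityfile" and value.strip():
            files.append(value.strip())
    return files


def effective_identity_files(host: str) -> list[str]:
    """Run `ssh -G host` (no connection is made) and return the configured key files."""
    result = subprocess.run(
        ["ssh", "-G", host], capture_output=True, text=True, check=True
    )
    return identity_files(result.stdout)
```

Calling `effective_identity_files("stat1004.eqiad.wmnet")` before and after the config change would show whether `~/.ssh/prod` is listed for that host.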
[08:47:46] <elukey>	 the stat100[4-8] nodes are owned by Data Engineering, and those are general-purpose compute nodes 
[08:48:01] <elukey>	 on 1005 and 1008 there is an AMD GPU as well, and a ton of ram
[08:48:26] <elukey>	 usually researchers/engineers/etc.. use those nodes to experiment and contact other production services
[08:48:38] <elukey>	 (home dirs are on a raid partition etc..)
[08:49:51] <ilias>	 Cool, thanks!
[08:52:00] <elukey>	 we have more or less the following hosts in production (I mean owned by our team)
[08:52:14] <elukey>	 - ores200[1-9] - ORES codfw cluster
[08:52:28] <elukey>	 - ores100[1-9] - ORES eqiad cluster
[08:52:40] <elukey>	 - ml-serve100X - Lift Wing eqiad
[08:52:48] <elukey>	 - ml-serve200X - Lift Wing codfw
[08:53:02] <elukey>	 - ml-staging200X - Lift Wing staging (codfw only)
[08:53:19] <elukey>	 (there are a few more but as an initial summary I think it is ok)
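[editor's note] The host list above uses a bracket shorthand such as ores200[1-9] or stat100[4-8]. A small sketch of how that shorthand could be expanded into individual host names; the function name `expand_hosts` is hypothetical, and names without a bracket range (for example ml-serve100X, where X is a placeholder) are returned unchanged.

```python
import re

# Matches one numeric range in brackets, e.g. "ores200[1-9]".
_RANGE = re.compile(r"^(?P<prefix>.*)\[(?P<lo>\d+)-(?P<hi>\d+)\](?P<suffix>.*)$")


def expand_hosts(pattern: str) -> list[str]:
    """Expand 'name[lo-hi]' into a list of names; return [pattern] if no range is present."""
    match = _RANGE.match(pattern)
    if match is None:
        return [pattern]
    lo, hi = int(match["lo"]), int(match["hi"])
    if lo > hi:
        raise ValueError(f"empty range in {pattern!r}")
    # Keep the digit width of the lower bound so "[01-03]" would stay zero-padded.
    width = len(match["lo"])
    return [
        f"{match['prefix']}{n:0{width}d}{match['suffix']}"
        for n in range(lo, hi + 1)
    ]
```

For instance, `expand_hosts("stat100[4-8]")` yields the five stat hosts mentioned earlier in the log.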
[08:53:45] <elukey>	 eqiad and codfw (Virginia and Dallas) are our main data centers, namely where we run services
[08:54:07] <elukey>	 we tend to have active/active ones, namely both dcs can serve traffic (so if one fails, the other one can take over etc..)
[08:55:02] <elukey>	 the other DCs that you see mentioned (drmrs - Marseille, ulsfo - San Francisco, esams - Amsterdam, eqsin - Singapore) are only serving our CDN 
[08:55:56] <elukey>	 for example - the bastion that you are connecting to is in Marseille, but in that DC there is no app service deployed, only a few things (CDN with ATS and Varnish, DNS, bastion, etc..)
[08:56:32] <elukey>	 We are currently serving ML traffic from both ORES clusters, and in the bright future we'd like to decom them in favor of Lift Wing
[08:59:44] <ilias>	 nice!
[09:30:32] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (elukey) The model servers (revscoring-related) that make 3 API calls during preprocess (edit and draft quality) have most of the issues with latency, meanwhile the rest work really nic...
[10:50:58] <moritzm>	 FYI, I'm switching ml-etcd1002 to DRBD for a bit, needed for a VM migration to drain a virt node. latencies might go up a bit
[10:52:47] <klausman>	 ack!
[11:28:03] <moritzm>	 ml-etcd1002 has been restored to "plain" disk storage
[11:28:32] <klausman>	 merci
[11:39:55] * elukey lunch
[12:35:57] * klausman lunch, too
[15:17:17] <isaranto>	 are there any standards or preferred way in the team for managing python (virtual) environments?
[15:19:49] <elukey>	 isaranto: any specific use case in mind? 
[15:20:36] <isaranto>	 I mean for local purposes
[15:23:05] <elukey>	 ah nono nothing really established, when working on stat100x DE suggests to use conda
[15:23:13] <elukey>	 there is some documentation on wikitech
[15:23:57] <isaranto>	 Ok, thanks!
[15:25:18] <elukey>	 isaranto: there is a data-engineering channel on slack if you want to join
[15:26:07] <isaranto>	 ack!
[17:51:07] <elukey>	 going afk folks, see you tomorrow :)