[08:13:20] Hello! Was there any update (just before March 22nd 18:15 UTC) which affected mono? According to strace the program hangs in some system call "futex". Program was working fine since end of August 2020, no change since?
[08:40:18] Hmm … strange. Problem vanished in some magic way. No idea why
[14:10:59] I'm trying to use the openstack client from cloudcontrol for the search project and it's prompting me for a pw, is that my wikitech creds or what should it be?
[14:12:39] inflatador: we encourage you to use horizon.wikimedia.org instead
[14:13:14] inflatador: the password available on cloudcontrol is mostly for nuke-level operations
[14:13:47] I don't want to use it as the operator for the entire infra, just for my specific project
[14:14:28] that's not something we usually support
[14:14:42] that being said, you could use `wmcs-openstack --os-project-id=whatever`
[14:14:44] inflatador: you should be able to use the API with your user: go to horizon and on your user tab (top right) there's an "OpenStack RC file" link; download that and source it before using the openstack CLI
[14:15:10] dcaro I tried that, but it's prompting for a pw and I don't know what it should be. I have 2FA set up for wikitech if that helps
[14:16:06] arturo thanks, I'll take a look.
I need to be able to test some elasticsearch stuff and it would be helpful to be able to use the API
[14:16:16] to stand up/break down test envs quickly
[14:16:30] * arturo nods
[14:16:47] I used to be an operator for Rackspace's openstack cloud, so if I can do anything to help lmk
[14:16:50] there are some cookbooks already doing some stuff on openstack if you prefer to automate it that way
[14:17:29] we're hiring :-P
[14:17:32] yeah, I'll take a look... I'm more of a terraform/ansible dude though ;)
[14:17:50] arturo I already work for the Foundation, I'm an SRE on the Search team ;p
[14:17:58] I know, that's why the joke
[14:18:40] inflatador: you can take a look at https://gerrit.wikimedia.org/r/admin/repos/cloud/wmcs-ansible,branches, but it's essentially unmaintained
[14:20:28] oh yeah, I was just thinking of using tf-openstack to build the servers and ansible for the instance-level config
[14:25:56] !log tools.stewardbots ./SULWatcher/manage.sh restart # all three disconnected
[14:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[14:29:31] hrm, that took longer than expected...
[14:29:46] kubectl get pods seems to be slow rn, at least on stewardbots
[14:31:54] inflatador: not sure if you came across it already, but if you are looking for production-like "environments" check out https://wikitech.wikimedia.org/wiki/Puppet/Pontoon too (full disclosure, I wrote it)
[14:33:25] godog Yes, I've been meaning to get deeper into it, just haven't had time (rough notes at https://wikitech.wikimedia.org/wiki/User:Bking/Notes/Pontoon). I need a less production-like environment to start with ;)
[14:36:10] inflatador: ack, thank you, the notes are helpful; haven't tried a bootstrap myself in a little bit now
[15:12:07] ebernhardson: Can you respond on https://phabricator.wikimedia.org/T304581#7802342 sometime soon? thanks!
[15:15:02] andrewbogott: i'm asking in #wikimedia-search :) I'm familiar with what it is but i don't really have much to do with the instance
[15:16:14] ok -- feel free to cc whoever on the ticket
[15:18:50] andrewbogott just curious, is it possible to migrate the VM to a different host?
[15:19:35] inflatador: those wdqs VMs are using local storage and filling their whole hypervisors. It's basically a team-specific private server.
[15:19:55] Oh yeah, I see that in puppet now, sorry
[15:19:56] They're huge -- migrating them would involve much more downtime than just taking the downtime to upgrade the hypervisor in place
[15:20:44] OK, let me talk that over with my team, hopefully we can get rid of these instances
[15:21:57] inflatador: to be clear, we're talking about 'wcqs-beta-01' right?
[15:22:01] Just the one instance?
[15:22:43] I'd like to trash those canary instances too, but I have to run everything thru my team
[15:23:37] the canaries are openstack overhead, they need to be there
[15:23:50] Here is the initial ticket that procured that hardware: https://phabricator.wikimedia.org/T221631
[15:24:10] If that use case doesn't exist anymore then that would be great to know, we can decom or repurpose those servers.
[15:24:48] ah OK, we probably do need these then, as we haven't decided on a blazegraph alternative yet
[15:25:03] gehel might have more context for what's happening here
[15:26:08] the use case is changing, but still exists. It's mostly to install alternatives to blazegraph and evaluate whether they can fit our needs
[15:26:23] (previously it was to install blazegraph and verify things with full data outside of prod)
[15:26:51] Will definitely talk to him, what's your approximate level of urgency for upgrading these hyps?
[15:28:43] I'd like to do it next week. Due to an unrelated hardware bug I need to do several of these servers in a batch to avoid sending the dc-ops people on too many repeat missions.
[15:29:10] But, again, just having a couple hours of downtime is an easy option for me if that doesn't ruin your work
[15:37:52] yeah, sounds like option 2 as you described is probably best. Will have an answer for you by EoD my time (CDT, UTC-5) via the phab ticket. Thanks for the heads-up
[15:47:00] thanks inflatador
[19:42:09] PAWS has been migrated to a new UI very recently. Is this still work in progress? I lost a lot of changes in two Jupyter notebooks today which for some reason were automatically reverted to a much older version. I have already been using the new UI for quite a while, but this never happened to me. Any ideas?
[19:56:54] Rook: ^ any ideas about why MisterSynergy may be having issues with notebook state?
[19:57:51] No immediate ideas. Which notebooks?
[19:58:44] some more background perhaps: I worked on two notebooks around noon this day (Europe time); left the machine for a couple of hours with PAWS in an open browser tab; when I returned, it had somehow loaded an older version of the notebook -- apparently from this morning
[19:59:26] https://public.paws.wmcloud.org/User:MisterSynergy/misc/2020%2012%20worldrowing%20links/get_worldrowing_mapping_table.ipynb
[19:59:53] I have been refactoring the code a lot in the meantime, so it is disappointing now that everything seems gone
[20:00:52] Another one which should be longer than it is right now: https://public.paws.wmcloud.org/User:MisterSynergy/demos/use%20symbolic%20links/dump%20with%20symlinks.ipynb
[20:05:30] can I get a quick reminder where I can find the fingerprints of the SSH hosts?
[20:05:59] Cyberpower678: which ssh hosts?
[20:06:12] bastion.wmcloud.org
[20:06:25] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/primary.bastion.wmcloud.org
[20:07:14] Thanks
[20:07:24] np
[20:07:38] MisterSynergy: both notebooks loaded up a version from this morning?
[20:08:21] yes, roughly.
the second one is pretty new and shows a very early version; the first one contains data from an execution yesterday
[20:16:00] chicocvenancio: any thoughts? andrewbogott We didn't touch NFS today, right?
[20:16:47] Rook: not on paws.
[20:23:36] That spontaneous rollback makes me very nervous but I can't think of anything I've done today that would cause it.
[20:23:56] Unless we have split-brain in the paws cluster and host migration caused us to flip between hosts somehow
[20:25:35] maybe I should mention again that I have been working with the new Jupyter UI for a couple of months and never experienced this before
[20:34:25] There is a "Revert Notebook to Checkpoint" function. I haven't been able to recreate it, but that could explain why it reverted to just this morning. The checkpoint could have been from this morning. I've been tinkering but have yet to recreate an automatic revert. It is possible that the older jupyterlab was keeping a checkpoint around and somehow that is causing the issue for you (MisterSynergy), or lots of people are seeing the
[20:34:25] same and haven't reported it
[20:35:24] My guess comes about from not seeing any updates between today and when we updated (on the 21st); I didn't look all over the filesystem though. MisterSynergy did you do any editing this week outside of today?
[20:36:48] yes, I edited yesterday for sure and potentially the days before as well. I am using PAWS pretty regularly
[20:37:06] There are indeed checkpoint files from the affected notebooks from this morning
[20:37:34] they are sitting in a hidden .ipynb_checkpoints folder, as I am learning right now
[20:38:33] It would also be nice to know if this is happening to many different paws users or just on these exact two notebooks
[20:46:37] there is an "Autosave Documents" option in the settings; it has been activated all the time here. Is it supposed to update the checkpoints?
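For readers following along: the hidden `.ipynb_checkpoints` folder mentioned above is where Jupyter's default file-based checkpoint manager keeps a single checkpoint per notebook, next to the notebook itself. A small sketch of how to locate that checkpoint and check whether it lags the saved file (the helper names are illustrative, not Jupyter API):

```python
from pathlib import Path

def default_checkpoint(notebook):
    """Return the default Jupyter checkpoint path for a notebook.

    Jupyter's file checkpoint manager stores one checkpoint per notebook
    as <dir>/.ipynb_checkpoints/<name>-checkpoint.ipynb.
    """
    nb = Path(notebook)
    return nb.parent / ".ipynb_checkpoints" / f"{nb.stem}-checkpoint{nb.suffix}"

def checkpoint_is_stale(notebook):
    """True if the checkpoint exists and is older on disk than the notebook."""
    nb, cp = Path(notebook), default_checkpoint(notebook)
    return cp.exists() and cp.stat().st_mtime < nb.stat().st_mtime
```

Comparing the two timestamps is essentially what the participants do by hand here: a checkpoint from the morning sitting next to a newer (or, in the failure case, lost) main file.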
It seems that it does not do so; checkpoints are apparently only updated when I explicitly save the document
[20:51:02] In theory autosave changes the file on disk, not the checkpoints
[20:54:30] Maybe some lag between caching in the container and writing to disk? I see the checkpoint and the main file are the same, though I've updated them in the browser. When the container stops it manages to write to disk, but it has a graceful shutdown. Perhaps MisterSynergy's container really crashed and thus didn't have time to write to disk, thus the checkpoint was the same as the main file?
[21:18:00] okay, seems like I can try to recover the lost code. In my PAWS account, there is a ~/.ipython/profile_default/history.sqlite file that contains the code of all executed cells --- including the lost ones. I "just" need to stitch everything together again :-/
[21:35:57] Let us know if it happens again or you see anything else. My current guess on what is happening isn't very satisfying.
[21:37:56] Will do. I did indeed just recover all my lost code from the sqlite database
[21:38:07] This helped me: https://medium.com/flatiron-engineering/recovering-from-a-jupyter-disaster-27401677aeeb
[21:38:26] Learning Jupyter internals the hard way :-)
[22:06:32] Yeah, a non-graceful shutdown could be the source here.
[23:58:19] hello, I have questions about dumps servers that I would have asked Brooke in the past. What is the best way to ask them now?
[23:58:42] like.. is it ok to copy about 400MB to them.. one time.. given space issues in the past
[23:59:46] I see 16T available or something.. so maybe it is "why do you even ask"
[23:59:59] just back in June 2021 or so.. it was "please don't.."
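The recovery trick used in the conversation above (pulling executed cell sources back out of `~/.ipython/profile_default/history.sqlite`) can be sketched as follows. This assumes the `history` table layout IPython uses (`session`, `line`, `source`, `source_raw`); the function name is illustrative:

```python
import sqlite3

def dump_history(db_path):
    """Return the source of every executed cell recorded in an IPython
    history.sqlite file, in execution order (by session, then input number).

    Assumes IPython's `history` table layout: session, line, source, source_raw.
    """
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT source FROM history ORDER BY session, line"
        )
        return [source for (source,) in rows]

# Illustrative usage against a copy of the real file:
# print("\n\n".join(dump_history("history.sqlite")))
```

Working on a copy of the database is safest, since IPython may still have the live file open; from there it is the "stitch everything together" step described above.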