[05:35:11] <_joe_> andrewbogott: puppetmaster1001:~$ puppet --version
[05:35:13] <_joe_> 5.5.22
[05:35:18] <_joe_> we're on 5.5
[05:49:08] right, but did they start spamming us with deprecation notices before the replacement was available? I'm used to a deprecation notice being a call to action, not just a warning of future action
[05:54:29] <_joe_> andrewbogott: I am just stating you should probably look at the docs for 5.5
[05:54:49] <_joe_> 6.x is the first clojure version IIRC
[06:16:17] o/
[06:34:13] fyi, transport between esams and eqiad is down, I opened a ticket with Lumen: https://phabricator.wikimedia.org/T317009
[06:34:59] <_joe_> that's not great
[06:37:07] _joe_: we now have last hope redundancy through drmrs if GTT (the normal redundant link) fails too, so no need to depool for now
[08:43:07] hi all, back from vacation, will mostly be catching up on emails today but feel free to ping if there is something to pop to the top of the list
[08:43:36] welcome back!
[08:48:17] thanks :)
[08:51:45] welcome back :-)
[08:59:50] Morning all. I'm planning to deploy calico to the dse-k8s cluster soon, as per: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Calico_node/controllers - Any reason to hold off or any other concerns?
[09:06:51] btullis: o/ do you have the patch for core routers ready?
[09:07:05] otherwise Calico pods will not be able to establish BGP sessions
[09:10:25] elukey: It's already deployed, I believe. Took a few goes and a final fix by topranks: https://gerrit.wikimedia.org/r/q/bug:T310174 - but I think that core switches and ToR in eqiad E-F are all ready.
[09:10:25] T310174: Configure routing for dse-k8s cluster - https://phabricator.wikimedia.org/T310174
[09:14:06] wonderful :)
[09:14:46] IIUC jayme is doing some work on Calico's charts but it should be ok to proceed in theory if the networking part is set up
[09:16:44] my calico work is prep for the k8s 1.23 upgrade only and I will pin the calico chart versions to the current latest in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/826268
[09:16:55] so no change there
[09:20:23] Thanks both. Proceeding to deploy then, if there are no other concerns.
[10:01:56] I'm getting a timeout on deploying calico-kube-controllers at ContainerCreating - investigating now.
[10:13:38] btullis: a helm error/timeout?
[10:14:38] btullis: I see, `kubectl get events -A | grep FailedCreatePodSandBox` shows some info
[10:16:53] elukey: Thanks, that's useful.
[10:20:19] btullis: IIUC helm rolled back all the pods right?
[10:20:55] Yes, it's marked as atomic, so it uninstalled the whole chart.
[10:22:03] it would be useful to inspect the pods when they are in the process of being created, to check logs etc..
[10:25:11] I can start another deploy now. Do you want to look at it together? As far as I can see, all of the calico-node and calico-typha pods are up and ready.
[10:31:34] Ah, possibly missing `profile::kubernetes::deployment_server_secrets::admin_services` entry for calico on this cluster in the private repo.
[10:32:45] ..but ml_serve doesn't exist there either.
[10:41:28] btullis: I tried with get pods -A and I don't see them, let's resync after lunch what do you think?
[10:41:36] (going afk now, I'll ping you in the afternoon)
[10:42:02] :+1 Thanks.
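A minimal sketch of how one might inspect pods stuck at ContainerCreating before helm's --atomic timeout rolls the release back, as suggested above. The kubeconfig path, namespace and label selector are assumptions, not the actual dse-k8s values.

    # Assumed kubeconfig path for the dse-k8s cluster; adjust as needed
    export KUBECONFIG=/etc/kubernetes/admin-dse-k8s-eqiad.config

    # Surface sandbox-creation failures (equivalent to the grep shown in the chat)
    kubectl get events -A --field-selector reason=FailedCreatePodSandBox

    # While the deploy is still in progress, describe the pending pod to see
    # why the sandbox cannot be created (CNI, secrets, image pull, ...)
    kubectl -n kube-system get pods -l k8s-app=calico-kube-controllers
    kubectl -n kube-system describe pod -l k8s-app=calico-kube-controllers

    # Container logs, if the pod got far enough to start a container
    kubectl -n kube-system logs -l k8s-app=calico-kube-controllers --all-containers

Running these in a second terminal during the helm deploy avoids losing the evidence when the atomic rollback uninstalls the chart.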
[11:46:24] Can I get confirmation this https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Removing_old_appservers_from_production_(decom) is the right procedure for decom?
[11:57:49] claime: there is a more detailed procedure for all servers, but some of the steps for just app servers should be there
[11:58:46] ok, thanks jynus
[11:58:53] claime: check also https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission (valid for all servers, but won't have service specific instructions)
[12:00:03] "System services must be confirmed to be offline. If server is part of a service pool, ensure it is set to false or removed completely from pybal/LVS" are the things that will be missing from the general instructions
[12:02:22] Thanks for all the info :)
[12:15:42] moritzm: am I OK to merge 'Muehlenhoff: puppet_compiler: Assign SPDX headers (53e5b747f7)' ?
[12:16:14] (it looks benign but seems best to check :) )
[12:26:41] Emperor: oh, sorry, yes.
[12:27:48] NP, doing so now
[12:46:52] thx
[12:55:47] starting in 10m I'm planning to disable Puppet in codfw and the edges for 10-15 mins for some puppetdb maintenance
[12:55:59] let me know if that's a bad time and I can do it later instead
[12:57:04] Ahoy there! If I wanted some "user space" within the cluster somewhere to temporarily store a 1.1TB file before downloading it myself, would there be any place I have access to with such scratch space?
[12:59:09] addshore: Which cluster are you referring to?
[12:59:25] the stat* hosts have their /home on a /srv partition which typically has a few TBs available, those are probably your best bet if it's just a few days?
[12:59:40] as close to production (specifically wdqs1009.eqiad) as possible
[12:59:57] moritzm: Hopefully less than a few days *looks at the stat machines*
[13:01:27] aaah yes, stat1008 has 2.7T free
[13:01:53] Yep, or stat1004
[13:01:59] *checks stat1004*
[13:02:28] 4.3T free, I might take that one! :P
[13:05:29] hmmmm, how about the easiest way to move large files between hosts such as wdqs1009 and stat1004?
[13:07:54] There is this, but you'll need root to do it: https://wikitech.wikimedia.org/wiki/Transfer.py (I'm happy to help run it, if you like)
[13:08:28] addshore: we use some cookbooks for this: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/wdqs/data-transfer.py
[13:09:38] also either rate-limit the transfer or check with netops to prevent alerts for high bandwidth firing
[13:09:56] the allowed port should be 9876
[13:10:19] *clicks and reads the 2 links*
[13:11:00] the wdqs data-transfer might not work tho as it needs two wdqs machines on both ends
[13:11:11] okay great, so sounds like it'll be best to have someone do the transfer for me using this transfer.py thingy
[13:11:52] yes as long as you stop blazegraph manually before transferring (and restart it when the transfer is done)
[13:13:50] "Ensure legal html en.wb" seems to have been broken again, that is such a small patch (e.g. last one was https://gerrit.wikimedia.org/r/c/operations/puppet/+/820647 ) that anyone could take over, if not I will have a look at it later
[13:17:32] dcausse: ack!
[13:17:43] my laptop crashed at the perfect time in that conversation.... back now...
[13:18:14] so 1) turn off blazegraph 2) ask someone with root to copy wdqs1009:/srv/wdqs/wikidata.jnl -> stat1004:/home/addshore/wikidata.jnl 3) turn on blazegraph when done
[13:20:13] Puppet has been re-enabled on codfw/edges
[13:20:27] addshore: that's it
[13:21:55] btullis: if I were to poke you to run this then, what extra info would you like? :)
[13:23:51] addshore: I think that's all the info I need. You can assure me that blazegraph is stopped, as dcausse points out? I can ping you when it's finished so that you can restart it.
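A rough sketch of the three-step sequence agreed above, assuming step 2 is run by someone with root. The transfer.py invocation is illustrative only; the exact arguments and the host to run it from are documented on the wikitech page linked earlier.

    # 1) stop blazegraph on the source host (needs sudo on wdqs1009)
    sudo systemctl stop wdqs-blazegraph.service

    # 2) copy the journal to the stat host (root only); port 9876 is the one
    #    mentioned above as allowed for such transfers
    transfer.py wdqs1009.eqiad.wmnet:/srv/wdqs/wikidata.jnl \
        stat1004.eqiad.wmnet:/home/addshore/

    # 3) restart blazegraph once the copy has finished
    sudo systemctl start wdqs-blazegraph.service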
[13:26:46] let me go and stop the thing
[13:26:48] :)
[13:29:11] dcausse: the docs I am reading seem out of date, I see `sudo service wdqs-blazegraph` but there is no such service?
[13:29:59] addshore: sudo systemctl stop wdqs-blazegraph.service
[13:31:36] dcausse: thanks
[13:31:44] btullis: all stopped
[13:33:12] addshore: I see blazegraph running there
[13:33:32] Hmmm `Sep 05 13:30:15 wdqs1009 systemd[1]: Stopped Query Service - Blazegraph - wdqs-blazegraph.`
[13:33:59] I see active (running) since Mon 2022-09-05 13:31:03 UTC; 2min 45s ago
[13:34:26] oh, yes.... btullis ^^ fyi it's running again
[13:34:47] dcausse: I certainly didn't restart it, and did see it stopped :/
[13:35:02] addshore: ack. Maybe puppet restarted it?
[13:35:41] addshore: looking
[13:35:58] the cookbook does disable puppet so perhaps it's puppet?
[13:36:28] stopped it again
[13:36:33] I can only imagine I can't disable puppet, so I might have to get other folks to run all of the commands for me :)
[13:37:58] we can perhaps drop the /srv/wdqs/data_loaded flag that should prevent it from starting
[13:38:43] well no actually ...
[13:41:05] I have temporarily disabled puppet on wdqs1009
[13:41:47] ok trying again
[13:43:32] sigh... I think that stopping causes: wdqs-blazegraph.service: Main process exited, code=exited, status=143/n/a
[13:43:43] and then systemctl attempts to restart it I suppose
[13:47:45] disabled the service and looks like it's not starting again
[13:48:22] btullis: I think you can start the transfer
[13:50:27] dcausse: Ack, started now.
[13:55:17] thanks both!
[14:25:44] addshore: The copy is about 20% complete.
[14:25:54] amazing!
[15:33:44] XioNoX, topranks hey, what do I need to setup/configure to be able to access cloudvps public ips from a bare metal in the private prod network? (only direction: prod -> public cloud vps)
[15:34:35] from https://wikitech.wikimedia.org/wiki/Wikimedia_network_guidelines#/media/File:New_service_IP_flow_chart.png I'd say to use squid?
[15:34:55] (usecase is fetching prometheus data from a public prometheus endpoint exposed from the cloud vps network)
[15:35:02] (using https)
[15:35:16] dcaro: yep: https://wikitech.wikimedia.org/wiki/HTTP_proxy
[15:36:13] 👍 thanks a lot! will look how that integrates with prometheus xd
[15:36:56] dcaro: not directly related but that might help: https://phabricator.wikimedia.org/T303803
[15:37:10] nice!
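A minimal sketch of checking such an endpoint through the HTTP proxy from a prod host. The endpoint URL is a made-up example, and the proxy port is an assumption (check the HTTP_proxy wikitech page linked above).

    # Verify the cloud VPS prometheus endpoint is reachable via the proxy
    https_proxy=http://webproxy.eqiad.wmnet:8080 \
        curl -sI https://prometheus.example.wmcloud.org/metrics

For the Prometheus side, scrape jobs can be pointed at the same proxy via the `proxy_url` setting in the scrape config, so no host-wide proxy environment is needed.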
[15:48:29] volans: "rate-limit the transfer or check with netops to prevent alerts for high bandwidth firing" Who is best to chat to? and where? :) (For when I then get this file off the stat machine over the internet)
[15:49:30] addshore: that was mainly for the internal transfer, that could potentially saturate the host uplink
[15:50:23] but ofc I don't know to where you're transferring it, maybe a place with much more bandwidth than us :D
[15:51:15] First port of call would be google cloud so that I can go ahead and delete it from the stat machine, and I expect that side of the pipe might be larger than the wikimedia one
[15:52:14] For the record, I added a note to my favourite IRC channel #wikimedia-sre-foundations-netops-notify-transfer and asked about rate limiting the transfer in #wikimedia-netops - The copy to stat1004 is about 70% complete.
[16:03:02] addshore: happy to chat, in a meeting right now though. What's the context?
[16:07:16] XioNoX: transferring a 1.1TB file (wdqs index) out of production and into google cloud, and wanting to know if I should try to apply any rate limits etc
[16:08:37] addshore: from where?
[16:08:54] is it like on a single box?
[16:11:58] yeah, on a single box (stat1004)
[16:13:43] thinking about it, I will also have to go via webproxy.eqiad.wmnet, as I was going to curl it out to the google storage API
[16:14:52] But that also means I think I can just pass a `--limit-rate` to curl
[16:15:31] addshore: that host only has a 1G uplink, so it's going to work as a natural rate limiter
[16:15:56] the host's nic will saturate but no risk of issues on the infra
[16:16:59] Okay, lovely, I'll add a limit-rate of a little under 1G then as well so as not to saturate the host uplink too
[16:17:03] thanks!
[16:17:25] sounds good :)
[17:03:17] btullis: looks like it is approaching completion!
[17:26:06] addshore: Yep, copy completed. I've re-enabled puppet on wdqs1009 and done `sudo chown addshore:wikidev /home/addshore/wikidata.jnl` on stat1004.
[17:26:54] I think blazegraph will probably be restarted by puppet on wdqs1009 - given what happened earlier.
[18:36:58] yup, Active: active (running) since Mon 2022-09-05 17:28:14 UTC; 1h 8min ago
[18:37:08] thanks all!
[18:46:03] `curl: option --data-binary: out of memory` heh, so that's what happens when you try to upload a 1.1TB file with curl
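A sketch of what the streaming variant might look like: `--data-binary @file` buffers the whole file in memory, whereas `-T`/`--upload-file` streams it from disk. The destination URL is a placeholder, the proxy port is an assumption, and ~100 MB/s keeps the upload a little under the 1G uplink as discussed above.

    # Stream the journal to object storage instead of buffering it in memory
    curl -x http://webproxy.eqiad.wmnet:8080 \
         --limit-rate 100M \
         -T /home/addshore/wikidata.jnl \
         "https://storage.googleapis.com/<bucket>/wikidata.jnl"

`-T` issues a PUT by default, so this shape fits object-storage style APIs; authentication (signed URL or token header) is omitted here.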