[05:35:11] <_joe_> andrewbogott: puppetmaster1001:~$ puppet --version
[05:35:13] <_joe_> 5.5.22
[05:35:18] <_joe_> we're on 5.5
[05:49:08] right, but did they start spamming us with deprecation notices before the replacement was available? I'm used to a deprecation notice being a call to action, not just a warning of future action
[05:54:29] <_joe_> andrewbogott: I am just stating you should probably look at the docs for 5.5
[05:54:49] <_joe_> 6.x is the first clojure version IIRC
[06:16:17] o/
[06:34:13] fyi, transport between esams and eqiad is down, I opened a ticket with Lumen: https://phabricator.wikimedia.org/T317009
[06:34:59] <_joe_> that's not great
[06:37:07] _joe_: we now have last hope redundancy through drmrs if GTT (the normal redundant link) fails too, so no need to depool for now
[08:43:07] hi all, back from vacation, will mostly be catching up on emails today but feel free to ping if there is something to pop to the top of the list
[08:43:36] welcome back!
[08:48:17] thanks :)
[08:51:45] welcome back :-)
[08:59:50] Morning all. I'm planning to deploy calico to the dse-k8s cluster soon, as per: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Calico_node/controllers - Any reason to hold off or any other concerns?
[09:06:51] btullis: o/ do you have the patch for core routers ready?
[09:07:05] otherwise Calico pods will not be able to establish BGP sessions
[09:10:25] elukey: It's already deployed, I believe. Took a few goes and a final fix by topranks: https://gerrit.wikimedia.org/r/q/bug:T310174 - but I think that core switches and ToR in eqiad E-F are all ready.
[09:10:25] T310174: Configure routing for dse-k8s cluster - https://phabricator.wikimedia.org/T310174
[09:14:06] wonderful :)
[09:14:46] IIUC jayme is doing some work on Calico's charts but it should be ok to proceed in theory if the networking part is set up
[09:16:44] my calico work is prep for the k8s 1.23 upgrade only and I will pin the calico chart versions to the current latest in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/826268
[09:16:55] so no change there
[09:20:23] Thanks both. Proceeding to deploy then, if there are no other concerns.
[10:01:56] I'm getting a timeout on deploying calico-kube-controllers at ContainerCreating - investigating now.
[10:13:38] btullis: a helm error/timeout?
[10:14:38] btullis: I see, `kubectl get events -A | grep FailedCreatePodSandBox` shows some info
[10:16:53] elukey: Thanks, that's useful.
[10:20:19] btullis: IIUC helm rolled back all the pods right?
[10:20:55] Yes, it's marked as atomic, so it uninstalled the whole chart.
[10:22:03] it would be useful to inspect the pods when they are in the process of being created, to check logs etc..
[10:25:11] I can start another deploy now. Do you want to look at it together? As far as I can see, all of the calico-node and calico-typha pods are up and ready.
[10:31:34] Ah, possibly missing `profile::kubernetes::deployment_server_secrets::admin_services` entry for calico on this cluster in the private repo.
[10:32:45] ..but ml_serve doesn't exist there either.
[10:41:28] btullis: I tried with get pods -A and I don't see them, let's resync after lunch what do you think?
[10:41:36] (going afk now, I'll ping you in the afternoon)
[10:42:02] :+1 Thanks.
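A minimal sketch of how one might inspect pods stuck at ContainerCreating before helm's --atomic timeout rolls the release back, as suggested above. The kubeconfig path, namespace and label selector are assumptions, not the actual dse-k8s values.

    # Assumed kubeconfig path for the dse-k8s cluster; adjust as needed
    export KUBECONFIG=/etc/kubernetes/admin-dse-k8s-eqiad.config

    # Surface sandbox-creation failures (equivalent to the grep shown in the chat)
    kubectl get events -A --field-selector reason=FailedCreatePodSandBox

    # While the deploy is still in progress, describe the pending pod to see
    # why the sandbox cannot be created (CNI, secrets, image pull, ...)
    kubectl -n kube-system get pods -l k8s-app=calico-kube-controllers
    kubectl -n kube-system describe pod -l k8s-app=calico-kube-controllers

    # Container logs, if the pod got far enough to start a container
    kubectl -n kube-system logs -l k8s-app=calico-kube-controllers --all-containers

Running these in a second terminal during the helm deploy avoids losing the evidence when the atomic rollback uninstalls the chart.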
[11:46:24] Can I get confirmation this https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Removing_old_appservers_from_production_(decom) is the right procedure for decom?
[11:57:49] claime: there is a more detailed procedure for all servers, but some of the steps for just app servers should be there
[11:58:46] ok, thanks jynus
[11:58:53] claime: check also https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission (valid for all servers, but won't have service specific instructions)
[12:00:03] "System services must be confirmed to be offline. If server is part of a service pool, ensure it is set to false or removed completely from pybal/LVS" are the things that will be missing from the general instructions
[12:02:22] Thanks for all the info :)
[12:15:42] moritzm: am I OK to merge 'Muehlenhoff: puppet_compiler: Assign SPDX headers (53e5b747f7)' ?
[12:16:14] (it looks benign but seems best to check :) )
[12:26:41] Emperor: oh, sorry, yes.
[12:27:48] NP, doing so now
[12:46:52] thx
[12:55:47] starting in 10m I'm planning to disable Puppet in codfw and the edges for 10-15 mins for some puppetdb maintenance
[12:55:59] let me know if that's a bad time and I can do it later instead
[12:57:04] Ahoy there! If I wanted some "user space" within the cluster somewhere to temporarily store a 1.1TB file before downloading it myself, would there be any place I have access to with such scratch space?
[12:59:09] addshore: Which cluster are you referring to?
[12:59:25] the stat* hosts have their /home on a /srv partition which typically has a few TBs available, those are probably your best bet if it's just a few days?
[12:59:40] as close to production (specifically wdqs1009.eqiad) as possible
[12:59:57] moritzm: Hopefully less than a few days *looks at the stat machines*
[13:01:27] aaah yes, stat1008 has 2.7T free
[13:01:53] Yep, or stat1004
[13:01:59] *checks stat1004*
[13:02:28] 4.3T free, I might take that one! :P
[13:05:29] hmmmm, how about the easiest way to move large files between hosts such as wdqs1009 and stat1004?
[13:07:54] There is this, but you'll need root to do it: https://wikitech.wikimedia.org/wiki/Transfer.py (I'm happy to help run it, if you like)
[13:08:28] addshore: we use some cookbooks for this: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/wdqs/data-transfer.py
[13:09:38] also either rate-limit the transfer or check with netops to prevent alerts for high bandwidth firing
[13:09:56] the allowed port should be 9876
[13:10:19] *clicks and reads the 2 links*
[13:11:00] the wdqs data-transfer might not work tho as it needs two wdqs machines on both ends
[13:11:11] okay great, so sounds like it'll be best to have someone do the transfer for me using this transfer.py thingy
[13:11:52] yes as long as you stop blazegraph manually before transferring (and restart it when the transfer is done)
[13:13:50] "Ensure legal html en.wb" seems to have been broken again, that is such a small patch (e.g. last one was https://gerrit.wikimedia.org/r/c/operations/puppet/+/820647 ) that anyone could take over, if not I will have a look at it later
[13:17:32] dcausse: ack!
[13:17:43] my laptop crashed at the perfect time in that conversation.... back now...
[13:18:14] so 1) turn off blazegraph 2) ask someone with root to copy wdqs1009:/srv/wdqs/wikidata.jnl -> stat1004:/home/addshore/wikidata.jnl 3) turn on blazegraph when done
[13:20:13] Puppet has been re-enabled on codfw/edges
[13:20:27] addshore: that's it
[13:21:55] btullis: if I were to poke you to run this then, what extra info would you like? :)
[13:23:51] addshore: I think that's all the info I need. You can assure me that blazegraph is stopped, as dcausse points out? I can ping you when it's finished so that you can restart it.
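A rough sketch of the three-step sequence agreed above, assuming step 2 is run by someone with root. The transfer.py invocation is illustrative only; the exact arguments and the host to run it from are documented on the wikitech page linked earlier.

    # 1) stop blazegraph on the source host (needs sudo on wdqs1009)
    sudo systemctl stop wdqs-blazegraph.service

    # 2) copy the journal to the stat host (root only); port 9876 is the one
    #    mentioned above as allowed for such transfers
    transfer.py wdqs1009.eqiad.wmnet:/srv/wdqs/wikidata.jnl \
        stat1004.eqiad.wmnet:/home/addshore/

    # 3) restart blazegraph once the copy has finished
    sudo systemctl start wdqs-blazegraph.service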
[13:26:46] let me go and stop the thing
[13:26:48] :)
[13:29:11] dcausse: the docs I am reading seem out of date, I see `sudo service wdqs-blazegraph` but there is no such service?
[13:29:59] addshore: sudo systemctl stop wdqs-blazegraph.service
[13:31:36] dcausse: thanks
[13:31:44] btullis: all stopped
[13:33:12] addshore: I see blazegraph running there
[13:33:32] Hmmm `Sep 05 13:30:15 wdqs1009 systemd[1]: Stopped Query Service - Blazegraph - wdqs-blazegraph.`
[13:33:59] I see active (running) since Mon 2022-09-05 13:31:03 UTC; 2min 45s ago
[13:34:26] oh, yes.... btullis ^^ fyi it's running again
[13:34:47] dcausse: I certainly didn't restart it, and did see it stopped :/
[13:35:02] addshore: ack. Maybe puppet restarted it?
[13:35:41] addshore: looking
[13:35:58] the cookbook does disable puppet so perhaps it's puppet?
[13:36:28] stopped it again
[13:36:33] I can only imagine I can't disable puppet, so I might have to get other folks to run all of the commands for me :)
[13:37:58] we can perhaps drop the /srv/wdqs/data_loaded flag that should prevent it from starting
[13:38:43] well no actually ...
[13:41:05] I have temporarily disabled puppet on wdqs1009
[13:41:47] ok trying again
[13:43:32] sigh... I think that stopping causes: wdqs-blazegraph.service: Main process exited, code=exited, status=143/n/a
[13:43:43] and then systemctl attempts to restart it I suppose
[13:47:45] disabled the service and looks like it's not starting again
[13:48:22] btullis: I think you can start the transfer
[13:50:27] dcausse: Ack, started now.
[13:55:17] thanks both!
[14:25:44] addshore: The copy is about 20% complete.
[14:25:54] amazing!
[15:33:44] XioNoX, topranks hey, what do I need to setup/configure to be able to access cloudvps public ips from a bare metal in the private prod network? (only direction: prod -> public cloud vps)
[15:34:35] from https://wikitech.wikimedia.org/wiki/Wikimedia_network_guidelines#/media/File:New_service_IP_flow_chart.png I'd say to use squid?
[15:34:55] (usecase is fetching prometheus data from a public prometheus endpoint exposed from the cloud vps network)
[15:35:02] (using https)
[15:35:16] dcaro: yep: https://wikitech.wikimedia.org/wiki/HTTP_proxy
[15:36:13] 👍 thanks a lot! will look how that integrates with prometheus xd
[15:36:56] dcaro: not directly related but that might help: https://phabricator.wikimedia.org/T303803
[15:37:10] nice!
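A minimal sketch of checking such an endpoint through the HTTP proxy from a prod host. The endpoint URL is a made-up example, and the proxy port is an assumption (check the HTTP_proxy wikitech page linked above).

    # Verify the cloud VPS prometheus endpoint is reachable via the proxy
    https_proxy=http://webproxy.eqiad.wmnet:8080 \
        curl -sI https://prometheus.example.wmcloud.org/metrics

For the Prometheus side, scrape jobs can be pointed at the same proxy via the `proxy_url` setting in the scrape config, so no host-wide proxy environment is needed.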
[15:48:29] volans: "rate-limit the transfer or check with netops to prevent alerts for high bandwidth firing" Who is best to chat to? and where? :) (For when I then get this file off the stat machine over the internet)
[15:49:30] addshore: that was mainly for the internal transfer, that could potentially saturate the host uplink
[15:50:23] but ofc I don't know to where you're transferring it, maybe a place with much more bandwidth than us :D
[15:51:15] First port of call would be google cloud so that I can go ahead and delete it from the stat machine, and I expect that side of the pipe might be larger than the wikimedia one
[15:52:14] For the record, I added a note to my favourite IRC channel #wikimedia-sre-foundations-netops-notify-transfer and asked about rate limiting the transfer in #wikimedia-netops - The copy to stat1004 is about 70% complete.
[16:03:02] addshore: happy to chat, in a meeting right now though. What's the context?
[16:07:16] XioNoX: transferring a 1.1TB file (wdqs index) out of production and into google cloud, and wanting to know if I should try to apply any rate limits etc
[16:08:37] addshore: from where?
[16:08:54] is it like on a single box?
[16:11:58] yeah, on a single box (stat1004)
[16:13:43] thinking about it, I will also have to go via webproxy.eqiad.wmnet, as I was going to curl it out to the google storage API
[16:14:52] But that also means I think I can just pass a `--limit-rate` to curl
[16:15:31] addshore: that host only has a 1G uplink, so it's going to work as a natural rate limiter
[16:15:56] the host's nic will saturate but no risk of issues on the infra
[16:16:59] Okay, lovely, I'll add a limit-rate of a little under 1G then as well so as not to saturate the host uplink too
[16:17:03] thanks!
[16:17:25] sounds good :)
[17:03:17] btullis: looks like it is approaching completion!
[17:26:06] addshore: Yep, copy completed. I've re-enabled puppet on wdqs1009 and done `sudo chown addshore:wikidev /home/addshore/wikidata.jnl` on stat1004.
[17:26:54] I think blazegraph will probably be restarted by puppet on wdqs1009 - given what happened earlier.
[18:36:58] yup, Active: active (running) since Mon 2022-09-05 17:28:14 UTC; 1h 8min ago
[18:37:08] thanks all!
[18:46:03] `curl: option --data-binary: out of memory` heh, so that's what happens when you try to upload a 1.1TB file with curl
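A sketch of what the streaming variant might look like: `--data-binary @file` buffers the whole file in memory, whereas `-T`/`--upload-file` streams it from disk. The destination URL is a placeholder, the proxy port is an assumption, and ~100 MB/s keeps the upload a little under the 1G uplink as discussed above.

    # Stream the journal to object storage instead of buffering it in memory
    curl -x http://webproxy.eqiad.wmnet:8080 \
         --limit-rate 100M \
         -T /home/addshore/wikidata.jnl \
         "https://storage.googleapis.com/<bucket>/wikidata.jnl"

`-T` issues a PUT by default, so this shape fits object-storage style APIs; authentication (signed URL or token header) is omitted here.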