[07:24:26] Hi folks! I am going to do some errands but I'll be back later on :)
[07:27:31] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade ORES to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T303801 (10elukey) After a chat with the team, we'll do the following: 1) Fold scipy 0.18.1 in ORES' wheels and remove python3-scipy from the hosts (even on stretch nodes) 2) Create a new...
[08:13:40] back, I had the wrong time for my appointment, will have to go out again later on
[08:45:54] afk again, this time should be the right one :)
[09:28:13] back :)
[09:38:57] 10Machine-Learning-Team: Use the scipy wheel instead of python3-scipy for ORES - https://phabricator.wikimedia.org/T305441 (10elukey)
[09:39:19] 10Machine-Learning-Team: Use the scipy wheel instead of python3-scipy for ORES - https://phabricator.wikimedia.org/T305441 (10elukey) @kevinbazira @AikoChou do you have time to look into this task?
[09:52:49] klausman: since we need to reimage ml-serve-ctrl*, I thought to just use bullseye https://gerrit.wikimedia.org/r/c/operations/puppet/+/777332/
[09:52:52] what do you think?
[10:02:17] 10Machine-Learning-Team: Automate the procedure to bootstrap minikube on the ML-Sandbox and to share it by multiple users - https://phabricator.wikimedia.org/T305447 (10elukey)
[10:02:59] LGTM
[10:03:35] Mh, actually. It does become a bit of a timebomb until we are done with everything
[10:03:44] But it's not super dangerous
[10:04:57] with everything downtimed and stopped it shouldn't be a problem, worst case scenario if it doesn't work we can revert to buster
[10:05:07] but it will save us the reimage in the very near future
[10:05:09] ack.
[10:05:47] I meant: having a changed image spec in that file, and not "enacting" it, and then someone innocently reimages the machine without knowing that it would get a different distro than it already has.
[10:06:01] But the time window is very short here, so no worries.
[10:06:25] If we had weeks-months between doing codfw and eqiad, I'd prefer not to reconfig eqiad before we're done with codfw
[10:08:02] sure sure
[10:22:41] going afk for lunch, ttyl for the first cluster reinit!
[10:29:52] lunch seems like a great idea :)
[13:05:06] 10Machine-Learning-Team: Use the scipy wheel instead of python3-scipy for ORES - https://phabricator.wikimedia.org/T305441 (10elukey) 05Open→03Stalled I am trying to see if our code can run with scipy 1.1 (Buster's version), will report back once I've tested it :)
[13:05:08] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade ORES to Debian Buster or Bullseye - https://phabricator.wikimedia.org/T303801 (10elukey)
[13:29:22] elukey: I presume we will put downtimes in both Icinga and AM?
[13:29:37] (also, did you get any info from Filippo regarding pybal et al?
[13:29:40] )
[13:31:23] yes I think that we can put host downtimes for all ml-serve* eqiad nodes
[13:31:33] and also try the downtime cookbook for the lvs endpoint
[13:31:40] so, in the eqiad case
[13:31:47] inference.svc.eqiad.wmnet
[13:32:04] and the control plane svc endpoint (I don't recall it exactly)
[13:32:12] do you want to take care of it?
[13:34:24] I'll do the LVS bits, yes. Want me to also do the Icinga and AM side?
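The downtime plan discussed above is later carried out with the host downtime cookbook (see the "Scheduling downtime on Icinga server alert1001..." output further down). A minimal sketch of that invocation, assuming the sre.hosts.downtime cookbook with --hours/--reason flags and a Cumin-style host query; the exact interface may differ:

# Sketch only: downtime all ml-serve eqiad hosts for 4h before the shutdowns.
sudo cookbook sre.hosts.downtime \
    --hours 4 \
    --reason "ml-serve eqiad cluster reinit" \
    'ml-serve[1001-1004].eqiad.wmnet,ml-serve-ctrl[1001-1002].eqiad.wmnet'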
[13:37:41] klausman: yep yep all :)
[13:37:55] let's do 3/4 hours just to be sure
[13:38:28] I can't find an lvs cookbook
[13:40:30] IIUC we can use the host downtime one
[13:40:38] since from the icinga point of view those are hosts
[13:40:48] ack
[13:41:47] Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: ml-serve[1001-1004],ml-serve-ctrl[1001-1002]
[13:41:56] Length 4h
[13:42:29] hmm, the cookbook does not like inference.svc.eqiad.wmnet
[13:42:42] doing it via the webui
[13:44:03] what does it say? We can report it to Filippo in case
[13:44:35] https://phabricator.wikimedia.org/P24118
[13:45:20] interesting
[13:45:27] We're not touching the etcds, right?
[13:45:43] we are just wiping them, but no downtime
[13:45:49] roger
[13:45:56] Then the downtimes are all set
[13:46:00] we'd need also to downtime in both UIs
[13:46:10] both UIs?
[13:46:21] icinga.wikimedia.org and alerts.wikimedia.org/
[13:47:09] I dunno how to downtime non-firing alerts in the latter
[13:47:30] didn't you add downtime today for ml-staging?
[13:47:42] for rsyslog, which was firing, yes
[13:48:31] I can probably derive it from that, sec
[13:50:18] ahh okok
[13:50:36] so if you check the bell icon at the top-right corner, there is a menu for downtime
[13:50:44] you can add a regex as well
[13:50:46] Hrm. The UI does not help with finding values for non-firing stuff :(
[13:51:03] would cluster=ml_server and dc=eqiad be enough? Who knows!
[13:51:09] serve*
[13:51:35] I'd use instance=ml-serve.*
[13:51:53] for the hosts, then the same for inference and the other lvs endpoint
[13:52:10] which other endpoint?
[13:53:33] the lvs endpoint in front of the ml-ctrl nodes
[13:53:58] ml-ctrl.svc.eqiad.wmnet?
[13:54:09] ml-ctrl.svc.eqiad.wmnet
[13:54:10] yes
[13:54:22] in both icinga and AM
[13:55:29] ml-ctrl does not exist in Icinga
[13:57:51] ah interesting, the endpoint is still in lvs_setup
[13:57:56] so no monitoring yet
[13:58:04] something to fix after the reimages :D
[13:58:10] Ah, we never completed that last step, yea
[13:58:14] ok so if all downtimes are set, I'd do
[13:58:29] 1) alert others in #wikimedia-sre about what we are doing
[13:58:53] 2) shut down ml-serve1001 and ml-serve-ctrl1002, wait a bit and see what alarms fire (if any)
[13:59:03] if none, we can shut down everything
[13:59:07] does it sound good?
[13:59:07] SGTM
[13:59:24] all right I'll let you do it :)
[13:59:32] Of course :-P
[14:01:18] There's no shutdown cookbook that I can see, so I presume we do it by hand?
[14:01:39] yep!
[14:02:17] Ok, so you mentioned a ctrl and worker each, I presume that was deliberate
[14:03:19] I'll shut them down at 1610 CEST (1410 UTC) unless we get nasty alerts
[14:04:00] yes yes so we see what fires, if any
[14:04:55] er wait, brain fart
[14:05:21] I'll shut them down _now_ (two hosts) and then we'll wait until 1610+ to see fireworks
[14:12:20] ack
[14:12:54] https://gerrit.wikimedia.org/r/c/operations/puppet/+/776879
[14:13:04] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/776876
[14:13:09] prepping the code changes as well
[14:13:50] Morning!
[14:13:54] \o Chris
[14:14:08] o/
[14:14:09] elukey: wasn't this covered by the CLs from Friday?
[14:14:20] Oh wait, those _are_ :D
[14:14:30] yep yep we just need to merge them
[14:14:30] I thought you had made new ones for me to review
[14:14:34] nono :)
[14:15:01] Seeing as how there have been no alerts firing, shall I shut down the remaining hosts?
[14:15:53] +1
[14:16:09] ok, on it
[14:17:28] ok, all shutting down now.
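For the non-firing alerts in alerts.wikimedia.org, the conversation settles on an instance=ml-serve.* matcher added through the UI's bell menu. An equivalent silence could in principle be created with amtool; this is only a sketch, assuming amtool is available and configured, and the Alertmanager URL below is a placeholder rather than the production endpoint:

# Hypothetical amtool equivalent of the UI silence discussed above.
amtool silence add \
    --alertmanager.url=https://alertmanager.example.wikimedia.org \
    --duration=4h \
    --comment="ml-serve eqiad cluster reinit" \
    'instance=~"ml-serve.*"'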
Should be less than a minute until they're off
[14:17:57] super
[14:18:11] once they are off, we can merge the changes above
[14:18:32] then we wipe etcd, and after that we should be ready to start the first reimage
[14:23:14] there was an LVS alert for the inference endpoint but nothing major
[14:25:57] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs1020&service=PyBal+backends+health+check What about this one? do we need to depool stuff?
[14:26:10] also the serve-ctrl VMs keep coming back after shutdown
[14:27:44] I'd say then to use ganeti's shutdown
[14:27:52] done
[14:27:54] gnt-instance shutdown ml-serve-ctrl etc..
[14:28:05] for the Pybal alerts we can ack them
[14:29:30] I can't seem to find an Ack link in the Icinga ui
[14:30:03] it is in the dropdown menu at the top-right corner
[14:30:08] "Select command"
[14:30:48] I don't see a dropdown
[14:31:29] ah ok sorry, go to the main page, and then in the left panel you'll see either "all alerts" or "all unhandled alerts"
[14:31:39] hit one of them and then you'll see the page
[14:32:17] the pybal alerts may stay as they are
[14:32:21] I can't believe there is no ack'ing link on the problematic service's page
[14:32:38] if we ack them we may hide some other issues (if any) of other endpoints
[14:32:48] I'd ack only the inference.svc ones
[14:33:38] Ok, that is done
[14:33:45] super
[14:33:56] if all the nodes are down, I'd merge the two CRs
[14:34:06] and then wipe etcd
[14:34:17] sgtm
[14:35:35] done
[14:35:56] going to execute
[14:35:57] ETCDCTL_API=3 etcdctl --endpoints https://ml-etcd1001.eqiad.wmnet:2379 del "" --from-key=true
[14:36:00] ack?
[14:36:10] on ml-etcd1001
[14:36:29] klausman: --^
[14:38:34] ack
[14:41:56] mmm I can still see the /calico root
[14:42:52] is there anything in/under it?
[14:43:07] yep
[14:44:41] you could always try rm -r
[14:45:02] i.e. `etcdctl -C https://ml-etcd1002.eqiad.wmnet:2379 rm -r calico/`
[14:45:46] oh, I just noticed -C is Not Done Anymore. Ok, --endpoints then :)
[14:46:13] I used etcdctl -C https://$(hostname -f):2379 rm -r /calico
[14:46:16] and worked :)
[14:46:24] excellent
[14:46:25] now I don't see anything if I type `ls /`
[14:46:31] ml-etcd1001 ~ $ etcdctl -C https://ml-etcd1002.eqiad.wmnet:2379 ls
[14:46:31] can you double check just to be sure
[14:46:32] ?
[14:46:32] ml-etcd1001 ~ $
[14:46:34] confirmed
[14:46:37] super
[14:47:04] ok so now we can reimage ml-serve-ctrl1001
[14:47:23] going to do it now
[14:48:41] klausman: in the meantime, can you check the alerts for ml-serve-ctrl?
[14:48:48] ack
[14:50:33] and silenced (AM)
[14:51:20] super thanks
[14:54:29] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10Cmjohnson) @elukey closed
[14:58:25] I am running puppet on ml-serve-ctrl1001
[14:58:35] klausman: once done, do you want to do 1002?
[14:58:42] can do
[15:02:29] https://i.imgur.com/4oW4csd.png <- neat tool for such things (parallel pinger with assorted output modules)
[15:07:19] the host seems ok, I see on the kubelet's logs that there are some problems registering the node though
[15:07:36] what is it missing?
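A sketch of double-checking the wipe described above: the v3 delete (del "" --from-key=true) does not touch the etcd v2 keyspace, which is presumably why /calico was still visible and needed the separate v2 rm -r. The endpoint is the one from the log; TLS options may be required depending on the local etcdctl configuration.

# List every key in the v3 keyspace; empty output means the wipe worked.
ETCDCTL_API=3 etcdctl --endpoints https://ml-etcd1001.eqiad.wmnet:2379 get "" --from-key --keys-only
# List the top of the v2 keyspace; /calico should no longer appear.
etcdctl --endpoints https://ml-etcd1001.eqiad.wmnet:2379 ls /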
[15:09:25] Unable to register node "ml-serve-ctrl1001.eqiad.wmnet" with API server: nodes is forbidden: User "kubelet" cannot create resource "nodes" in API group "" at the cluster scope
[15:10:05] maybe it is a matter of rbac rules, don't recall what we did originally
[15:10:12] Let me have a think
[15:10:58] https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Apply_RBAC_rules_and_PSPs You mean these?
[15:11:44] yes
[15:12:41] Should I run them?
[15:13:09] let's reimage 1002 now
[15:13:18] ok, doing that
[15:13:28] do you recall the whole procedure?
[15:13:33] yup
[15:13:46] super, I will do ml-serve1001 in the meantime if you agree
[15:14:02] set to netboot, start, attach console, wait for install to finish, (auto-shutdown), set to disk boot, start VM, do puppet stuff as usual
[15:14:29] you also need to clean the puppet cert on the puppet master
[15:14:44] then you have to use install_console to generate the new cert, sign, etc..
[15:14:46] yes that I am doing while the install is running
[15:14:57] aka "puppet stuff as usual" :)
[15:15:07] :)
[15:20:30] puppet cert done, now doing first puppet run
[15:23:08] elukey: wait, maybe 1001 just needs a reboot, post-puppet?
[15:25:00] klausman: already done
[15:25:46] I suspected so :)
[15:28:50] rebooting 1002
[15:29:37] klausman: going to kick off the reimage of the remaining workers
[15:29:59] ack
[15:30:25] mmm the cookbook seems to allow only one node
[15:32:03] yeah, it's a bit odd
[15:32:08] ok kicked off all reimages in separate tmuxes
[15:32:35] going to take a little break, bbiab
[15:32:38] wait
[15:32:43] ?
[15:32:49] should I run the helm/rbac thing, see if it fixes the ctrl nodes
[15:33:08] let's wait until all nodes are up with the kubelet
[15:33:13] ok
[15:33:15] then we sync and let them register
[15:33:18] should work fine
[16:03:43] All hosts seem to be back
[16:04:04] But kubectl refuses to work for me
[16:06:43] elukey: I have a date later tonight (in 45m or so), anything I can do right now?
[16:07:31] klausman: sorry in a meeting, but you can try to sync the rbac/etc.. rules if you want
[16:07:35] otherwise no problem!
[16:07:43] sure, will do the rbac thing
[16:19:35] makes no difference
[16:19:43] The SSL certs are also wrong :-/
[16:19:53] `certificate is valid for ml-ctr1001.eqiad.wmnet, ml-ctr1002.eqiad.wmnet, ml-ctrl.svc.eqiad.wmnet, kubernetes.default.svc.cluster.local`
[16:20:13] note the missing L in ctrl, plus it should be ml-serve-ctrl1001 etc, no?
[16:25:19] meeting finished
[16:25:34] where did you get the error?
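The SAN list quoted at 16:19 can be checked directly against the LVS endpoint with a standard openssl invocation, for example:

# Print the Subject Alternative Names of the certificate served on the API endpoint.
echo | openssl s_client -connect ml-ctrl.svc.eqiad.wmnet:6443 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'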
[16:25:43] ml-serve-ctrl1001, running kubectl
[16:25:55] The rbac stuff ran fine, but seemingly made no difference
[16:26:16] kubectl without -s tries to use localhost:8080, which isn't there
[16:26:32] so kubectl needs to be run from deploy1002 in our current setup
[16:26:51] it is not available anymore on ml-serve-ctrl since it was using the default unauthenticated socket
[16:27:01] ml-serve1001's kubelet says "Successfully registered node ml-serve1001.eqiad.wmnet"
[16:27:04] that is good
[16:27:11] `kubectl -s https://ml-ctrl.svc.eqiad.wmnet:6443/ cluster-info` asks for username/password
[16:27:31] and I also see Successfully registered node ml-serve-ctrl1001.eqiad.wmnet
[16:28:11] as root on deploy1002 you have to execute `kube_env admin ml-serve-eqiad`
[16:28:18] then kubectl get pods -A should work
[16:28:55] the SSL certs are good afaics, ml-ctrl.svc.eqiad.wmnet is the right LVS endpoint
[16:29:06] (behind it there are the ml-serve-ctrl100x nodes)
[16:29:10] yes, but the extra CNAMEs are useless
[16:29:23] (i.e. ml-ctr1001.eqiad.wmnet, ml-ctr1002.eqiad.wmnet)
[16:29:36] we should either fix or remove them from the certs
[16:30:34] klausman: I don't recall why it was done in that way, but we can take a look for sure
[16:31:09] I think we (or I, since I made the certs) were simply unsure whether the individual node names would be needed, and then got it wrong, which just didn't break
[16:31:44] It's more of a cleanup/consistency issue, but shouldn't be too hard to fix, I think.
[16:32:28] in theory no, we'd need to fix the private repo, re-issue a new cert and deploy it
[16:33:05] root@deploy1002:~# kubectl get nodes
[16:33:05] NAME STATUS ROLES AGE VERSION
[16:33:05] ml-serve-ctrl1001.eqiad.wmnet Ready 25m v1.16.15
[16:33:05] ml-serve-ctrl1002.eqiad.wmnet Ready 25m v1.16.15
[16:33:05] ml-serve1001.eqiad.wmnet Ready 25m v1.16.15
[16:33:08] ml-serve1002.eqiad.wmnet Ready 25m v1.16.15
[16:33:10] ml-serve1003.eqiad.wmnet Ready 25m v1.16.15
[16:33:13] ml-serve1004.eqiad.wmnet Ready 25m v1.16.15
[16:33:15] looks very good
[16:33:37] yarp, now that I was gently reminded of the deploy1002 indirection ;)
[16:34:16] I am going to sync the calico stuff ok?
[16:34:21] Yep
[16:34:33] are the labels needed/done?
[16:35:50] they are not needed but they should be added upon first registration of the kubelet
[16:43:00] anything else we need after Calico?
[16:44:12] I made a mistake with calico's sync, it was hanging due to missing namespaces, so I ctrl+c'd helmfile
[16:44:15] and now I get
[16:44:17] Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
[16:44:23] never seen it, will try to sort it out
[16:44:32] but after calico I think we can just follow the guide
[16:44:49] to solve the pybal issues we may just need to bootstrap istio quickly
[16:45:11] So I'll cancel dinner then
[16:46:17] nono please :)
[16:46:31] you can go, worst case I downtime everything for tomorrow
[16:46:42] please go to dinner, super fine
[16:46:51] klausman: --^
[16:47:11] Alright. I just don't want to leave you hanging
[16:48:00] nah I'll try to leave in a bit as well.. Enjoy!
[17:15:25] going to leave as well, downtimed hosts
[22:59:26] 10Machine-Learning-Team, 10ORES, 10Edit-Review-Improvements-RC-Page, 10Growth-Team, 10Regression: [regression-wmf.20] Recent changes filters disappear from the menu - https://phabricator.wikimedia.org/T290113 (10matmarex) When the issue occurs, the "damaging" filter configuration is missing from `mw.conf...
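For the helm error hit at 16:44 above ("another operation (install/upgrade/rollback) is in progress", caused by interrupting helmfile mid-upgrade), a common Helm 3 recovery is to roll the stuck release back to its last deployed revision and then re-run the sync. This is only a sketch: the release name, namespace, and revision number are hypothetical, and in this setup the commands would presumably be run from deploy1002 under the same `kube_env admin ml-serve-eqiad` environment.

# Find the revision stuck in pending-install/pending-upgrade (release/namespace names are hypothetical).
helm history calico -n kube-system
# Roll back to the last revision whose status is "deployed" (assumed to be 3 here),
# which clears the pending operation, then re-run the interrupted helmfile sync.
helm rollback calico 3 -n kube-system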