[00:26:21] Machine-Learning-Team, artificial-intelligence, Edit-Review-Improvements-RC-Page, Growth community maintenance, and 3 others: Enable ORES in RecentChanges for Hindi Wikipedia - https://phabricator.wikimedia.org/T303293 (MMiller_WMF) Thanks for the links, @1997kB. @kostajh @DMburugu @MShilova_WMF...
[07:04:57] hello folks :)
[07:17:00] rebooted all the ores codfw nodes, going to do the ml-serve-ctrl nodes as well since they are easy
[07:25:46] good morning :)
[07:27:07] o/
[07:27:50] elukey: thanks, Luca. I see my home dir is back \o/
[08:03:23] aiko: for revscoring, if you want to send a complete new pull request it will probably be better and cleaner, I'll abandon mine afterwards
[08:03:28] what do you think?
[08:05:40] klausman: o/ I have completed the reboots on ores2* and I've also done the orespoolcounter[12]* nodes (simple reboot cookbook, one at a time, nothing fancy, IIUC these poolcounters can tolerate one node down)
[08:05:59] I've also done ml-serve-ctrl* since it was easy and quick, and ores100[1-3]
[08:06:22] I'll leave the rest to you (1004->1009), no rush, anytime
[08:12:17] elukey: sounds good. I'll send a new PR :)
[08:12:24] thanksss
[08:12:30] I am going afk for a quick errand, bbl!
[09:22:57] elukey: does `sudo cookbook sre.hosts.reboot-single --depool -r "Reboot to activate new kernel for T304938" -t T304938 'ores1004*'` look sane to you?
[09:23:35] klausman: yep!
[09:23:46] What dashboard do you watch for errors?
[09:24:11] so https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&refresh=1m and https://grafana.wikimedia.org/d/vAN_bQemz/ores-advanced-metrics?orgId=1&refresh=1m, but mostly the first
[09:24:22] if there is anything really bad we'll see it soon
[09:24:36] Alright. Starting with 1004 now
[09:25:34] Oh, the reboot cookbook does not have a -t option?
[09:26:08] oh wait, nvm
[09:27:08] cumin can't put comments in sec tickets like that. And arguably, with that many hosts, it would be spammy anyway
[09:32:16] ack
[09:43:20] ok, 1004 done, and I see no unusual blips and bops, proceeding with the rest, one at a time
[09:46:11] super
[09:57:47] klausman: I rechecked the IP pools after our last chat about what svc/pod subnets were needed, and I think that we should leave the biggest subnets for svcs and the smaller ones for pods, staging and prod (sorry for this back and forth, but better to discuss it twice before a cluster reinit)
[09:58:04] I don't recall if we already changed the descriptions, but I'd do it
[09:58:39] /23 for svc and /24 for pods (staging), /20 svc and /21 pods (production)
[09:58:59] the staging cluster will differ from the other service ops ones but we don't really care
[09:59:13] what do you think? Sorry for the extra discussion
[10:14:43] I'm confused :D
[10:15:01] So we expect more services than pods?
[10:15:13] Or more pods than services?
[10:15:40] yeah I know, sorry for the back and forth
[10:15:43] more services than pods
[10:16:07] since knative creates svcs for revisions etc..
[10:16:25] But wouldn't each service have at least one pod providing it?
[10:17:27] IIUC this is not the case for knative, since a revision contains a svc + all the info to go back to its previous state (namely spin up a pod etc..)
[10:18:00] But can one pod provide more than one service/version of a service?
[10:18:16] not that I know
[10:18:17] At the same time, that is
[10:18:41] Then the number of pods would always be either equal to or larger than the number of services, no?
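(As an aside on the /23, /24, /20 and /21 prefixes proposed above: here is a minimal Python sketch of what those prefix lengths mean in address capacity. The base addresses are placeholders chosen for illustration; only the prefix lengths come from the chat, the real allocations live in Netbox.)

```python
import ipaddress

# Illustrative only: base addresses are placeholders, just the prefix
# lengths (/23, /24, /20, /21) come from the discussion above.
pools = {
    "staging svc  (/23)": "10.0.0.0/23",
    "staging pods (/24)": "10.0.2.0/24",
    "prod svc     (/20)": "10.0.16.0/20",
    "prod pods    (/21)": "10.0.32.0/21",
}

for name, cidr in pools.items():
    net = ipaddress.ip_network(cidr)
    # num_addresses counts every address in the block (incl. network/broadcast)
    print(f"{name}: {net.num_addresses} addresses")
```

(This prints 512/256 addresses for staging and 4096/2048 for production, i.e. each svc range is twice the size of the matching pod range, which is the "more services than pods" sizing being discussed.)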
[10:19:05] nope, otherwise I wouldn't have opened https://phabricator.wikimedia.org/T302701 :)
[10:19:30] for example https://phabricator.wikimedia.org/T302701#7740625
[10:19:55] Mh. I think my brain fart was this:
[10:20:39] Even if a service is down (but configured), it "eats" a service IP. So you could have 1000 services configured (using 1000 IPs), even if only one service and one pod were actually running.
[10:21:02] Thus there can be more services than pods, even if it is counter-intuitive
[10:21:20] (I had thought a service only eats an IP if it is up/served by a pod)
[10:21:33] Does this make sense?
[10:21:53] yes, this is my understanding as well
[10:22:15] I was puzzled when I found the problem since I didn't expect it, but it seems to be the way that knative works
[10:22:43] it is probably very handy to have the k revisions using the same svc etc..
[10:22:52] (from the devs' point of view, I mean)
[10:23:05] I mean, it makes a certain amount of sense, since you'd want to allocate all resources early, both to make startup fast and so networking can be set up
[10:24:01] could be, yes
[10:24:02] So should we switch the config (and netbox) back to have more service IPs than pod IPs?
[10:24:11] yes please, sorry again
[10:24:14] np :)
[10:24:16] both prod and staging
[10:24:32] after that we should be more resilient in the future
[10:24:33] I'll do the changes after the ORES reboots
[10:24:50] yep no problem, remember to change the subnet in your ml-staging patch as well
[10:24:57] then I think it should be ready
[10:25:37] the other question mark that I have for IP pools is ipv6, since we have allocations for the current clusters
[10:25:59] we shouldn't have issues there in theory :)
[10:27:39] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ores1007 <- Is this a problem? What to do about it?
[10:28:28] weird, the cookbook finished correctly
[10:28:48] Yeah, but at the bottom it said it won't repool because it's not healthy in icinga
[10:28:56] ah ok, that makes sense
[10:29:07] is there anything ongoing with celery on the box?
[10:29:19] the alert is on 2/10 failures, so maybe it just takes time?
[10:29:50] 1006 was similar but recovered (and I pooled it by hand)
[10:30:13] (God I hate the Icinga webui..)
[10:31:12] ah yes, I think I know what's happening
[10:31:14] we have profile::ores::celery::workers: 90
[10:31:42] the check is defined in profile::ores::web
[10:31:54] it wants at least 88 celery processes
[10:32:39] elukey@ores1007:~$ /usr/lib/nagios/plugins/check_procs -C celery -c 88:92
[10:32:39] PROCS OK: 91 processes with command name 'celery' | procs=91;;88:92;0;
[10:33:03] Ah, so we just have to wait for (or force) the next check?
[10:33:09] klausman: we can force the next service check via the icinga UI or just wait for the next check to be scheduled
[10:33:15] exactly
[10:33:30] Some things never change %-)
[10:33:58] This was the thing to do in Nagios ca. 2003.
[10:34:38] and it's green. repooling
[10:39:39] hopefully we'll decom icinga and nagios in the future, in favour of alertmanager and prometheus
[10:39:43] but still a long way to go :)
[10:39:50] going afk for lunch! ttl
[10:59:41] Alright, 1004-1009 all rebooted, and updated the task. Will watch things for a bit and then go for lunch
[13:13:54] updated the task with my reboots as well!
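(For context on the celery alert above: the Icinga check counts processes named `celery` and goes critical when the count falls outside 88:92, a band around `profile::ores::celery::workers: 90`. Below is a rough Python approximation of that threshold logic, assuming a Linux host with `pgrep` available; it is not the real `check_procs` plugin, just a sketch of what it verifies.)

```python
import subprocess
import sys

# Rough approximation of `check_procs -C celery -c 88:92`; not the real
# Nagios plugin, only the threshold logic, assuming `pgrep` is available.
CRIT_MIN, CRIT_MAX = 88, 92  # band around profile::ores::celery::workers: 90

def count_celery_processes() -> int:
    # pgrep -x matches the exact command name, -c prints only the count
    result = subprocess.run(["pgrep", "-c", "-x", "celery"],
                            capture_output=True, text=True)
    return int(result.stdout.strip() or 0)

def main() -> int:
    n = count_celery_processes()
    if CRIT_MIN <= n <= CRIT_MAX:
        print(f"PROCS OK: {n} processes with command name 'celery'")
        return 0
    print(f"PROCS CRITICAL: {n} processes with command name 'celery'")
    return 2  # Nagios/Icinga exit code for CRITICAL

if __name__ == "__main__":
    sys.exit(main())
```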
[13:14:04] I think that we can split the ml-etcd clusters as well
[13:15:09] going to reboot the ml-etcd2* nodes in codfw
[13:15:28] I have just checked sudo etcdctl -C https://ml-etcd2001.codfw.wmnet:2379 cluster-health and then rebooted
[13:15:31] nothing more
[13:16:51] Will do eqiad in a hot second
[13:20:31] there is also https://grafana.wikimedia.org/d/Ku6V7QYGz/etcd3?orgId=1&var-site=codfw&var-cluster=ml_etcd&var-instance_prefix=ml-etcd
[13:20:34] that is nice
[13:21:39] Neat
[13:21:46] also: eqiad done and task updated
[13:22:21] super
[13:33:14] ml-etcd2* done!
[13:43:02] Machine-Learning-Team, artificial-intelligence, Edit-Review-Improvements-RC-Page, Growth community maintenance, and 3 others: Enable ORES in RecentChanges for Hindi Wikipedia - https://phabricator.wikimedia.org/T303293 (MShilova_WMF) @MMiller_WMF , Yes. I added a calendar reminder about this and...
[13:47:22] I am going to reboot the nodes that I did this morning again, the kernel was not yet installed :(
[13:47:33] the ones that Tobias did are ok :)
[14:00:57] Oh, interesting. I thought the ticket said the packages had been deployed fleet-wide
[14:02:48] yeah, I started the reboots when the packages were not all rolled out, usual luck :D
[14:03:16] klausman: if you have time and want to do ores100[1-3] please go ahead!
[14:29:14] I am trying to adapt the deployment-charts code for the istio mesh and it is more difficult than expected :D
[14:35:18] I'll do 1-3 in a second (currently on the phone for citizenship things)
[14:47:47] doing them now
[14:50:33] thanks! no rush
[15:26:43] Dammit, have to do them again, since I, too, did not check what kernels were installed :D
[15:27:49] nono, they were!
[15:28:03] all ores nodes are fine now
[15:28:10] it was only my reboots
[15:28:20] (just checked with Moritz)
[15:29:21] You sure? it still said revision 18
[15:29:30] whereas the newest kernel is 20
[15:29:55] Huh, you're right
[15:30:06] for some reason, the etcds got revision 20
[15:30:37] all our reboots should be done in theory
[15:30:57] ~ $ ssh ml-etcd2001.codfw.wmnet uname -srv
[15:30:59] Linux 4.19.0-20-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17)
[15:31:01] ~ $ ssh ores1001.eqiad.wmnet uname -srv
[15:31:03] Linux 4.9.0-18-amd64 #1 SMP Debian 4.9.303-1 (2022-03-07)
[15:31:11] I guess it's because they're Buster nodes?
[15:31:26] yep
[15:31:38] Well, 1001 got a superfluous reboot then :D
[15:33:51] less dust!
[15:34:14] yes, let's go with that
[16:14:03] ok, I have a draft of all the changes to enable the istio mesh, only 3 code reviews :D
[16:14:15] starting from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/775343/
[16:14:27] will have to re-review them but overall they should work
[16:16:02] going to take a little break before meetings
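(A small sketch of the kind of kernel spot check done by hand above with `ssh <host> uname -srv`. The host list is just the two hosts quoted in the log; a real fleet-wide check would normally go through cumin/debdeploy instead.)

```python
import subprocess

# Spot-check running kernels the same way as the manual checks above.
# Host list is only the two hosts quoted in the log, for illustration.
HOSTS = [
    "ml-etcd2001.codfw.wmnet",
    "ores1001.eqiad.wmnet",
]

for host in HOSTS:
    result = subprocess.run(
        ["ssh", host, "uname", "-srv"],
        capture_output=True, text=True, timeout=30,
    )
    kernel = result.stdout.strip() or f"error: {result.stderr.strip()}"
    print(f"{host:30s} {kernel}")
```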