[08:46:28] morning! can I get a +1 on T351051?
[08:46:29] T351051: Request increased quota for dexbot Toolforge tool - https://phabricator.wikimedia.org/T351051
[08:50:44] whoever had applied local changes to cloudcumin1001 /srv/deployment/wmcs-cookbooks, please do not do that again as that will cause puppet runs to fail and create an unnecessary alert
[09:23:50] jbond: are you aware that agent7.puppet-dev.eqiad1.wikimedia.cloud has been sending puppet failure emails for quite a long time?
[09:34:50] taavi: ack will take a look today cheers
[10:20:26] I'm trying to add an IDP OIDC service to idp.wmcloud.org, as well as an accompanying secret. I don't appear to have access to modify the Hiera Config in Horizon, nor can I find the right place to add the secret
[10:20:48] I can provide the snippet that goes in "profile::idp::services"
[10:27:34] slyngs: I can take care of that, is there a task with details?
[10:28:01] Yes, https://phabricator.wikimedia.org/T350725
[10:28:14] I'll just add the configuration snippet to the task
[10:34:21] slyngs: done!
[10:35:34] Thank you
[11:06:42] topranks: the cloud-support1-a-eqiad vlan is now empty btw, we probably want to remove it entirely?
[11:06:57] also I noticed https://netbox.wikimedia.org/ipam/prefixes/153/ in netbox, we can probably free that up?
[11:07:44] taavi: cool!
[11:08:02] Yep sounds right I’ll take a look shortly and tidy things up
[11:08:07] thanks!
[11:09:13] also also, T300427 is discussing changing how we do haproxy for the wiki replicas, that might also be a good opportunity to move it to use the cloudlb hosts and network setup
[11:09:13] T300427: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427
[14:39:14] hi all, i'd like to start migrating the wmcs production roles to puppet7. Am i safe to start migrating all of the codfw1dev cluster? (cc andrewbogott balloons)
[14:39:35] we have already migrated ~15% of production hosts and don't expect any issues but you never know
[14:42:31] jbond: iirc libvirtd on cloudvirts uses puppet certificates for authentication
[14:42:52] taavi: due to a migration mishap I need to cycle power on tools-docker-registry-06
[14:43:10] it'll be broken for a bit, something is going wrong with the cinder volume there
[14:43:40] jbond: but otherwise I think that's fine to start migrating a few roles at once
[14:43:58] andrewbogott: ok, do you need input from me or is that an FYI in case it pages? and is that the primary or replica node?
[14:44:09] taavi: can you point me to the puppet code where that may be?
[14:44:18] taavi: that question (primary/replica) was going to be my next question.
[14:44:27] Do you have a second to check while I try to rescue it?
[14:44:36] jbond: profile::openstack::base::nova::compute::service, around line 100
[14:44:42] taavi: thanks <3
[14:45:22] andrewbogott: good news! it's the secondary
[14:45:29] oh good
[14:45:41] that will help with my panic :)
[14:45:43] (the floating IP display I built into https://openstack-browser.toolforge.org/project/tools a while ago is proving itself very handy :P)
[14:55:45] argh this vm is really determined to stay on cloudvirt1030
[15:02:40] taavi: ok, finally, it's moved and the volume is re-attached.
[15:03:10] (this is in service of T351010)
[15:03:10] T351010: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010
[15:05:09] dhinus: Those hosts are now all drained and I'll decom them sometime soon; you can ignore them for the sake of reimages/upgrades. I also reimaged a couple of high-number cloudvirts but there are still lots left to do.
[15:07:23] what's the best way to proceed? shall I pick a random one and start draining it?
[15:08:25] if I drain one while you're around I'll feel more confident, in case I have questions :)
[15:08:55] dhinus: I don't think the order matters much. Maybe... ascending order?
[15:09:31] makes sense, let me recheck the list and which one you've reimaged already
[15:09:42] according to cumin, cloudvirt[1028-1029,1031-1036,1038-1050,1052-1056].eqiad.wmnet,cloudvirtlocal[1001-1003].eqiad.wmnet
[15:09:46] are all on bullseye still
[15:10:03] Hm...
[15:10:23] It would be /slightly/ more efficient if we depool all the bullseye hosts first
[15:10:28] so we don't move any VMs extra times
[15:11:19] although mostly VMs will get scheduled on the newly-reimaged hosts anyway because they're the empty ones
[15:15:33] hmm I'll try draining 1031, and let's see if the VMs all go to the bookworm hosts or not
[15:17:57] andrewbogott: I don't see any SAL messages about the hosts you drained in https://phabricator.wikimedia.org/T345811
[15:18:11] but I assume the command you used is cookbook wmcs.openstack.cloudvirt.drain --fqdn cloudvirt1031.eqiad.wmnet ?
[15:18:34] well that makes me wonder what other innocent phab task I spammed with those
[15:18:39] hahaha
[15:18:47] but yes, that's the command
[15:19:08] although since taavi (wisely) reverted the cookbook on cloudcumin1001 you'll need to break downtiming again I think
[15:19:15] hm... maybe not for draining? I don't remember
[15:19:18] it'll be obvious
[15:19:45] (sorry btw taavi, this is an ongoing issue where the actual cookbook source mostly doesn't work without hacks)
[15:29:38] I started it locally on my laptop
[15:29:56] to avoid the cloudcumin issues with amtool
[15:37:11] hmm a few VMs are ending up in 103x :/
[15:39:26] 1031 is drained, is the next step simply launching the reimage cookbook, or do I need to do anything else?
[15:39:37] afk for a bit, I'll be back for the checkin
[15:42:22] dhinus: let me check something...
[15:42:31] well, to be clear, I'm going to make sure it's not in the 'ceph' aggregate
[15:43:04] it isn't, the drain script must've taken care of that
[15:43:08] so, yeah, go ahead and reimage
[15:43:28] (and as always, manually delete the canary so it doesn't wind up an un-hosted zombie)
[15:43:50] right, I think deleting the canary ended up in an Icinga alert last time
[15:44:01] did you manually add a downtime in icinga?
[15:45:49] If you reimage from e.g. cumin1001 it should do the downtiming.
[15:46:08] Um, oh, right -- if you delete the canary /after/ you start the reimage then it won't alert
[15:46:13] Sorry, this is kind of silly
[15:46:35] I am not 100% sure that keeping canaries is worth the trouble
[15:50:15] will horizon let me delete it if the cloudvirt is down?
[15:50:22] i.e. after I start the reimage?
[15:51:25] or to put it in another way, if I start the reimage and nuke the canary, how do I tell openstack to remove it from its db?
[15:53:19] dhinus: horizon will let you
[15:53:34] ok then it seems easier to do that after the reimage has started, rather than before
[15:53:55] I was afraid the canary would enter some limbo where it cannot be deleted because it doesn't exist :)
[15:53:57] yep, I'm pretty sure that works
[15:54:01] let's try!
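For reference, a rough sketch of the drain-and-reimage sequence discussed above. The drain command is quoted verbatim from the log; the reimage cookbook name and flags below are an assumption (they are not quoted in the conversation), and cloudvirt1031 is simply the host used in this example.

    # 1. Drain the cloudvirt (dhinus ran this locally to avoid the cloudcumin amtool issue)
    cookbook wmcs.openstack.cloudvirt.drain --fqdn cloudvirt1031.eqiad.wmnet
    # 2. Start the reimage from a cumin host so downtiming is handled automatically
    #    (cookbook name/flags assumed, not stated in the log)
    sudo cookbook sre.hosts.reimage --os bookworm cloudvirt1031
    # 3. While the reimage is running, delete the canary VM from Horizon so it
    #    doesn't become an un-hosted zombie or trigger an Icinga alert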
[15:55:27] hmmmm is "Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4)" something important? It sounds important.
[15:55:48] (same alert on haproxy-3
[15:55:51] )
[15:58:04] taavi, know if that's expected?
[15:58:50] prometheus shows it was quite stable until today
[15:58:54] so yeah maybe something happened
[15:59:52] there's a runbook here https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy
[16:00:13] I'll try restarting the tool as mentioned there
[16:01:05] thanks! I'm logged in to the hosts but hadn't got that far
[16:03:38] restarted the "admin" tool, I see the metric going up
[16:04:18] the test pages look right
[16:04:43] recovered on alertmanager
[16:04:59] thanks dhinus, will keep an eye out to see if it dies again
[16:05:32] the alert is still firing, but the metric looks fine in Thanos
[16:06:19] and now the alert is ok too
[16:09:52] deleting the canary from horizon while the host is being reimaged worked fine
[16:10:08] hm, but now http://checker.tools.wmflabs.org/k8s/nodes/ready shows 'failed'
[16:10:19] hmmm
[16:11:20] that alert points to a different runbook :) https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[16:11:33] yeah, don't know if this is related
[16:12:02] lots of failed services on tools-checker-04.tools.eqiad1.wikimedia.cloud
[16:12:07] I'll see if they can be restarted
[16:12:47] now the same URL shows "OK"
[16:12:49] I didn't do anything
[16:13:03] it's flapping actually
[16:13:58] now it's OK
[16:14:23] I'm not sure these failed units are related, they're failing because of missing .ini files
[16:16:04] I wonder if it's just noticing that worker nodes are migrating? I wouldn't expect that to cause an alert because they're only offline for a split second
[16:16:14] I was thinking that might be possible yeah
[16:20:17] can someone help with a toolforge question? I'm logged into toolsbeta and trying to hit the following endpoint `curl --cert ./.toolskube/client.crt --key ./.toolskube/client.key --insecure 'https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/builds-api/pods/builds-api-7976896db5-gm295:9000/proxy/metrics'` but I'm getting a forbidden error, unable to access resource `pods/proxy`. Apparently the
[16:20:17] users I tried using for this don't have the permission to access the `pods/proxy` resource. The thing is prometheus can do this, but I have no idea how that works.
[16:20:17] Anyone know what user I can use to be able to access that endpoint on toolsbeta? (I've tried using user and most of the projects in /data/project)
[16:23:51] Raymond_Ndibe: try adding `-H "Impersonate-User: raymond-ndibe" -H "Impersonate-Group: system:masters"` to the command line, while being logged in as your own user
[16:31:46] ok thanks taavi let me try that
[16:33:50] taavi: it worked, thanks!
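Putting taavi's suggestion together with the original command, the combined call looks roughly like the following. Everything here is taken from the two messages above; substitute your own username for the Impersonate-User value, and note that the namespace, pod name, and port are the ones from Raymond's example and will differ in practice.

    curl --cert ./.toolskube/client.crt --key ./.toolskube/client.key --insecure \
      -H "Impersonate-User: raymond-ndibe" \
      -H "Impersonate-Group: system:masters" \
      'https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/builds-api/pods/builds-api-7976896db5-gm295:9000/proxy/metrics'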
[16:53:47] I got a question from Tajh over the weekend about https://toolviews.toolforge.org/ being "down". It turned out that the tool was running as expected, but ToolsDB was so overloaded that queries to the toolviews table were taking minutes to respond. Now I'm wondering if toolviews should move to its own DB or not. Thoughts?
[16:55:11] bd808: what kind of queries is it doing? how are those queries indexed?
[16:58:09] The queries are pretty boring stuff like . The schema is at https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/blob/main/schema.sql
[16:59:55] Having indexes doesn't help much if the DB server is too overloaded to page them into memory
[17:00:59] last minute ping: anyone have any topics I should raise at the SRE meeting?
[17:02:38] bd808: another option could be pointing it to the toolsdb replica (unless it's using that already, but I don't think so)
[17:03:08] but moving it to a Trove instance is also an option
[17:04:04] btw, where did you see toolsdb was "overloaded"? I lowered some cache last week to address the OOM errors, and I'm still debugging that
[17:04:25] I wonder if lowering the cache had an impact on toolviews or not
[17:04:29] dhinus: I could try pointing at the replica if it has a stable hostname.
[17:05:19] bd808: kinda :P it's been stable for 2 years, but it's likely to change in the future, I would really like to create a stable DNS for the replica, like we have for the primary
[17:05:25] dhinus: I don't have a definition of "overloaded" other than a query like https://toolviews.toolforge.org/api/v1/day/2023-11-12 normally responds in milliseconds and on Sunday it was taking 90-120 seconds.
[17:06:05] is that still the case today? it would be interesting to track if it's always slow or only at certain moments
[17:06:15] it's back to milliseconds
[17:07:05] I'm pretty sure it's only slow when something is actively hammering the db.
[17:07:41] it's possible but that's quite a dramatic change from milliseconds to 90 seconds
[17:11:01] Setting up some repeating query to measure latency is hard because adding such a thing changes how the db engine will cache things. It would likely still show CPU contention, but RAM contention would quite likely be hidden by the test itself.
[17:13:49] yeah I see. let's see if this happens frequently, it might also indicate ToolsDB is getting overloaded in general and that might affect other tools and not just toolviews
[17:13:55] toolviews is both write seldom (only written when front proxy access logs rotate) and read seldom so I think it's ideal for ToolsDB, usually.
[17:15:38] The OOMKiller problem seems to indicate that something changed in ToolsDB usage in the recent past. Tracking down what is of course not simple at all. :/
[17:16:13] yes agreed, we didn't change anything else so it seems something has changed in the usage
[17:17:27] lol. the tables for toolviews itself could just be kept in ram in the webservice :) https://tool-db-usage.toolforge.org/owner/s53734
[17:17:32] I still think MariaDB should handle it more gracefully, slower queries are fine (and can be tracked), but eating all the memory is weird
[17:18:25] I have now enabled slow query logging in ToolsDB, that was never enabled in the past. I haven't looked into all the details, but at first glance I didn't see anything too serious in the log
[17:18:50] s/in the past/in the last 12 months/ :P)
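For context on the slow-query logging dhinus mentions above: in stock MariaDB it can be toggled at runtime roughly as below. This is a generic sketch; how ToolsDB is actually configured (thresholds, log file location, whether it goes through puppet) is not stated in the log.

    # Enable the slow query log and set a threshold of 1 second (illustrative value)
    sudo mysql -e "SET GLOBAL slow_query_log = ON;"
    sudo mysql -e "SET GLOBAL long_query_time = 1;"
    # Confirm the settings and find the log file path
    sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';"
    sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'long_query_time';"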
[17:19:23] andrewbogott: cloudvirt1031 is back in the "ceph" aggregate
[17:19:39] Brooke did some sampled logging on ToolsDB at some point I think, but that would have been quite a while ago.
[17:19:42] great, did you try the canary cookbook? you can specify just the one host
[17:19:45] (the unset_maintenance cookbook worked, but I had to specify the aggregate in the params)
[17:20:29] running the canary cookbook now (on 1 host)
[17:22:33] done
[17:23:04] I'm logging off shortly, andrewbogott, if you want to reimage a couple more cloudvirts
[17:23:16] yep, ok! Have a good evening
[17:23:21] thanks
[17:27:01] * dhinus off
[17:58:58] Seeing various nftables things making progress is making me a bit sad in that Arturo is gone now. T187994 for context.
[17:58:59] T187994: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994
[19:12:46] * bd808 lunch
[19:39:49] jbond: I'm sorry if you answered these questions already this morning in the meeting: 1) Do you expect reimaging hosts with puppet7 to work today? 2) will moving hosts from puppet5 to puppet7 require a reimage?
[19:42:00] andrewbogott: yes, reimaging to puppet7 works; no, you don't need to re-image to migrate
[19:42:08] ok
[19:42:32] I currently get a cert-signing failure if I try to use the re-image script with 7 but that's not a problem for me if I can just keep using 5.
[19:42:39] (assuming 5 still works which I don't know quite yet)
[19:44:41] andrewbogott: if you want to reimage a new host to puppet7 you will need to add the following hiera entries first
[19:44:44] profile::puppet::agent::force_puppet7: true
[19:44:47] acmechief_host: acmechief2002.codfw.wmnet
[19:44:59] Ah, I see! Likely that's what was missing.
[19:45:00] thx
[19:45:14] if you are reimaging a host to puppet7 it should guide you to adding these values at the right time
[19:45:37] if you want to migrate a host without a reimage, use sre.puppet.migrate-host
[19:45:47] which will also tell you when to add the hiera entries
[19:45:59] andrewbogott: fyi you also said you would merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/968665
[19:46:06] it should be good now
[19:46:54] done
[19:47:07] cheers
[21:39:17] * andrewbogott afk while some very slow cookbooks run, back later in the evening
[22:25:34] Did some hypervisor rebuild just nuke a grid engine node? My inbox is full of "failed before writing exit_status: shepherd exited with exit status 19: before writing exit_status" failure notices from grid jobs.
[22:26:46] see my !log just a few minutes ago :-P
[22:27:09] mystery solved. thanks taavi
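Collecting the puppet 7 steps jbond outlined earlier (around 19:44) into one place: the two hiera keys are quoted verbatim from the log, while the migrate-host invocation is an assumption (the cookbook itself tells you when to add the hiera entries) and the hostname is only an example from earlier in the day.

    # Hiera entries to add first (verbatim from jbond above):
    #   profile::puppet::agent::force_puppet7: true
    #   acmechief_host: acmechief2002.codfw.wmnet
    # Then either reimage the host, or migrate it in place without a reimage
    # (exact arguments assumed, not quoted in the log):
    sudo cookbook sre.puppet.migrate-host cloudvirt1031.eqiad.wmnet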