[09:00:27] * dcaro sad for redis
[09:02:22] morning
[10:43:06] coincidentally, Microsoft open sourced a cache store three days ago which "can work with existing Redis clients" https://github.com/microsoft/garnet
[10:43:27] it is MIT Licensed
[10:44:26] same protocol, entirely different backend with claims of improved performance
[10:50:18] written in C#, how microsofty :)
[10:52:25] toolsbeta puppetdb postgres OOMd which is why you're seeing all these puppet failure alerts. I restarted it. let's figure out a better fix if it happens again
[10:55:24] I thought andrew was still playing with puppet7
[11:08:23] I'll look into the alerts then
[11:21:32] it seems nova on cloudcontrol1007 died sometime tonight, the last log is about sqlalchemy failing to connect to the DB (ssl timeout)
[11:51:04] fyi. I'm playing with ceph (taking osds out and in, for the hard drive testing issues), there should be no issues, but let me know if you see anything weird going on
[11:51:20] (I'm being careful and adding/removing osds one by one, rebalancing, adding next...)
[11:52:02] ack, thanks for the heads-up
[12:10:53] I'm resizing the tools-static host, minor service disruption expected
[12:12:27] done
[12:34:28] andrewbogott: T360626
[12:34:29] do you think that will stop the flakiness?
[12:34:31] T360626: Frequent radosgw 500 errors with OpenTofu - https://phabricator.wikimedia.org/T360626
[12:34:36] I'm hoping it will
[13:32:09] do we have a cookbook of sorts to delete old node certificates? (though it might be a bug, we seem to have suddenly >800 `Found non-revoked Puppet certificates for 807 deleted instances on cloudinfra-cloudvps-puppetserver-1`)
[13:32:59] there is a script, and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013240 fixes it for puppet 7
[13:33:01] I can get them from /var/lib/prometheus/node.d/openstack_stale_puppet_certs.prom, nice
[13:33:08] however that alert is broken for the cloud-vps wide puppetserver
[13:33:16] the exporter only works on project-local puppetservers
[13:33:27] that alert should not have been applied to the cloud-vps wide puppetserver in the first place
[13:34:19] oh, let me look at why it was applied then, might be just a leftover (if it was applied by mistake and then fixed)
[13:34:45] shouldn't we be cleaning those up anyhow though?
[13:34:53] it's probably using `role::puppetserver::cloud_vps_project` instead of a dedicated role without the exporter applied
[13:34:56] there's mostly fullstack VMs
[13:36:07] there's a separate script (wmcs-puppetcertleaks) for that, which takes into account that clients can be in some other project than the puppetserver itself
[13:36:20] ack
[13:37:10] yep, that role is applied
[13:37:12] https://www.irccloud.com/pastebin/0DCBYnTY/
[13:37:19] I'll make the puppet patch
[13:40:35] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013312
[13:40:42] that's it right?
[13:41:30] maybe make the role name ::cloud_vps_global just to make it super clear it's not the per-project one? otherwise LGTM
[13:41:40] ack
[13:44:00] +1d
[13:44:35] thanks :)
[14:00:09] alert gone \o/
[14:21:23] taavi: Do you think radosgw just now started doing that, or is it just a new use case so you're only now noticing it?
[14:21:53] andrewbogott: it's been a while since I last did anything with tofu/terraform, but I don't remember it doing that when I set it up
[14:22:23] ok. It /could/ be due to dcaro rebalancing things although it's pretty awful behavior if so
[14:25:05] right now the cluster is done rebalancing, is it still happening?
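(Context note on the radosgw flakiness above: one way to check whether the 500s from T360626 are intermittent independently of OpenTofu is to issue plain S3 requests in a loop and count the failures. A hedged sketch with boto3 follows; the endpoint URL, bucket name and credentials are placeholders, not the actual values from this discussion.)

```python
#!/usr/bin/env python3
"""Probe a radosgw S3 endpoint for intermittent 5xx errors (sketch for T360626).

The endpoint and bucket below are placeholders; point them at the real radosgw
endpoint and a bucket your credentials can read before running this.
"""
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

ENDPOINT = "https://object.example.wmcloud.org"  # placeholder radosgw endpoint
BUCKET = "tofu-state-test"                       # placeholder bucket name

# Disable client-side retries so every 500 radosgw returns is actually visible.
s3 = boto3.client(
    "s3",
    endpoint_url=ENDPOINT,
    config=Config(retries={"max_attempts": 1}),
)

failures = 0
for i in range(100):
    try:
        s3.head_bucket(Bucket=BUCKET)
    except ClientError as exc:
        failures += 1
        print(f"request {i}: {exc}")
print(f"{failures}/100 requests failed")
```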
[14:25:51] I have not seen any slow-ops either during the rebalance
[14:28:36] taavi: regarding the toolsbeta puppetmaster OOM, I'll look at resizing it. Otherwise I regard the puppet migration there as finished.
[14:28:53] can you also remove the old VMs then?
[14:29:17] yes.
[14:29:44] dcaro: it seems like it's intermittent, but I still get failures here and there
[14:29:57] taavi: they're shut down already
[14:30:33] taavi: then the rebalancing itself is not the cause (might increase the frequency of errors though)
[14:39:34] taavi: I enlarged the toolsdb servers in toolsbeta and tools. Hopefully they'll keep up now.
[15:18:19] is anyone having laggy network sshing to ceph/toolforge? I just added some drives, and got kicked out of the restricted bastion
[15:18:28] andrewbogott: bd808: have either of you given any thought on what to do with labtestwikitech if/when OpenStackManager is no longer needed on wikitech?
[15:21:20] there were a bunch of pings lost between ceph hosts
[15:21:23] https://usercontent.irccloud-cdn.com/file/UxuYmkI8/image.png
[15:23:21] and those are not only cloudcephosd1030 (the one with the new drives :/)
[15:23:30] so I'm suspecting some switch overload
[15:23:45] taavi: I would assume that we will still have bot integrations with wikitech like the ones that make the Nova Resource namespace pages after OSM is gone, but I haven't thought about it too much.
[15:23:45] topranks: ^ are you around? can you help debug?
[15:23:57] * topranks here
[15:24:15] network seems better now though
[15:24:24] bd808: I think we can migrate the namespaces to wmf-config.git. I'm more curious about how accounts for codfw1dev will be created in the future
[15:25:06] topranks: I added 5 drives to cloudcephosd1030, and the moment they started rebalancing, the network started misbehaving (I lost connection to the host/timeout), but not only to/from that host (see screenshot)
[15:26:52] The graph is here https://grafana-rw.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?forceLogin&from=now-30m&orgId=1&to=now (in case you want to play with it)
[15:27:26] yeah looks like the link from cloudsw1-f4 to cloudsw1-d5 saturated
[15:27:39] what's the current status?
[15:28:08] things seem better now
[15:28:21] the bw seems to have subsided a little
[15:28:35] https://librenms.wikimedia.org/graphs/to=1711034700/id=25230/type=port_bits/from=1711031100?
[15:29:12] the real answer to this is to add the qos config to ensure that the bulk data traffic is not able to squeeze out the heartbeat traffic, or other stuff like ssh
[15:29:33] I'm in the process of reworking that at the moment but it's not set up yet on the cloudsw
[15:29:46] yep, that's awesome yes
[15:30:08] the cloudceph hosts are still using iptables are they?
[15:30:53] I think so yes
[15:30:56] I had a meeting with moritz on it earlier, I had been targeting the new puppet resources at nftables only but I may need to also include support for iptables
[15:30:59] taavi: As long as I'm dreaming, I'd like wikitech to become boring enough that I never think about it and we no longer need labtestwikitech. That would require some other account-creation workflow for codfw1dev though.
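(Side note on the QoS plan discussed above: the scheme relies on senders marking their packets with a DSCP value that the switches then queue on, so bulk replication traffic cannot starve heartbeats or ssh. A minimal, hedged illustration of what that marking looks like at the socket level; the CS1 class used below is just an example value, not necessarily the class the actual QoS design assigns to ceph bulk traffic.)

```python
#!/usr/bin/env python3
# Minimal illustration of DSCP marking: the application sets the DSCP bits in
# the IP TOS byte, and the network's QoS policy classifies and queues on them.
# CS1 is only an example "bulk / lower effort" class for this sketch.
import socket

CS1 = 8  # DSCP class selector 1

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# DSCP occupies the upper six bits of the TOS byte, hence the shift by two.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, CS1 << 2)
print("TOS byte is now:", sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))
sock.close()
```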
[15:31:25] not a huge deal, it may be possible to move the ceph hosts to nftables, if not it's not huge extra work to also add support for ferm/iptables
[15:32:04] on the ceph side I think it does not care, it does not manage rules itself, so it should be relatively easy I think
[15:33:48] yeah I looked into it before, ceph already marks the packet DSCP so it should be easy to implement
[15:34:13] I was working on it more this week, probably early next quarter I'll be able to roll out
[15:34:23] until then we may get these incidents if there is a large rebalancing :(
[15:34:33] awesome, we have a bunch of hosts coming in
[15:34:49] I'll have to add the drives one by one (add drive, wait for rebalance, add drive...) it might take a while
[15:34:51] the pattern is probably like this - large spike in usage which upsets things, but then it settles down after a few mins
[15:35:00] ok let's try to avoid that
[15:35:15] when are you expecting to get the new drives? these are replacing the dodgy ones from Dell?
[15:35:49] sorry, new hosts you said, not drives
[15:36:05] these were 5 that they sent yes, but before the end of next fiscal, we will get a bunch of hosts (hopefully ~12, with 8 drives each) coming in
[15:36:16] just 5 drives, not hosts
[15:36:21] ok
[15:37:35] cloudcephosd's are all on buster are they?
[15:37:56] there's a few on bullseye
[15:38:18] we were waiting for the hard drives to get fixed before the reimage/upgrade
[15:38:24] (less moving pieces)
[15:38:30] ok yes
[15:38:40] it's taking a while though xd
[15:38:51] any idea why the f4<->d5 link shows up smaller here: https://grafana-rw.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?forceLogin&from=now-30m&orgId=1&to=now
[15:38:59] (I might be getting the units wrong)
[15:39:12] specifically https://grafana-rw.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?forceLogin&from=now-30m&orgId=1&to=now&viewPanel=107
[15:42:12] not sure - I did notice that in LibreNMS too
[15:42:40] the traffic is all routed so it *should* take the most optimal route
[15:43:19] assuming all the hosts are in rack-specific vlans/subnets
[15:45:09] I think so, let me verify
[15:45:24] I see the same traffic from f4 to c8 and d5?
[15:45:53] the vlans are set on netbox for all ports on that switch
[15:46:02] https://netbox.wikimedia.org/dcim/devices/3935/interfaces/?page=1
[15:47:13] actually I checked - it ought not to matter
[15:47:33] librenms reports different data for ports -54 (f4-d5) and -55 (f4-c8), but on grafana I'm getting the same value, I might be doing some bad matching
[15:47:38] there are cloudceph hosts on the legacy "cloud-host1-eqiad" vlan
[15:47:43] they are in racks D5 and C8
[15:48:04] however from rack F4 (or E4) the traffic to that subnet should be load-balanced 50/50
[15:48:17] i.e. F4 will send half the traffic to C8, half to D5
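(The "add a drive, wait for the rebalance, add the next one" approach mentioned above is easy to script if it ever needs to run unattended. A hedged sketch: only the `ceph health` call is a real command here, the add_osd step is a stub for whatever cookbook or manual procedure actually brings the OSD in, and the device names are made up.)

```python
#!/usr/bin/env python3
"""Sketch: bring OSDs in one at a time, waiting for the cluster to settle."""
import subprocess
import time


def cluster_healthy() -> bool:
    """Return True once `ceph health` reports HEALTH_OK again."""
    out = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True, check=True
    ).stdout
    return out.startswith("HEALTH_OK")


def add_osd(device: str) -> None:
    # Placeholder for the real procedure (cookbook/manual) that adds the OSD.
    print(f"adding OSD on {device} (stub)")


# Placeholder device list; the point is the pacing, not the names.
for device in ["/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi", "/dev/sdj"]:
    add_osd(device)
    # Let the rebalance finish before touching the next drive, to avoid
    # saturating the inter-rack links again.
    while not cluster_healthy():
        time.sleep(60)
```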
[15:48:32] silly me, I'm getting the wrong port on grafana xd
[15:48:49] oh, that's interesting
[15:49:56] I rarely look at grafana for that as I never learnt the graphite query syntax
[15:50:10] (hoping to get them in prometheus via gnmi soon also)
[15:50:27] LibreNMS ought to be correct - although it misses peaks cos of 5min sampling
[15:51:09] afaik, the data in grafana comes from librenms, so it ends up being similar or worse xd, just fixed the interfaces and there's the bump
[15:54:41] yeah you're right
[15:55:01] we've been pulling some stats using GNMI from network devices and pushing to grafana
[15:55:02] https://grafana-rw.wikimedia.org/d/5p97dAASz/cathal-network-queue-stats
[15:55:33] in prep for the QoS roll-out. I notice the cloudsw are missing there though, I think there is an issue with this export on the QFX5100 devices in c8/d5 but I'll check up and see if we can get it added
[15:56:02] that'd be great yes!
[15:57:01] taavi: which host was failing to sync w/pcc?
[15:57:46] at least tools-puppetserver-01
[16:04:03] oh right, with puppetdb
[16:04:10] different script than the one I rewrote
[16:05:25] topranks: do you have a task for both the qos and the stats I can follow?
[16:06:02] yeah the stats is here:
[16:06:02] https://phabricator.wikimedia.org/T326322
[16:06:17] let me try to dig out the QoS one, likely it needs an update from me :)
[16:06:49] 👍
[16:07:08] the basic qos one I'm gonna track the rollout in is this: https://phabricator.wikimedia.org/T339850
[16:07:57] and some info on wikitech: https://wikitech.wikimedia.org/wiki/Quality_of_Service_(Network)
[16:17:14] hmm, yep, it seems the traffic between c8<->d5 has been high at times lately, hitting saturation (and forcing some packet drops), this is from me taking out a few drives the two previous days
[16:17:16] https://usercontent.irccloud-cdn.com/file/wQ5psD7Y/image.png
[16:17:27] those spikes on that same interface
[16:17:49] I'll try to be nicer to the switches :S
[16:42:41] yeah this is on us too though
[16:42:55] dcaro: we are attending a buildpack session at kubecon, and they are showing kpack
[16:42:59] but thankfully we have something in the works, we will prioritise roll-out to the cloud switches given the work you're doing
[16:43:25] arturo: are they vmware engineers?
[16:43:28] it seems like a really nice tool, not sure if it is better/easier than pack & tekton, our current implementation
[16:43:49] I did a run on it, it's a bit more complicated
[16:44:00] Check this out! (I'm excited to attend "Container Image Workflows at Scale with Buildpacks - Juan Bustamante, Broadcom & Aidan Delaney, Bloomberg" at KubeCon + CloudNativeCon Europe 2024. https://sched.co/1Yhj6. #)
[16:44:13] they are not vmware engineers
[16:45:11] the broadcom guy you'd assume hates vmware :P
[16:45:31] I know Juan
[16:45:44] broadcom nowadays is vmware
[16:45:51] (iirc)
[16:45:57] other way around :)
[16:46:06] oh yes xd
[16:46:17] he was at redhat before iirc
[16:46:27] he is on stage ATM
[16:46:44] https://usercontent.irccloud-cdn.com/file/w0A1V8TC/irccloudcapture5275268527025322100.jpg
[16:47:59] kpack had the nice thing that you kind of define your own builder with the buildpacks you want (builder + image are CRDs there), and it will build it on the fly
[16:48:53] it's quite a different approach from tekton, which is essentially agnostic to buildpacks and just runs a pipeline
[16:50:42] if your users were expected to be creating builders and buildpacks and such, it would make a better case (e.g. if we had internal infra/developer teams using our platform)
[16:52:10] dcaro: as the talk progresses, it is getting clearer that they are "selling" kpack over the competitors
[16:53:39] xd
[16:53:55] I am generally interested in creating builders, but not sure when I would get time to dig into it. My idea is about making a builder that installs MediaWiki and a user defined list of extensions. Basically a next generation base for .
[16:54:34] bd808: you mean a buildpack? or a builder image (buildpacks + run image + build image)?
[16:55:46] dcaro: yeah, probably a buildpack to add to some other stack. I lost the local vocabulary in the multiple years since I went deep on the tech. :)
[16:56:24] it's quite confusing xd, I'd love to see a buildpack for mediawiki
[17:30:02] * dcaro off
[17:30:04] cya tomorro
[17:30:08] *tomorrow
[18:18:12] * bd808 lunch
[19:14:09] taavi: can I delete any of these? taavi-pm-1.testlabs.eqiad1.wikimedia.cloud, taavi-grafana-2.testlabs.eqiad1.wikimedia.cloud, dbusers-nfs-1.testlabs.eqiad1.wikimedia.cloud
[19:14:15] (looking for easy buster VMs to eliminate
[19:14:16] )
[19:16:19] andrewbogott: I don't know about dbusers-nfs, the rest can be deleted
[19:16:28] great!
[19:27:14] bd808: this is a long shot, but do you have any idea/recollection what clouddb-wikireplicas-query-1.clouddb-services.eqiad1.wikimedia.cloud is?
[19:27:26] It looks to me like it hasn't done anything in years
[19:27:53] (same question to dr0ptp4kt)
[19:31:47] At an appointment, but will respond later. Andre had pinged on a task and I just saw it after going through the email backlog. I think this is the server where I was running queries against the replicas to see query volume by user, but will need to check when at a comp. Will email myself to check.
[19:33:37] thanks Adam
[19:33:52] task is T359810
[19:33:54] T359810: Are clouddb-wikireplicas-query-1 and the cloudb-services project still useful? - https://phabricator.wikimedia.org/T359810
[19:40:00] yeah, this was the server I was using in T345211. I don't need it anymore for that task and am not using it for other tasks. I'm less sure about any others who may have been using it interactively or with a scheduled job or whatever in the near past.
[19:40:01] T345211: Re-create querysampler database ID for Wiki Replicas clouddb databases - https://phabricator.wikimedia.org/T345211
[19:41:04] dr0ptp4kt: I will probably just shut it down and wait to see if anyone appears to turn it back on
[19:41:05] thanks
[19:41:50] thanks andrewbogott !
[19:46:37] andrewbogott: sounds like d.r0ptp4kt gave you what you need. It looks like bstorm created the instance in the way, way long ago to run that query sampler script she dreamed up for T272723.
[19:46:38] T272723: Create a way to sample wikireplicas usage data - https://phabricator.wikimedia.org/T272723
[19:47:31] hmmm that seems potentially useful but I don't think it can have still been working
[19:48:19] https://phabricator.wikimedia.org/T272723#6802154 implies this was all before the most recent Wiki Replicas rebuild.
[19:48:38] yeah
[19:48:47] it would probably take a total rebuild to do anything useful today
[19:49:09] * andrewbogott assumes 'a small amount of love' means 'total rebuild'
[19:49:12] April 2021 was when https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign went live
[19:50:41] As I recall that sampling script was to help Brooke and Joaquin guess how many things were going to be broken by the breaking change of putting each section on its own mariadb instance.
[19:51:05] yeah, that's how I read it too -- research about designing the new setup.
[19:51:22] why does 2021 feel like 100 years ago?
[19:52:04] I feel like 2020 was last week but 2021 was a decade ago
[20:00:17] I think I generally agree andrewbogott, but only February 2020 was recent. March 2020 was much, much earlier somehow
[20:00:38] #LockdownAmnesia
[21:01:44] * andrewbogott wonders if taavi has heard of Saint Urho https://en.wikipedia.org/wiki/Saint_Urho
[21:23:01] Some fun new reading material: https://wikitech.wikimedia.org/wiki/News/Buster_deprecation, https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2024_Purge <- edits and comments welcome
[21:23:22] I had grand plans to better automate the annual purge but I think we just need to get on with it for this year
[21:33:46] https://os-deprecation.toolforge.org/ reminds me that I should find co-maintainers for more projects. :/
[21:45:19] yeah, lots of looming disasters there
[23:47:39] andrewbogott: (not urgent) I was going to try and fix the borked Security Group reported in T360694, but I have found that my OpenStack user apparently isn't able to edit security groups for projects where I'm not an explicit member. Is this a known thing or a bug against config that I should file?
[23:47:40] T360694: Reset default Security Group rules for the openvas Cloud VPS project - https://phabricator.wikimedia.org/T360694
[23:48:58] * bd808 self-unblocks by joining the project
[23:54:06] I recall hitting that at some point, let me see if I filed a task
[23:54:40] T348582
[23:54:40] T348582: Neutron policy does not allow the admin role to modify security groups - https://phabricator.wikimedia.org/T348582
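(For reference on the security group question above, a hedged openstacksdk sketch for inspecting a project's "default" security group rules, the kind of starting point a "reset" per T360694 implies. The cloud name and project id are placeholders, and whether the credentials used may touch another project's groups at all is exactly the policy issue tracked in T348582.)

```python
#!/usr/bin/env python3
"""Sketch: list (and optionally recreate) a project's default security group rules."""
import openstack

conn = openstack.connect(cloud="eqiad1")  # placeholder clouds.yaml entry
PROJECT_ID = "openvas"                    # placeholder project id

for sg in conn.network.security_groups(project_id=PROJECT_ID, name="default"):
    print(f"security group {sg.name} ({sg.id})")
    for rule in conn.network.security_group_rules(security_group_id=sg.id):
        print(f"  {rule.direction} {rule.ethertype} {rule.protocol} "
              f"{rule.remote_ip_prefix or rule.remote_group_id or 'any'}")

    # Example of re-adding the stock allow-all egress rules if they are missing;
    # left commented out since "reset" may mean something more specific here.
    # for ethertype in ("IPv4", "IPv6"):
    #     conn.network.create_security_group_rule(
    #         security_group_id=sg.id, direction="egress", ethertype=ethertype
    #     )
```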