[04:58:52] <_joe_> milimetric: (old jaded SRE opinion) the advantages over debian are purely theoretical (I can install a minimal debian in less than 200 MB to run k8s with minimal security dependencies) while the disadvantages are absolutely real (there is no way this distro has the expertise, skill and resilience of the debian community)
[05:00:05] <_joe_> not to mention all of our tooling is geared towards debian
[05:17:47] <_joe_> oh also the more I read the more I hate this
[05:18:31] <_joe_> "doesn't have a shell" touted as a security feature.
[07:04:45] 🫡
[07:07:04] There is this warning on EMEA saying: "There are no routing keys for this policy - it will only receive incidents when escalated to by another policy." 🤔 It doesn't say it for Americas
[07:29:21] scap is complaining that kubernetes2010 is down. the mgmt interface seems unreachable too
[07:49:16] !incidents
[07:49:17] 4074 (RESOLVED) HaproxyUnavailable cache_text global sre ()
[07:49:48] restbase complaining
[07:50:14] restbase1020 restbase1022
[08:08:47] <_joe_> jynus: there is no traffic flowing there
[08:09:55] ah, true
[09:42:47] people.wikimedia.org has been switched from eqiad to codfw, please use people2003.codfw.wmnet for new uploads. See mail in sre-at-large for more information.
[09:49:47] Sorry to bring this up again, but I hope I could get your attention on some of the ongoing criticals - some look like simple expired downtimes of ongoing maintenance (puppetboard[12]002 & an-worker1086). Could I suggest downtiming known, long-term issues to separate them from new ones? https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dsre
[10:10:04] <_joe_> btullis jbond ^^
[10:51:53] <_joe_> cwhite: I have the patches lined up for T343025, we should have prometheus-statsd-exporter as a sidecar installable in the mediawiki pods with those
[10:51:54] T343025: Identify path forward for k8s deployment of prometheus-statsd-exporter - https://phabricator.wikimedia.org/T343025
[11:42:15] !incidents
[11:52:42] (sirenbot didn't come back for me after the net split/net issues :-()
[11:56:35] btullis: no need to apologize - thank you. If you can help categorize that better at a later time so it only pings your team, I promise I won't annoy you more he he :-D
[11:57:34] sadly I understand some checks are too intertwined to easily separate between teams
[11:58:06] !incidents
[11:58:06] 4076 (RESOLVED) ProbeDown sre (10.2.2.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 eqiad)
[11:58:07] 4074 (RESOLVED) HaproxyUnavailable cache_text global sre ()
[11:58:15] Just needed a kick in the unit
[13:19:57] <_joe_> jbond: didn't we have a tool to make queries to puppetdb?
[13:20:05] <_joe_> I can't find it in my email history
[13:22:02] _joe_: jbond is out sick today
[13:22:12] I've done a bunch of curl by hand on the puppetdb hosts
[13:22:13] <_joe_> oh apologies
[13:22:22] <_joe_> cdanis: yeah I thought we introduced a cli tool
[13:22:25] hmm
[13:22:31] I would love to know about it if we did
[13:23:54] All I can find is the pdb-change tool, and a small PQL python skeleton on wikitech
[13:24:05] But it rings a bell, despite my inability to find it
[13:24:48] some limited things can be done via puppetboard, and ofc the simpler ones via cumin, but I bet you need something more complex
[13:25:26] and yeah AFAIK john suggests using PQL, which is quite powerful, but I can't say intuitive (not that puppetdb was intuitive in any way...)
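
For context on the PQL discussion above: a minimal sketch of what an ad-hoc query against PuppetDB's v4 query API could look like. The localhost:8080 endpoint and the example PQL string are illustrative assumptions, not the CLI tool the log is trying to remember.

```python
#!/usr/bin/env python3
"""Minimal sketch: run an ad-hoc PQL query against PuppetDB's v4 query API."""
import requests

# Assumption: PuppetDB reachable locally on its default port; adjust as needed.
PUPPETDB_URL = "http://localhost:8080/pdb/query/v4"

# Illustrative PQL: certnames of nodes that include a given (placeholder) class.
pql = 'resources[certname] { type = "Class" and title = "Profile::Base" }'

resp = requests.get(PUPPETDB_URL, params={"query": pql}, timeout=10)
resp.raise_for_status()
for row in resp.json():
    print(row["certname"])
```
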
[14:40:19] XioNoX topranks o/ I'm seeing some unexpected changes on the automatic netbox hiera sync during host decom. ok to proceed with this? https://phabricator.wikimedia.org/P52617
[14:43:14] herron: yep those are not prod yet
[14:43:21] XioNoX: ack thanks!
[14:49:30] sorry for the noise Keith, yep they're fine to go, forgot to run the cookbook
[14:55:34] topranks: thanks!
[15:24:39] XioNoX, topranks o/ - if you have a moment - I have been wondering what the purpose of the InterfaceErrors alerts is (see https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DInterfaceErrors)
[15:25:10] every time that I check them I see some spikes in the graphs that end up causing alerts to be visible for a while
[15:25:18] hmm
[15:25:49] elukey: the overall purpose of them is to warn on network interface errors, typically physical errors on the link
[15:25:57] should we get alarmed when there is a sustained amount of errors, or even spikes? (genuinely asking to understand)
[15:26:04] (could be a bad NIC, cable, switch port possibly, but should be an unusual scenario)
[15:26:25] could it also be NIC saturation?
[15:26:28] elukey: my thinking is yes - but that should be very rare, so I need to dig into the metrics and try and see what's happening
[15:27:01] topranks: okok to be clear I am just asking to help with the follow-ups, I am not questioning them :)
[15:27:15] elukey: it shouldn't be, unless we are including "discards" in the count, which are often just "couldn't send the packet, link is busy and buffer full" incidents, not a problem with anything as such
[15:27:29] yeah it's a good question, let me look into how it's set up
[15:28:40] for example, analytics1077
[15:28:43] https://grafana.wikimedia.org/d/000000342/node-exporter-server-metrics?viewPanel=72&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-node=analytics1077:9100&var-disk_device=All&var-net_dev=enp130s0f0&from=now-3h&to=now
[15:29:07] the sharp spike seems to be a pattern for an* nodes afaics
[15:35:04] we should update the doc once we know more: https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[15:36:58] yep good call
[15:37:17] Looking at that node, ethtool seems fine
[15:37:23] https://www.irccloud.com/pastebin/hMSOW5ON/
[15:37:31] these are really the kind of errors we want to catch
[15:37:58] Looking at the stats from 'ip' it reports fifo and overrun errors:
[15:38:06] https://www.irccloud.com/pastebin/ZWgVxmWb/
[15:44:58] rx_fifo_errors is documented here:
[15:44:59] https://www.kernel.org/doc/html/latest/networking/statistics.html?highlight=driver#c.rtnl_link_stats64
[15:45:15] It says "Not recommended for use in drivers for high speed interfaces."
[15:47:15] opened https://phabricator.wikimedia.org/T347312
[15:55:56] thanks folks!
[15:57:41] moritzm: we have questions about ldap, testing, and bookworm.
[15:57:59] 1) How far away are you from getting openldap to install/run on bookworm?
[15:58:46] 2) Right now we have a test ldap server running for our openstack dev cluster; can we interest you in running one of those for us (with our ldap data) on a ganeti host? It might be useful for you for testing anyway...
[17:18:34] I'll import a snapshot of current data for testing, probably tomorrow; the plan is outlined at https://phabricator.wikimedia.org/T331699
[17:18:58] but going forward I don't think we really need a test LDAP server, this only makes things more complicated?
[17:19:18] you'd need periodic re-import etc, otherwise you'll always operate with stale data
[22:07:50] (hours later) moritzm, we have an ldap fork that stores data for codfw1dev; it's largely different from eqiad1 since there are different openstack users, projects, etc. there. So for the moment we need a server serving that data someplace. Right now that server is cloudservices2004-dev but we're trying to upgrade that server to bookworm and discovering that we can't really run ldap on bookworm atm
[22:08:04] you can check in with dhinus if you're curious since he's working on this + in your timezone
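
Relating to the InterfaceErrors thread earlier in the log: a minimal sketch of how the underlying node_exporter counters could be inspected via the Prometheus HTTP API, to tell sustained error rates apart from one-off spikes. The Prometheus URL, the instance label, and the 5-minute rate window are assumptions for illustration; the actual alert rule may use different metrics or thresholds.

```python
#!/usr/bin/env python3
"""Minimal sketch: check recent NIC error rates for one host via the Prometheus HTTP API."""
import requests

# Assumptions: Prometheus URL and instance label are placeholders for illustration.
PROMETHEUS_URL = "http://prometheus.example.org/api/v1/query"
INSTANCE = "analytics1077:9100"

# node_exporter counters for receive errors and fifo/overrun events, as a 5m rate.
query = (
    f'rate(node_network_receive_errs_total{{instance="{INSTANCE}"}}[5m]) '
    f'+ rate(node_network_receive_fifo_total{{instance="{INSTANCE}"}}[5m])'
)

resp = requests.get(PROMETHEUS_URL, params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    device = series["metric"].get("device", "?")
    rate = float(series["value"][1])
    if rate > 0:
        print(f"{device}: {rate:.3f} errors/s over the last 5m")
```
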