[00:25:16] The "Fatal error page" impression rate at https://grafana-rw.wikimedia.org/d/000000438/mediawiki-exceptions-alerts is suspiciously identical to the total logged excepetions in Logstash. This isn't normally the case since a lot of logged exceptions are caught errors, post-send errors, CLI, or other errors not resulting in a Fatal error web page being shown to a user. [00:25:35] Looks like it is plotting the... Logstash doc count for logged exeptions. I.e. the same as the first panel on that row. [00:26:03] The php-wmerror metric was ported to dogstatsd/prometheus in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017078 [00:26:22] but... trying to plot mediawiki_fatal_errors_total yields nothing [00:26:36] which makes sense I suppose since T356814 is still unresolved. [00:26:37] T356814: Migrate MediaWiki.errors.fatal to statslib - https://phabricator.wikimedia.org/T356814 [00:36:24] * Krinkle comments there [07:15:56] inflatador: re: preseed/partman testing, if it was for work please give https://gitlab.wikimedia.org/repos/sre/preseed-test a try and let me know, if you haven't tried already [09:12:26] I'm stuck trying to reimage sretest2010 (changing it from sretest to swift::storage). The reimage cookbook explodes early doing the puppet version check, because sudo puppet lookup --render-as s --compile --node sretest2010.codfw.wmnet profile::puppet::agent::force_puppet7 fails [09:12:50] That's failing because there aren't the right partitions for swift::storage (which I expect, that's why I want to reimage it). [09:13:43] I can't avoid the check by passing -p 7 because it's not a new host [09:16:29] Any suggestions? I can't see how to get it into a clean state so I can run the reimage with --new (running the decommission cookbook is overkill and I think would leave it needing more recommissionining again) [09:29:29] OK, the answer was to choose violence [09:29:56] (log onto system, blythly repartition /dev/sda and /dev/sdb under the feet of the OS to add the extra partitions) [09:32:55] I would have cleaned it from puppetdb and use --new :D [09:33:36] volans: how does one "clean it from puppetdb"? [09:35:23] the other option was also potentially decom(with --keep-mgmt-dns)+reimage [09:35:38] Emperor: I don't think we have a quick cookbook for it but via spicerack-shell it's basically a single API call [09:49:53] Isn't removing the host from puppetdb basically puppet node clean puppet node deactivate? [10:01:09] or https://doc.wikimedia.org/spicerack/master/api/spicerack.puppet.html#spicerack.puppet.PuppetServer.delete :D [10:13:08] spicerack delete> TIL, thank you :) [10:13:25] If you'd like some cursed knowledge in return, I offer https://github.com/jloughry/BANCStar [10:14:08] lol [10:31:07] vgutierrez: https://phabricator.wikimedia.org/T390813#11161877 that's interesting, shouldn't it be tuned to tries to send traffic to the working nodes in case of a rack failure ? 
[10:31:07] vgutierrez: https://phabricator.wikimedia.org/T390813#11161877 that's interesting, shouldn't it be tuned to try to send traffic to the working nodes in case of a rack failure?
[10:31:58] then we should relax the depool threshold to 0.5 on the PoPs
[10:32:40] just a thought, maybe there are other implications
[10:34:21] we work under the assumption that we can only afford losing 2 cache servers per cluster
[10:35:36] without knowing all the trade-offs, relaxing to 0.5 at pops would seem to make sense
[10:36:13] even if a switch failure at a pop means we have to depool, in the time between the failure and the depool being done / taking effect we should try to serve requests with the available resources in the remaining rack
[10:37:04] that could save some users if the outage happens during an off-peak period
[10:42:49] we have certain limitations at the moment
[10:43:04] for starters service.yaml doesn't allow setting a depool_threshold per site
[10:43:33] I'll bring this to the traffic meeting and I'll report back
[12:38:30] I can't seem to reach wikidough in esams over IPv6.. dumped some debugging information here: https://phabricator.wikimedia.org/P82942
[12:41:32] sukhe, XioNoX: ^^ could be related to the recent change to routed ganeti?
[12:55:41] looking
[12:57:52] looking as well
[12:59:40] (you can append +nsid to kdig to get the specific doh* host in question)
[13:00:11] XioNoX: seems like we are advertising the v6 via bird at least, leaving the other checks to you :)
[13:00:17] sukhe: the issue seems to be on ganeti3005/doh3006
[13:00:35] ok, so that's what taavi is probably hitting then
[13:00:36] pings make it to ganeti3005, but not further to the host, not sure why yet
[13:07:03] XioNoX: you sure they make it to the ganeti host?
[13:07:26] topranks: yeah, something is not right; the host is importing the prefix, forwarding it to the switch, but not installing it in its routing table, and it looks like it does the same for v4 too...
[13:08:21] https://phabricator.wikimedia.org/P82955
[13:08:28] yeah I don't see my pings hitting the host
[13:09:23] weird, I see mine
[13:09:56] sorry, the _host_ is not importing it to its table? ok
[13:10:10] ganeti3005 I think, yeah
[13:11:07] these lines are only exporting static routes to the kernel from the Bird RIB?
[13:11:11] https://www.irccloud.com/pastebin/UTt8yzAs/
[13:11:36] yeah, was going to say the same, I think we need to add `186` (from /etc/iproute2/rt_protos)
[13:12:50] though actually.... that's just "import" there
[13:13:03] I think we need "export", and it won't be from a kernel rt_proto, but rather bird itself
[13:14:44] topranks: yep, fixed it manually and it's now pinging
[13:15:10] still not pinging for me, but I suspect what's happening is I'm hitting another instance (and that explains it "missing" in tcpdump)
[13:16:39] whenever in doubt, kdig/dig +nsid :]
[13:17:06] topranks: not that you don't know this already but as a reminder since you are deep debugging
[13:17:18] whenever in doubt, ping/tcpdump :)
[13:17:35] well ICMP might go a different way than UDP 53 etc
[13:17:48] when I'm in doubt I just leave it to you guys who know what you're doing :P
[13:18:08] topranks: that's what I do (to you both)
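A quick sketch of how one might check the symptom described above on the hypervisor (BIRD has the prefix but the kernel table doesn't); the address is the wikidough anycast IP mentioned later, and the kernel-protocol instance name is a guess, not taken from the actual config:

    # On the ganeti hypervisor; protocol instance name "kernel1" is illustrative.
    sudo birdc show route for 2001:67c:930::1    # is the VM's prefix in BIRD's routing table?
    sudo birdc show route export kernel1         # which routes does BIRD export to the kernel protocol?
    ip -6 route get 2001:67c:930::1              # does the kernel actually have a route installed for it?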
[13:18:19] em but jokes aside yeah that should work, I think maybe we can have: export ~"vm_v6"
[13:18:32] so this is just ganeti3005/doh3006?
[13:18:34] or something similar to that, wildcard matching the naming convention of the "protocol" for the v6 peers
[13:18:41] sukhe: nah it's all :/
[13:18:45] ah
[13:18:47] in esams
[13:18:53] I was like why would that be any different
[13:18:54] ok
[13:19:21] we moved to routed ganeti in esams
[13:19:31] completely different routing setup
[13:19:38] yeah
[13:19:50] but I was asking if doh3005 is also broken or not and what makes it different to doh3006
[13:20:22] should be the same for all of them
[13:20:37] though it's a per-hypervisor fix, i.e. if it's fixed on ganeti3005 it will affect all the VMs on it
[13:21:02] godog thanks for the link. It wasn't for work but it should be handy regardless ;)
[13:21:03] topranks: we already do some light filtering on what we import from BGP peers, so I think it's fine to export all to kernel
[13:21:11] topranks: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1186501
[13:21:16] eh...
[13:21:28] you'll be trying to re-export the routes you "import" from the kernel already
[13:21:34] I think we should be more specific personally
[13:22:12] +1 on the patch but ideally we'd tighten it up to only export routes learnt from BGP
[13:22:51] it's probably harmless to re-export the imported ones but still I tend to prefer to keep that kind of thing as clean as possible
[13:23:13] yeah agreed, looking at how to do it like that
[13:23:35] inflatador: sure no problem! it is tailored to work with our partman/preseed settings out of the box, though it will work for any testing with minor adjustments
[13:23:35] also things are currently broken
[13:26:41] topranks: updated the CR, and manually tested on ganeti3005 for v6
[13:30:36] ah very nice!
[13:30:45] I'm just confusing myself with the bird docs here
[13:31:11] yep that's definitely the way to do it
[13:31:30] yeah their docs are not great for all the filtering
[13:31:43] not sure if you can do a regex on the "protocol" name, there probably is some way but source = RTS_BGP is the best way
[13:31:53] I'll disable puppet on all the routed ganeti hosts and carefully roll out the patch
[13:31:56] they are good and comprehensive but not very user friendly, especially in a rush
[13:31:57] ok
[13:31:58] +1
[13:37:11] follow-up question: why did no alerts catch this?
[13:38:22] yeah that is a good question. from a high level we are probably just monitoring things like the BGP state, but not the end-to-end comms or service?
[13:38:37] if we are monitoring the service perhaps we are monitoring the anycast IP from say eqiad and thus it was unaffected?
[13:38:38] the other issue is that it's an anycast VIP
[13:38:42] yeah
[13:38:59] this particular issue won't ever happen again, so bgp status is probably still fine to be checking
[13:39:00] and monitoring is too far away
[13:39:24] https://phabricator.wikimedia.org/T311618 :)
[13:39:26] but perhaps we should think of a way to monitor the service itself to catch weird unpredictable things like this, in a way that works best with the anycast setup
[13:39:28] if anyone wants to work on it
[13:45:22] yeah...
[13:46:21] we have done a few things to improve our bird setup but the blackbox exporter is important for sure and has been on the backburner for a while
[13:46:29] I will see how to triage it and possibly take it on
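A hedged example of the kind of manual end-to-end check discussed above, using the +nsid tip from earlier so the answering doh* backend identifies itself; the target name and query are illustrative:

    # Query the anycast service over DoT and request the NSID in the response EDNS section
    # (dig also supports +nsid for plain Do53 services).
    kdig +tls +nsid @wikimedia-dns.org en.wikipedia.org A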
[13:47:14] sukhe: you're a manager, you should delegate :)
[13:47:29] I'm now able to ping :)
[13:47:37] https://www.irccloud.com/pastebin/HwugZk1O/
[13:47:48] XioNoX: once we have enough people in the team :]
[13:47:53] why no reverse dns! :sadface:
[13:47:56] topranks: looks like you were going through ganeti3006 then
[13:48:08] sukhe: you can think again about stealing any more of _our_ people!!
[13:48:23] !
[13:48:36] XioNoX: yes, likely, the last hop I was seeing was asw1-bw27
[13:48:42] topranks: be careful, the v6 reverse DNS might be a rabbit hole :)
[13:49:01] he lives for that. it's like dopamine for him.
[13:49:04] ah it's ok, I climbed down that one years ago
[13:49:11] getting back out is my problem!
[13:49:23] yep, works for me again as well
[13:49:29] thanks for reporting, taavi.
[13:49:45] cool, yeah thank you all!
[13:49:53] XioNoX: topranks: I guess my main concern is not the Wikidough hosts so much but the ns2 anycast
[13:50:22] because well, that's more critical. so yeah, let me discuss with the team and get the blackbox exporter to fit in somehow in Q3 or something
[13:50:37] sukhe: yeah, blackbox would be the closest to what can have a user impact
[13:51:05] sukhe: o11y also did a lot of tooling around it, so it might not be that complex to implement
[13:51:09] sukhe: indeed yeah
[13:51:40] and thanks for fixing! (unfortunately the way I found out about this was that it broke my mail server :/)
[13:51:51] ns2 is not a VM so not affected by this, but either way the same sort of simple error could be made with some future change
[13:52:08] and definitely the impact would be much more important, so getting those blackbox checks in place would be good
[14:00:57] Fwiw the reverse DNS is set up correctly for the wikidough IPv6 IPs on our authdns
[14:01:11] the delegation for the range is missing in RIPE, I'm adding it there now
[14:03:11] taavi: thanks for the report, at least you spotted it!
[14:03:44] Reverse dns working now
[14:03:48] cathal@officepc:~$ dig +short -x 2001:67c:930::1 @1.1.1.1
[14:03:48] wikimedia-dns.org.
[14:41:14] nice!
[15:37:26] btullis: are you the one to thank for providing ceph packages on wm repos? e.g. https://apt.wikimedia.org/wikimedia/dists/bookworm-wikimedia/thirdparty/ceph-reef/
[15:43:14] andrewbogott: no, they're my doing
[15:43:25] well then, thank you :)
[15:43:40] (I use them to build the container images to deploy with cephadm)
[15:43:46] How hard would it be to get reef packages in the equivalent spot for Trixie? I have a bunch of reimages ahead of me and it would be nice to skip a level
[15:45:39] andrewbogott: I could be wrong, but I'm not sure upstream has published reef packages for trixie (yet?)...
[15:45:56] stock debian trixie ships with reef, but maybe that's a different build?
[15:47:08] yes, Debian does its own builds.
[15:48:52] hmmm
[15:49:04] where should I look to see what the upstream has provided?
[15:52:07] ah, that would be https://download.ceph.com/debian-reef/dists/
[15:52:14] you're right, only bookworm
[15:52:14] hm
[15:52:31] * andrewbogott wonders about going from ceph packages to debian packages and then back to ceph
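One way to answer the "where should I look" question above from the shell; the suite listing URL is the one mentioned in the discussion, and rmadison is just one convenient (hypothetical here) way to see what Debian itself ships:

    # List the suites upstream publishes for reef (per the URL above):
    curl -s https://download.ceph.com/debian-reef/dists/ | grep -oE 'href="[^"]+/"'
    # Compare with what Debian's own archive carries for trixie (rmadison is in the devscripts package):
    rmadison -s trixie ceph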
[19:19:01] Has anyone installed Trixie lately? It's telling me 'No kernel modules were found,' I imagine that means we need a fresh install image
[19:20:00] weird. I haven't tried it yet myself
[19:20:13] andrewbogott: I have 2 hosts with trixie in prod
[19:20:52] people1005.eqiad.wmnet if you wanted to compare something
[19:20:56] yeah, there was just a point release though, maybe the kernel package that our installer wants isn't there anymore
[19:21:03] oh, that could be it, yea
[19:21:11] That's a mor.itz question right?
[19:21:15] I think so, yea
[19:22:40] ok
[19:28:52] andrewbogott: if you need bacula backups on the host you are upgrading, you might want to hold back anyways
[19:29:30] there's no state of importance on it, but I'm stuck anyway until Trixie works
[19:30:13] ack
[20:27:51] in trixie, the docker binary was moved from the docker.io package to the docker-cli package. docker.io recommends docker-cli, but we are setting apt.conf.d/00InstallRecommends: APT::Install-Recommends "false"; globally, so we are not getting any recommends. result: you get docker.io installed as always.. no errors, but then surprise: "docker: command not found".
[20:28:37] #debian has the opinion that disabling recommends globally creates traps like this
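A small sketch of how to cope with that on a host where Recommends are globally disabled; the package names are as described above, the rest is a generic illustration:

    # See what docker.io only Recommends (and therefore won't pull in here):
    apt-cache show docker.io | grep -i '^Recommends:'
    # Either install the CLI explicitly...
    sudo apt-get install docker.io docker-cli
    # ...or honour Recommends for this one install despite the global default:
    sudo apt-get install --install-recommends docker.io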