[03:43:25] PHP's DOM library, as used by Parsoid, is looking at major changes in the next release. The RFC is being drafted and refined on internals-l, and will probably be open to a vote soon. https://wiki.php.net/rfc/domdocument_html5_parser
[08:23:08] https://www.irccloud.com/pastebin/Q2KFQn9P/
[08:23:31] maybe I missed some update, but shouldn't confctl (see output in the link above) announce that kind of action on IRC?
[08:24:17] * _joe_ hands a pair of glasses to vgutierrez
[08:24:27] <_joe_> the problem must be tcpircbot
[08:27:09] mmm maybe I didn't pay attention, but when I used confctl in the past I don't remember any announcement
[08:27:35] wow.. that means that it's been broken for the last.... 4 months? :)
[08:29:10] vgutierrez, fabfur: we do have logs from yesterday: https://sal.toolforge.org/production?p=0&q=%22conftool+action%22&d=
[08:29:16] sorry, 2 days ago
[08:29:23] volans: ok, that's fabfur being fabfur
[08:29:57] that's why I begin my sentences with "maybe I didn't pay attention..."
[08:30:00] :D
[08:30:06] last one from puppetmaster1001 seems to be aug. 7th
[08:32:05] strange, if I search for my username on sal.toolforge.org I find messages like `depooled cp1090.eqiad.wmnet to test new purged package version (T346874)`, but I should also find `confctl` messages
[08:32:06] T346874: Allow purged to specify buffer length - https://phabricator.wikimedia.org/T346874
[13:52:46] @urandom @inflatador i am still having issues with lists1004 failing for the root file system. this is a SW raid install of debian, any idea? https://usercontent.irccloud-cdn.com/file/EX4kPRIV/Screenshot%202023-09-27%20at%209.51.09%20AM.png
[13:55:39] jclark-ctr: I don't, no. :/
[14:03:15] jclark-ctr we removed my configs yesterday, so that shouldn't be affecting it
[14:03:30] FYI, I'm about to repool eqiad (services first, traffic afterwards)
[14:04:36] you can see the status via `sudo cookbook -d sre.discovery.datacenter status all`
[14:09:33] <_joe_> api and appservers repooled
[14:09:57] <_joe_> oncall people: eqiad is back to being active
[14:10:08] <_joe_> as in, appservers are getting read-only traffic
[14:10:46] <_joe_> latency isn't great there
[14:10:56] Nope, they didn't like it
[14:11:13] <_joe_> claime: it should come back in a few mins
[14:11:16] yep
[14:11:18] I'm not worried
[14:13:17] yeah, I'll start panicking only if it doesn't get better :D
[14:13:36] <_joe_> uhm
[14:13:42] <_joe_> it's not looking great tbh
[14:13:52] <_joe_> not worrisome
[14:13:59] It's going back down but tapering
[14:14:33] <_joe_> ahhh right
[14:14:41] <_joe_> things like sessionstore are still in codf
[14:14:44] <_joe_> *codfw
[14:14:46] yeah
[14:14:51] also mw-on-k8s
[14:15:38] Well mw-web-ro is back on already, but not the rest
[14:16:16] <_joe_> that wouldn't influence latency though
[14:16:47] Nope, but I wonder if it's as bad
[14:17:31] it isn't as bad, but it's benefitting from bare metal starting to warm the cache, I guess
[14:17:35] <_joe_> latency seems back to norm already for api
[14:17:57] appservers are still ~300ms
[14:18:00] <_joe_> nah, still 1.5x the norm but ok
[14:18:25] <_joe_> claime: yeah let's see when the switchover is done
[14:18:30] yep
[14:18:49] Maybe we should order the cookbook so it repools sessionstore first
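The ordering idea floated at 14:18:49 — bring session and cache backends back before the appserver tiers — can be pictured with a short sketch. This is purely illustrative: the service list and the `repool()`/`wait_for_dns_ttl()` helpers below are made-up stubs, not the actual sre.discovery.datacenter cookbook API.

```python
# Illustrative sketch only: repool state-heavy backends (e.g. sessionstore)
# before the appserver tiers, so the latter don't take read traffic against
# dependencies that are still pooled only in the other DC.
# repool() and wait_for_dns_ttl() are stand-in stubs, not real tooling.
import time

REPOOL_ORDER = [
    "sessionstore",  # session storage first, per the suggestion above
    "mw-web-ro",     # read-only mw-on-k8s tier
    "appserver",     # bare-metal appservers
    "api",           # API cluster
]

def repool(service: str, dc: str) -> None:
    print(f"pooling {service} in {dc}")  # stub: real tooling would act here

def wait_for_dns_ttl(seconds: int = 10) -> None:
    time.sleep(seconds)  # stub: wait out the discovery record's TTL

def repool_datacenter(dc: str = "eqiad") -> None:
    for service in REPOOL_ORDER:
        repool(service, dc)
        wait_for_dns_ttl()

if __name__ == "__main__":
    repool_datacenter()
```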
[14:19:10] search isn't barfing! Yay
[14:19:40] sessionstore pooling in progress
[14:20:20] if latency magically goes down, I'll add a "reorder cookbook" to my infinite TODO list :D
[14:21:26] <_joe_> kamila_: you have to wait out the TTL for that to be fully effective, as envoy IIRC waits for the TTL before re-resolving in STRICT_DNS mode
[14:21:42] ack, thanks _joe_
[14:21:46] <_joe_> Amir1, marostegui how are the dbs in eqiad doing?
[14:21:59] so far they look okay
[14:22:07] dbstore is down for an hour
[14:22:14] unrelated I think
[14:22:21] <_joe_> then I don't get the latencies rn
[14:22:46] <_joe_> ah I think I do
[14:22:48] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&var-site=eqiad&var-cluster=appserver&var-method=GET&var-code=200&var-php_version=All&from=now-30m&to=now&viewPanel=27
[14:23:12] memcache cold?
[14:23:13] <_joe_> when that goes down a bit further, we're at norm
[14:23:18] <_joe_> claime: that too yes
[14:23:29] _joe_: after the initial spike, they are coping fine now
[14:23:36] they shouldn't be that cold after a week
[14:23:47] <_joe_> marostegui: you mean the dbs?
[14:23:53] <_joe_> yeah that's what we hoped for :)
[14:24:07] <_joe_> ok so, let's re-check in 30 minutes kamila_ claime
[14:24:20] <_joe_> if things haven't settled, it might make sense to look at flamegraphs
[14:24:26] yeah
[14:24:45] ack
[14:25:17] That dbstore downtime was me. It was unrelated, as Amir1 said.
[14:25:28] <_joe_> latency is pretty bad on appservers even in codfw
[14:25:36] <_joe_> compared to my memory
[14:25:39] <_joe_> so it's about 150 ms
[14:25:56] <_joe_> so we're perfectly in the realm of not alarming
[14:28:01] still going down though, I don't feel like panicking any more than usual
[14:28:54] eh crap, did it stop going down just because I said that? '^^
[14:33:42] it's not that bad :D
[14:35:00] I might have messed up all innodb caches by running that massive schema change last week :P
[14:35:25] x)
[14:37:40] <_joe_> so latency in codfw before the switchover was 190 ms
[14:37:44] <_joe_> mean latency
[14:37:50] we're still at 50% over baseline memcache adds, latency stable around 250
[14:37:55] <_joe_> we're at 240 and going down slowly?
[14:38:36] we're not going down anymore, at least not on appservers
[14:38:41] It's tapered off
[14:39:19] <_joe_> let's look at flamegraphs in 1 hour
[14:39:44] I'll be in an interview then, but y'all don't need me for that :p
[14:40:06] <_joe_> claime: me too heh
[14:40:25] <_joe_> Amir1: can you show kamila the flamegraphs in 1 hour if things haven't recovered?
[14:40:25] okay then I'll just look at graphs and scream, since that's all I know to do :D
[14:40:52] <_joe_> no need to scream, the situation is under control, generally speaking
[14:41:17] kamila_: maybe just a subdued, inner scream, then :)
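On _joe_'s point at 14:21:26 that envoy in STRICT_DNS mode waits out the DNS TTL before re-resolving: a quick way to see how long that window is would be to read the TTL on the discovery record itself. A minimal sketch, assuming dnspython (≥ 2.0) is available and assuming `sessionstore.discovery.wmnet` is the record of interest:

```python
# Minimal sketch: print the current answer and TTL for a discovery record,
# as a rough upper bound on how long an envoy in STRICT_DNS mode may keep
# acting on the old answer after a repool. Record name is an assumption.
import dns.resolver

def discovery_ttl(name: str = "sessionstore.discovery.wmnet") -> int:
    answer = dns.resolver.resolve(name, "A")
    for rdata in answer:
        print(f"{name} -> {rdata.address}")
    return answer.rrset.ttl  # seconds until resolvers refresh this record

if __name__ == "__main__":
    print(f"TTL: {discovery_ttl()}s")
```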
[14:41:36] <_joe_> urandom: are we replicating parsercache btw?
[14:41:43] <_joe_> cross-dc, I mean
[14:41:55] yes
[14:42:10] assuming you mean via restbase/cassandra
[14:42:24] that's kind the only mode there
[14:42:25] I thought you were asking if internally screaming was replicating parsercache. Seemed fitting.
[14:42:29] kind of, even
[14:42:33] <_joe_> no, I meant the databases mediawiki uses
[14:42:44] oh, that parsercache
[14:43:20] <_joe_> ok something is strange
[14:43:23] that I don't know... marostegui ?
[14:43:27] <_joe_> mw on k8s has a much lower latency
[14:43:29] <_joe_> in eqiad
[14:44:34] yes, I think I pointed it out earlier, or maybe I just thought about it and didn't
[14:44:42] <_joe_> p75 on mw-web is 209 ms
[14:46:25] (that's not the benthos metrics, so it's probably not because the metrics are wrong XD)
[14:46:26] _joe_: damn
[14:46:36] ah no
[14:46:43] yes we are, from codfw to eqiad
[14:46:48] you scared me for a sec
[14:46:48] <_joe_> ok
[14:46:53] <_joe_> why?
[14:46:58] <_joe_> I was asking :P
[14:47:46] cause for some reason my mind went to default = eqiad
[14:49:20] _joe_: mcrouter latency is much better for mw-on-k8s
[14:49:25] As to why that is...
[14:50:57] <_joe_> ok, interview prep time for me
[14:51:11] same
[15:03:09] sorry, I was deploying something, looks like the repool went ok, anything weird or pending?
[15:31:13] jynus: just somewhat increased appserver latency
[15:32:05] deploying stuff wasn't an issue (and also it's my bad, I didn't add the repool to the deployment calendar)
[15:39:59] Anyone in the middle of a puppet-merge?
[15:40:44] stevemunene: puppet-merge should tell you the user if it conflicts
[15:41:29] robh in this case
[15:41:43] sorry
[15:41:49] getting a `E: failed to lock, another puppet-merge running on this host?`
[15:41:52] doggo barked and i went afk mid merge like a newb
[15:41:55] im out
[15:42:04] stevemunene: see the other line, sshd(robh)
[15:42:04] (so should be unlocked now)
[15:42:20] robh: no worries, was trying to show how to debug it
[15:43:10] Ack, robh had you merged your changes? I can still see them
[15:43:21] i didnt cuz i let it sit so long
[15:43:26] so feel free to merge my change with yours
[15:43:31] its very small, single stanza for some new servers
[15:43:38] Great, doing so rn
[15:44:22] mail delivery showed up so i ran from keyboard due to doggo barks
[15:44:23] heh
[15:51:25] latency is still a tad high, but not terrible, so I'm proceeding with repooling eqiad for traffic
[15:52:47] Amir1: which flame graphs, if you have a moment?
[15:58:55] brett: I have a conflict with your patch in puppetmaster1001, can I merge it?
[15:59:03] dhinus: Looks like "Add new cloud restricted bastion" is yet to be merged on puppet master. Is it okay to merge it?
[15:59:09] hahaha
[15:59:12] love it
[15:59:12] yes, go ahead :D
[15:59:54] you've got the lock, so you go ahead
[16:01:35] dhinus: ^
[16:05:12] sorry I'm in a call, doing it now
[16:05:17] many thanks
[16:06:25] done, sorry for the wait!
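The `E: failed to lock, another puppet-merge running on this host?` message seen at 15:41:49 is the classic non-blocking advisory-lock pattern. A generic illustration of that pattern follows — this is not puppet-merge's actual implementation, and the lock path is made up:

```python
# Generic illustration of a host-local advisory lock that fails fast,
# producing errors of the "failed to lock, another X running?" variety.
import fcntl
import sys

LOCK_PATH = "/tmp/example-merge.lock"  # hypothetical path

def main() -> None:
    lock_file = open(LOCK_PATH, "w")
    try:
        # Non-blocking exclusive lock: raises immediately if another
        # process on this host already holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("E: failed to lock, another merge running on this host?")
    try:
        print("lock acquired, merging...")  # real work would happen here
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()

if __name__ == "__main__":
    main()
```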
[16:42:17] gah, there it is again: `error: symbol 'grub_file_filters' not found.`, and I'm dropped to a grub rescue on the first boot after a reimage 😕
[17:11:06] OK...that rules out my netboot changes
[17:28:57] yeah, the more likely cause is that I am cursed. 💀
[17:29:37] urandom: I ran into that some time ago, maybe this helps? https://phabricator.wikimedia.org/rOPUP49b09ee63a0524b9b47008c1be2401c020859edb
[17:31:30] herron: if I retry the install (as many as 3x), it succeeds...
[17:32:44] I guess this is a misconfiguration?
[17:32:48] ahhh, clearly we need a sre.hosts.reimage3times
[17:32:56] (your link, I meant...)
[17:33:35] yeah, or add a "burn sage", or "light black candle" step
[17:33:57] yeah, iirc when I ran into the issue it was because the disk layout was trying to reuse some partitions and grub didn't get reinstalled correctly, something like that. so adding the explicit line solved it
[17:35:39] huh, this is a partman config that reuses...
[17:36:08] how many disks on the host(s)?
[17:37:10] 3 disks
[17:40:42] maybe 'd-i grub-installer/bootdev string /dev/sda /dev/sdb /dev/sdc' would do the trick
[17:42:23] I'm wondering why it works most of the time, but not others
[17:43:36] herron: is that what you experienced?
[17:44:12] hmm, I don't remember it ever working in my case, but also I think I only tried once
[17:45:42] I do remember that grub was completely broken though, couldn't even manually boot the host
[17:55:04] sorry, I was afk
[17:55:12] kamila_: is it resolved now?
[17:56:00] https://performance.wikimedia.org/php-profiling/
[17:57:39] you can compare https://performance.wikimedia.org/arclamp/svgs/hourly/2023-09-27_16.excimer-wall.all.reversed.svgz and https://performance.wikimedia.org/arclamp/svgs/hourly/2023-09-27_12.excimer-wall.all.reversed.svgz
[17:57:54] (today 12:00 UTC vs 16:00 UTC)
[17:58:21] Amir1: we decided it's probably fine, but I'll have a look, thank you!
[17:58:30] memcached is taking longer to respond, jumping from 5% to 7% of response time
[17:58:41] but dbs haven't changed, 12%
[17:59:20] sorry, I wasn't feeling well, rested a bit
[18:16:11] No worries, it's not bad enough to be urgent
[18:16:19] Thanks a lot for the pointers!
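The kind of comparison Amir1 walks through above (memcached going from ~5% to ~7% of response time while the DBs stay at 12%) can be pulled out of collapsed/folded stack captures with a few lines of Python. A rough sketch, assuming input files in the usual `frame;frame;frame COUNT` folded format — the file names and frame substring are illustrative, and this is not an official arclamp tool:

```python
# Rough sketch: given two folded-stack captures, report what share of
# wall-time samples touch a given component (e.g. a memcached frame).
import sys

def component_share(folded_path: str, needle: str) -> float:
    total = 0
    matched = 0
    with open(folded_path) as fh:
        for line in fh:
            stack, _, count = line.rpartition(" ")
            if not stack:
                continue  # skip lines that aren't "stack count"
            samples = int(count)
            total += samples
            if needle in stack:
                matched += samples
    return 100.0 * matched / total if total else 0.0

if __name__ == "__main__":
    # e.g.: python3 component_share.py hour12.folded hour16.folded Memcached
    before, after, needle = sys.argv[1:4]
    print(f"{needle}: {component_share(before, needle):.1f}% -> "
          f"{component_share(after, needle):.1f}% of wall time")
```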