[10:58:35] legoktm: fyi, I recently came across https://github.com/davidcole1340/ext-php-rs and thought you might be interested ;)
[11:42:45] <_joe_> mszabo: don't enable him
[11:43:11] <_joe_> requirements: PHP 8.0 or later
[11:43:23] * _joe_ begins breathing again
[11:44:32] :D
[13:54:53] "Every service at the Wikimedia Foundation uses the same reporting period: three calendar months, phased one month earlier than the fiscal quarter." /o\
[13:54:57] (for SLOs)
[14:24:31] that's not confusing at all
[14:25:42] yeah it is confusing :) But it makes sense. If the analysis of an SLO reporting period is intended to impact near-term work prioritization, then we have to finish up reporting a little before EOQ to have time to alter the course of the next Q's planning.
[15:26:37] godog: I'm trying to pin down what the impact was of https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-29_graphite Based on what I'm seeing in the graphs, it appears that for some metrics all of the past 5y of data prior to Oct 12 or Oct 22 is lost, but I don't see anything in the comments or incident report stating that, and it's quite a big claim for me to make, so I'm looking for a
[15:26:38] confirmation/clarification from you :)
[15:30:03] Krinkle: go.dog is on vacation for another couple of weeks, so an email is probably more appropriate.
[15:35:41] kormat: ack, sorry, I realize he wrote that in the last comment on phab. I failed to look there. thanks.
[15:35:55] 👍
[16:01:22] anyone know if something happened to the switches between ~14:10 and ~15:30 CET? (https://grafana-rw.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?orgId=1&from=1636030861188&to=1636041661189&viewPanel=92)
[16:01:58] there was some unstability and an increase of traffic (https://grafana-rw.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?orgId=1&from=1636030902046&to=1636041702046&viewPanel=111)
[16:02:08] *instability?
[16:02:23] XioNoX: might be related to the network maintenance in -dcops? ^^^
[16:03:30] maintenance was impactless and in codfw only
[16:03:56] I think they didn't even start physical stuff until 14:45 anyways
[16:04:01] so yeah might not be related?
[16:04:32] times are slightly off yep
[16:04:57] wait, it's UTC, not CET
[16:05:02] oh
[16:05:11] what's the UTC window you're asking about?
[16:05:24] so there's a 2h shift there, 16:10 -> 17:30
[16:05:40] it's 16:05 UTC now :)
[16:05:50] so there's a 2h shift there, 16:10 -> 17:30 CET
[16:05:53] xd
[16:06:07] there's this suspicious message at 16:10 CET ( cr1 is not happy about something have a big amber light)
[16:06:07] I think CET is only 1h off UTC at present
[16:06:39] the graphs above should tell (my laptop is now saying 17:06, Paris time)
[16:06:40] it's definitely not off by 1:20 :)
[16:08:28] so 14:10->15:30 UTC == 15:10 -> 16:30 CET, sorry for the confusion
[16:09:04] Berlin time is CET, 17:07. UTC+1, just one hour off since Nov 1st, before that it was CEST and 2 hours off
[16:09:48] is this known?
[16:09:51] aborrero@cumin1001:~ 1m1s 99 $ sudo cumin cloudcephosd102* "run-puppet-agent"
[16:09:51] Caught HTTPError exception: 504 Server Error: Gateway Time-out for url: https://puppetdb1002.eqiad.wmnet:443/pdb/query/v4/nodes
[16:10:20] that might be jbond ?
[16:10:26] jbond: ^
[16:10:48] why did CET ever come into this conversation? does one of our logs/reports/something show CET?
[16:11:06] my laptop and my irc do
[16:11:19] (when checking the logs for -dcops)
[16:11:25] ah ok
[16:11:26] arturo: not sure what happened, seems puppetdb died; i have restarted it now and it seems to be working again
[16:11:53] jbond: thanks! yes, works now
[16:12:09] cool
[16:13:34] XioNoX: so are you sure that the maintenance should not have impacted eqiad at all? (we added a ceph node right at that time, and my suspicion is that the flaky network + extra high usage ended up in an outage for us; we want to try again now, the network seems stabilized too, but just wondering)
[16:14:28] dcaro: yep
[16:14:57] okok, then the network might misbehave again with the test xd
[16:20:40] dcaro: please don't break the network ;)
[16:22:20] XioNoX: the test started, we saw some pings (icmp packets) getting lost, seems stable now, doing 914MB/s on a 10Gb link
[16:22:44] dcaro: within the cloudsw switches?
[16:23:15] I want to make sure you don't saturate interfaces on asw2-b
[16:23:19] yep, though the traffic spreads through b2/b7 also (there are some ceph nodes there); the new node is on the cloudsw switches
[16:23:31] (and most of them also)
[16:27:54] dcaro: fyi, link usage between row B and cloud is fine
[16:29:27] mszabo: I can neither confirm nor deny I have a WIP Rust port of one of our PHP extensions sitting on my laptop ;-)
[16:29:49] ;D so you found this already I see
[16:32:04] XioNoX: ack, thanks
[16:34:31] <_joe_> legoktm: php 8.0 only
[16:34:54] * _joe_ waves the cane towards the young hipsters
[16:36:33] I was thinking it would make more sense to just have it as an external service. I think if someone today proposed "diff generation is slow, I want to use a faster language" you'd recommend they build a service, not a PHP extension.
[16:37:23] (also, very WIP! not even close to seriously proposing it yet)
[16:39:42] <_joe_> legoktm: agreed
[16:41:06] <_joe_> legoktm: I mean it's not like we can't if we want to (spin off a service now)
[16:41:36] <_joe_> and it would isolate some chance attack vector security-wise
[16:42:25] <_joe_> s/chance//
[16:44:06] yeah, I think there are other benefits like removing SRE from the upgrade path, avoiding the PHP+GPL problems, etc.
[16:44:18] but something for me to keep chipping away at when I'm bored :)
[16:44:22] diff generation is also something that could function as a pure lambda service/function (although it may end up being a bit verbose)
[16:50:41] <_joe_> mszabo: yeah that was the idea that we were floating
[16:52:33] I have this feeling that after a relatively volatile period of SOA-related experimentation, both our respective orgs have a much better understanding of when a given problem space is one where an external service is a good solution
[17:45:06] are the puppet issues in codfw expected? I know there were some resolved earlier but I'm seeing things like Warning: Error 500 on SERVER: Server Error: Could not retrieve facts for sessionstore2001.codfw.wmnet: Failed to find facts from PuppetDB at puppet:8140: Failed to execute '/pdb/query/v4/nodes/sessionstore2001.codfw.wmnet/facts' on at least 1 of the following 'server_urls':
[17:45:12] https://puppetdb2002.codfw.wmnet
[17:47:23] hnowlan: having a look
[17:51:49] hnowlan: I've restarted it and it seems to work now
[17:51:56] looking at what happened
[17:51:58] volans: looks good, thanks
[17:54:25] a really good critique of fork() that's highly informative and interesting, if you're into that kind of thing :) https://lobste.rs/s/cowy6y/fork_road_2019 -> https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf
[17:54:43] [from Microsoft of course, but it really is pretty well-done and valid]
[17:55:12] doesn't seem to be very new, but I ran across it on lobsters today
[18:00:46] I feel like I should already know the answer to this, but what's the best way to xfer a file between two hosts? (In this case `wcqs2002`->`wcqs2001`) I'm thinking maybe a one-off ferm firewall rule to allow ssh from the other host, but not sure what the usual protocol is
[18:01:42] Another option might be to scp from `wcqs2002` to a bastion host and then from the bastion to `wcqs2001`, but it prompts me for a password, which I'd imagine is because I have `ForwardAgent no` in my ssh config
[18:02:42] ryankemper: Have a look at this: https://wikitech.wikimedia.org/wiki/Transfer.py - You can run this from cumin.
[18:02:52] ryankemper: I'd recommend using the rsync::quickdatacopy class with puppet
[18:04:23] ryankemper: how large is the file?
[18:04:26] well, that's if it's more than a one-off. but any of these options over one-off ferm changes. scp won't work because you can't forward the agent, only with -3 and that is slow
[18:04:53] volans: 28G
[18:05:25] mutante: yeah it's a one-off
[18:05:47] transfer.py looks pretty convenient, first I've heard of that
[19:52:47] heh, from the fork paper: "Traditionally the standard way to build a concurrent server was to fork off processes. However, the reasons that motivated multi-process servers are long gone: OS libraries are thread-safe, and the scalability bottlenecks that plagued early threaded or event-driven servers are fixed."
[19:53:02] welll...
[19:53:32] not in this ecosystem it is not :))
[21:33:44] bblack: Interesting article. I've definitely been burned by the fork() model (as well as benefited from it)
[21:39:24] my main pragmatic takeaway was that, for cases where posix_spawn() *can* work, it's probably advisable to move legacy fork+exec code towards using it.
[21:39:57] That's a long road. :-)
[21:40:12] whereas my previous best understanding was basically the Linux manpage take on it, which starts off basically implying that you should only care about using posix_spawn() if you're doing embedded development on MMU-less machines.
[21:41:08] yeah, it is. But there are some cases that are pretty straightforward to convert, and even modern glibc on modern Linux gains some perf/efficiency, as it doesn't actually use a true full fork() under the hood when calling posix_spawn()
[21:43:09] I don't think it uses fork(2) for implementing fork(3).
[21:43:18] I usually see clone() in strace output
[21:43:39] mszabo: yeah, I agree (not in this ecosystem, always). But I think at the library/kernel layers, the scalability barriers for multi-threaded software have improved greatly since the legacy days. You can build some pretty efficient stuff on e.g. pthreads+epoll, and there are even better innovations at the kernel level now that are less well tested/known.
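(As an aside, the pthreads+epoll shape bblack describes looks roughly like the following -- a hypothetical toy server, not code from any WMF project: a fixed pool of identical worker threads, each running its own epoll_wait() loop against a shared nonblocking listener. The port, the 2x-cores pool size, and the bare-bones request handling are arbitrary choices for illustration, and error handling is omitted.)

    /* Toy pthreads+epoll server: a small fixed pool of identical
     * worker threads, each with its own epoll instance watching a
     * shared nonblocking listening socket. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    static int listen_fd;

    static void *worker(void *arg)
    {
        (void)arg;
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event evs[MAX_EVENTS];
        for (;;) {
            int n = epoll_wait(ep, evs, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                if (evs[i].data.fd == listen_fd) {
                    int c = accept(listen_fd, NULL, NULL);
                    if (c < 0)
                        continue; /* another worker won the accept race */
                    ev.events = EPOLLIN;
                    ev.data.fd = c;
                    epoll_ctl(ep, EPOLL_CTL_ADD, c, &ev);
                } else {
                    char buf[4096];
                    /* placeholder for real request handling */
                    if (read(evs[i].data.fd, buf, sizeof buf) <= 0)
                        close(evs[i].data.fd);
                }
            }
        }
        return NULL;
    }

    int main(void)
    {
        listen_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        struct sockaddr_in a = { .sin_family = AF_INET, .sin_port = htons(8080) };
        bind(listen_fd, (struct sockaddr *)&a, sizeof a);
        listen(listen_fd, SOMAXCONN);

        /* pool size: a small multiple of the available cores,
         * not one thread per connection */
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        int nthreads = ncpu > 0 ? (int)(2 * ncpu) : 4;

        pthread_t t[nthreads];
        for (int i = 0; i < nthreads; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_join(t[i], NULL);
    }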
[21:44:07] I think it would be hard at this stage, in a new codebase, to justify a forking-processes model as some kind of scalability win over threads+events
[21:44:49] and it is *really* hard to mix fork and threads safely and not mess things up.
[21:46:00] There was suspiciously no mention of Apple's libraries, which will disable the use of fork() past a certain point.
[21:46:32] you mean they disable fork() success after you've started up runtime threads or something?
[21:46:37] I hadn't heard about that
[21:47:05] yeah
[21:47:17] I'm trying to remember the details.
[21:47:19] (and yeah, glibc uses clone for fork, but it uses clone differently for posix_spawn)
[21:49:06] anyways, it's on my todo list to experiment on this by seeing if I can convert most of the fork+exec patterns in gdnsd to posix_spawn()
[21:49:34] there's also some fork-without-exec that wouldn't go away, but that's only for legacy (non-systemd) "daemonization" in early startup.
[21:50:00] not strictly necessary in any OS, really, but the pattern matches some expectations of some init scripts or users or whatever
[21:52:20] I think I have 3x fork->exec cases in the codebase, all of which happen (carefully!) at runtime with concurrent threads executing. I think at a glance they could all be migrated by using the spawn flags and file actions in the posix_spawn-related APIs.
[21:53:01] (as all they do between the two calls is generally some kind of sigaction calls or closing FDs and similar bits)
[21:53:23] we'll see! :)
[21:53:52] Good luck!
[22:20:51] <_joe_> I think I read that article some time ago, and yes it is interesting, but I also thought that, for most practical cases, it's irrelevant
[22:22:17] <_joe_> for process initialization from pid 1 and/or shell, I mean
[22:24:51] <_joe_> dancy: interesting thing re: the apple libraries, I didn't know about it
[22:28:28] yeah, my initial impression is that their argument (which seems reasonable, really) is that Linux and other kernels should come up with a better posix_spawn()-like interface that supports all the reasonable use-cases, deprecate fork/exec, and eventually replace them with less-efficient emulations that allow the rest of the system/kernel to operate and be structured in a less fork-constrained way.
[22:29:00] which seems like a noble aim, even if it might take a while to reach the end of that road
[22:30:34] (and here's to hoping that whatever improvement is made on posix_spawn(), it's at least informally standardized across major kernels, if not actually standardized in posix or something)
[22:30:57] s/kernels/libcs/ in some cases, whichever layer implements the various pieces of the puzzle
[22:31:33] <_joe_> bblack: AIUI the main issue with fork is that it's not thread safe. Everything else is about "we can't make POSIX microkernels easily if fork is still there"
[22:31:48] <_joe_> which.. ok. Not my problem
[22:32:35] <_joe_> (I once compared microservices to microkernels: you need a very good reason to complicate your life that much)
[22:34:20] <_joe_> microkernels can be interesting in the context of servers? maybe they can offer better isolation to container-like payloads, but it's all quite theoretical and if one needs stronger isolation there are options
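(To make the migration bblack is sketching concrete: the close()/sigaction() work that would otherwise run in the forked child between fork() and exec() maps onto posix_spawn file actions and spawn attributes. This is a hypothetical call site -- an invented fd and command, not gdnsd's actual code:)

    /* Sketch: a fork+exec call site migrated to posix_spawn(). The
     * child-side cleanup becomes declarative data instead of code
     * running in a forked child, which is what makes it safe to call
     * with concurrent threads executing. */
    #include <signal.h>
    #include <spawn.h>
    #include <sys/wait.h>

    extern char **environ;

    int spawn_helper(int private_fd) /* hypothetical fd to keep from the child */
    {
        char *argv[] = { "logger", "-t", "demo", "hello", NULL };

        posix_spawn_file_actions_t fa;
        posix_spawn_file_actions_init(&fa);
        /* was: close(private_fd) in the child after fork() */
        posix_spawn_file_actions_addclose(&fa, private_fd);

        posix_spawnattr_t attr;
        posix_spawnattr_init(&attr);
        /* was: sigaction() calls in the child resetting dispositions */
        sigset_t all;
        sigfillset(&all);
        posix_spawnattr_setsigdefault(&attr, &all);
        posix_spawnattr_setflags(&attr, POSIX_SPAWN_SETSIGDEF);

        pid_t pid;
        int rc = posix_spawnp(&pid, argv[0], &fa, &attr, argv, environ);
        posix_spawn_file_actions_destroy(&fa);
        posix_spawnattr_destroy(&attr);
        if (rc != 0)
            return -1;

        int status;
        waitpid(pid, &status, 0);
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }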
[22:36:17] <_joe_> (and for instance, amazon's firecracker is now decently integrated with kubernetes, AIUI)
[22:40:44] yeah I kinda saw the microkernel implication in there too, but I think it's fair to say that even in a monolith kernel, the constraints of the fork model probably force a centralization of certain data structures that can affect scalability, etc.
[22:41:27] and the thread-safety thing, I mean, it's hard on app devs trying to use fork+threads safely, but it's also hard on everything else in the kernel/libc dealing with those edge cases in a sane and bug-free way too.
[22:42:53] <_joe_> also posix_spawn seems like a pita to use, just sayin
[22:42:58] it does!
[22:42:58] bblack: definitely, PHP seems to be pretty much an outlier in this regard with its preference for a multiprocess model due to its shared-nothing philosophy - ZTS doesn't seem quite as performant, or popular, or as well supported by extensions/core proper
[22:43:37] <_joe_> mszabo: you know we used a non-multiprocess implementation of PHP for a few years, right?
[22:43:53] <_joe_> let's say things like apcu were a *tad* more performant
[22:43:56] well HHVM presumably does things smarter than ZTS :))
[22:44:26] and yeah it would help with apcu and other stuff that needs to operate across the shared-nothing boundary, it's just that there isn't much of that to begin with on the Zend side of things
[22:44:52] I'm not privy to HHVM internals, but it's probably a cleanroom implementation that was never saddled with whatever baggage ZTS is carrying with itself
[22:44:58] <_joe_> yeah the genius idea of php, "the scope is the http request", is also what brought that model
[22:45:26] <_joe_> especially when php was strongly tied to httpd which was using a prefork model anyways
[22:45:54] keep in mind, there's probably lots of examples of the fork->exec() pattern out there which are inherently unsafe/buggy. The range of things you can legally do safely between the two calls is so limited, it should be possible to wrap it up in a single call with some data structures built, the way posix_spawn() does, and cover all the safe cases.
[22:46:08] <_joe_> so yes, the reason why php-fpm is multiprocess is that 1) threads are hard 2) you were naturally not in need of them in most cases
[22:46:08] maybe their API for that is ugly, but it's on the right track in terms of shape and capabilities
[22:47:51] ZTS in the Zend engine is a mess, in part because it was added as a bolt-on for working with IIS, and the internal docs on it are horrible
[22:47:54] <_joe_> oh sure, I read the manpage earlier and I was... ok now give me a higher level library I can call to do exactly this with most reasonable defaults for all those damn options
[22:48:30] <_joe_> because it's clear posix_spawn() tries to do the right thing
[22:48:55] but yeah outside of the PHP world, from what I gathered, "state of the art" server runtimes like netty favor a model where a small bounded thread pool leverages kernel-level async IO facilities like epoll/io_uring
[22:49:12] yeah
[22:49:24] <_joe_> yes, pretty much anything works upon a variation of that
[22:49:49] I'm in the camp that favors that kind of model too (the camp that thinks threads are awesome, but a given pool of identical threads should have a pool size at most a small multiple of the available CPU cores, as opposed to e.g. thread-per-connection/request)
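(The "higher level library with reasonable defaults" _joe_ is asking for could be as small as this -- a hypothetical wrapper, not an existing API: PATH lookup, inherited environment, signal dispositions reset to defaults, and a blocking wait for the exit status:)

    /* run(argv): spawn a command with sane defaults and return its
     * exit status, hiding the posix_spawn attribute boilerplate. */
    #include <signal.h>
    #include <spawn.h>
    #include <sys/wait.h>

    extern char **environ;

    int run(char *const argv[])
    {
        posix_spawnattr_t attr;
        posix_spawnattr_init(&attr);
        sigset_t all;
        sigfillset(&all);
        posix_spawnattr_setsigdefault(&attr, &all);
        posix_spawnattr_setflags(&attr, POSIX_SPAWN_SETSIGDEF);

        pid_t pid;
        int rc = posix_spawnp(&pid, argv[0], NULL, &attr, argv, environ);
        posix_spawnattr_destroy(&attr);
        if (rc != 0)
            return -1;

        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    /* usage: char *cmd[] = { "ls", "-l", NULL }; int code = run(cmd); */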
[22:50:17] I <3 PHP's shared-nothing approach, but that too is really an artifact of its origins. PHP/FI was a cgi script.
[22:50:35] <_joe_> bd808: yeah and that's the sole reason for php's success
[22:50:59] totally! shared nothing is sooo much easier to reason about
[22:51:03] <_joe_> more than any of the ludicrous design choices of that language
[22:51:03] yeah, our java microservices use jetty under the hood, which follows the legacy threading pattern, and sizing the thread pools/queues properly is a pain in itself
[22:51:14] varnish exists in one of the other corners of this design space. They love threads, but they need many thousands of them because it's a thread-per-thing sort of model (if you want any part of 10K requests/conns making progress in parallel, you need 10K+ threads)
[22:51:22] php is not one language, it's 3 languages in a trench coat that wait for you in a dark alley and hit you in the head
[22:52:04] that kind of threading model was more popular/acceptable in the 90s (and yeah you still see a lot of it in e.g. javaland), but I think it's a pretty outmoded way of doing things now.
[22:52:38] yeah, even more recent JVM frameworks like oracle's helidon use a netty core instead (and they still implement the Jakarta EE stuff on top of it)
[22:53:15] <_joe_> bblack: actually I think jetty's model is an intermediate - you can define a relatively small threadpool and have a rather generous queue
[22:53:30] <_joe_> and it makes it more efficient than allowing a ton of threads too
[22:53:46] that's sort of what we ended up doing yeah, trying to gauge the reasonable number of threads with some basic load tests and sizing the pool and queue accordingly
[22:53:50] <_joe_> from my sad memories from a past job, at least
[22:53:57] typically we ended up with thread pools in the low hundreds
[22:54:26] yeah, all these things are pretty handwavy until you get into the specifics of your real needs, then there are always reasons to deviate a bit
[22:54:29] <_joe_> mszabo: which is not bad, because your java apps would make a lot of external calls
[22:54:51] <_joe_> so you had a lot of threads waiting on i/o I guess :)
[22:54:54] yah, it's all sync IO
[22:55:23] <_joe_> in a way, mediawiki is more cpu bound than io bound, which is refreshing compared to most other applications
[22:55:24] I think varnish's philosophy on its many-threads model mirrors how they offload a lot of storage mgmt to libc/jemalloc as well. It's a sort of "let the OS handle what the OS can handle, that gets more attention and should be better at it"
[22:55:47] so the OS is handling their hundred-thousand-threads-for-hundred-thousand-reqs concurrency, and managing paging/memory/storage for them too.
[22:55:57] that's interesting, I thought it was still io bound given a high enough parser cache hit ratio
[22:56:20] <_joe_> mszabo: yeah, but those are the fast, uninteresting requests :)
[22:56:27] ;D
[22:56:35] <_joe_> take a look at our parsoid cluster, that is cpu-bound
[22:56:39] and their philosophy is not without merit, but... the OS doesn't know exactly what your app code is doing, and so its generic attempt to handle all such cases will never be close to optimal for a heavy/specific usecase
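(For contrast with thread-per-request, the "relatively small threadpool + rather generous queue" arrangement _joe_ describes for jetty reduces, in C terms, to a bounded producer/consumer queue -- a toy analogue, not how jetty is actually implemented; QUEUE_CAP and NWORKERS are arbitrary illustrative values:)

    /* Fixed worker pool consuming from a bounded FIFO. A full queue
     * blocks submit(), giving backpressure instead of unbounded
     * thread growth. */
    #include <pthread.h>
    #include <stddef.h>

    #define QUEUE_CAP 1024 /* the "rather generous queue" */
    #define NWORKERS  16   /* the "relatively small threadpool" */

    typedef void (*task_fn)(void *);
    typedef struct { task_fn fn; void *arg; } task_t;

    static task_t queue[QUEUE_CAP];
    static size_t head, tail, count;
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

    void submit(task_fn fn, void *arg)
    {
        pthread_mutex_lock(&mu);
        while (count == QUEUE_CAP) /* queue full: block the producer */
            pthread_cond_wait(&not_full, &mu);
        queue[tail] = (task_t){ fn, arg };
        tail = (tail + 1) % QUEUE_CAP;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&mu);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&mu);
            while (count == 0)
                pthread_cond_wait(&not_empty, &mu);
            task_t t = queue[head];
            head = (head + 1) % QUEUE_CAP;
            count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&mu);
            t.fn(t.arg); /* run the task outside the lock */
        }
        return NULL;
    }

    void pool_start(void)
    {
        for (int i = 0; i < NWORKERS; i++) {
            pthread_t tid;
            pthread_create(&tid, NULL, worker, NULL);
            pthread_detach(tid);
        }
    }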
[22:56:54] I remember running the Parsoid test suite on PHP 8 with JIT, and it appears to have actually brought a measurable performance improvement for that workload
[22:56:59] something in the double-digit seconds or somesuch
[22:57:04] <_joe_> mszabo: SHHHH
[22:57:11] <_joe_> don't let Tim hear you
[22:57:24] <_joe_> or he'll come back insisting we move to php 8 now
[22:57:36] :D
[22:57:43] 8.2 please and thank you
[22:58:37] wasn't there a Java implementation of remex before it became a PHP library?
[22:58:42] depurator or something
[23:00:01] ah, https://www.mediawiki.org/wiki/Html5Depurate
[23:05:37] to be fair, one disadvantage of Java is that it is not nearly as attractive for hiring purposes as e.g. golang
[23:06:04] mszabo: we are a PHP shop :)
[23:06:29] the least good "recruiting" language
[23:06:48] yeah, same thing ;)
[23:07:17] my hatred for Java is mostly old wounds from 1999-2006 and the JVMs of that era being utterly crap at GC at scale
[23:07:18] the upside of being a relatively bigger shop that's primarily using some $less_popular_language is that you can nab a greater share of the best engineers who know it and/or are willing to work in it.
[23:07:31] and the job titles do not have to include "ninja", "hacker" or "mage"
[23:07:37] there are great engineers that love every language, but in the $popular_language pools, it's harder to hire the best of them :)
[23:08:31] definitely, it can act as a good filter
[23:09:47] I never stop giggling about all the "PHP is crap" hate and how so much of the web happened exactly because PHP is not crap in practice at all. MediaWiki, Flickr, YouTube, Etsy, Facebook, ...
[23:11:08] I guess their counter to that would be "did that happen because of or in spite of PHP", but the fact that Facebook, with the resources and talent at their disposal, opted to create a fresh PHP runtime instead of moving to, or rewriting in, a different language sort of speaks volumes here I feel
[23:11:56] also most of that hate usually ends up referring to the same old blog post from 2012, the reign of PHP 5.x
[23:22:11] Ultimately "choose boring tech" is the part that matters most I think. At least until the complexity of your problem and the funding for its solution allow you to afford more complicated things.
[23:22:25] http://boringtechnology.club/
[23:44:38] is anyone old enough to remember why LocalisationUpdate got disabled?
[23:47:37] specifically 2017-03-24T02:24 l10nupdate slows down recovery: !log Killed l10nupdate on tin, was blocking emergency pushes
[23:53:00] task digging suggests it was a scap lock
[23:54:03] but we had someone deploying (which for us is basically an rsync wrapper + composer install + updating the gitinfo cache) and l10n kicked off a few minutes before
[23:54:10] and php workers were exhausted
[23:54:31] on deployment.miraheze.org
[23:54:58] (which made the deploy tool kill itself as it saw the outage)
[23:55:11] php-fpm required a restart to unlock
[23:55:44] i saw a significant drop in available workers too during a test just now
[23:56:14] RhinosF1: it was a very fragile system that broke a lot. Once we got to weekly train deploys, the general consensus from those of us who had worked on l10nupdate was that wikis could live without daily message updates
[23:56:49] bd808: i think it may have tried to take miraheze down
[23:57:02] all the code is still in puppet, it's just 'run_l10nupdate => false'
[23:59:17] mutante: T158360 We could probably finish up removing it.
[23:59:18] T158360: RFC: Reevaluate LocalisationUpdate extension for WMF - https://phabricator.wikimedia.org/T158360