[17:12:55] Does anyone know how to fix the phan error in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/806562? I've added a hook to the FlaggedRevs extension and added it to the .phan/config.php file. However, the tests still fail with 'PhanUndeclaredTypeParameter' and 'PhanUndeclaredClassMethod'.
[17:13:34] My understanding is that once I added the FlaggedRevs folder to the phan config it would look in that folder for the class, find it, and then not raise the issue
[17:13:42] However, this doesn't seem to do that.
[17:17:30] Dreamy_Jazz: you also need to send a patch to this file https://gerrit.wikimedia.org/g/integration/config/+/master/zuul/parameter_functions.py#511 so that the FlaggedRevs repository is cloned in the CI env
[17:17:56] Okay. Thanks, I'll do that.
[17:27:11] Quick question before I get too far into this: what do people tend to do for getting MediaWiki-Vagrant set up on macOS M1? VirtualBox obviously doesn't seem possible.
[17:27:11] M1: MediaWiki Userpage - https://phabricator.wikimedia.org/M1
[17:27:42] Does libvirt work out okay?
[18:39:38] Sigh, and it seems libvirt isn't too happy with Apple Silicon either.
[20:30:14] Eventually got something set up with docker (no vagrant), so never mind on the above.
[22:27:29] Nemo_bis: Do you know, or have a way to contact someone who knows, where the code for archive.org's event stream logic is open sourced, if it is? RE: https://www.mediawiki.org/w/index.php?title=Archived_Pages&diff=3361530&oldid=3044307
[22:28:01] the repo at https://github.com/internetarchive/crawling-for-nomore404/ seems outdated
[22:28:10] (context: I wrote an article about it - https://timotijhof.net/posts/2022/internet-archive-crawling/ )
[22:57:34] perryprog: I think most developers no longer use VMs/Vagrant for development, but rather either use a more lightweight docker-based setup, or use/dual-boot Linux, or run MW directly from a php/apache setup installed with Homebrew.
[22:57:57] Running something directly in $CURRENT_YEAR?!?!
[22:58:33] perryprog: note that for a (very) basic install you don't need vagrant, docker, mysql, or even apache. If you have PHP (possibly from homebrew) you can run `composer mw-install:sqlite` and then `composer serve` and be on your way.
[22:58:49] But really, I was pleasantly surprised by how easy it was to get something running from core's default docker compose setup.
[22:59:18] Huh, didn't even think of doing that. That probably could've worked in this case, to be honest.
[22:59:21] unlike npm and the like, we use very few dependencies and audit them, so while I personally do use docker (see DEVELOPERS.md and mw:MediaWiki-Docker), I wouldn't be *too* worried about running php directly.
[23:01:10] In any event, I assume mediawiki-docker would work on M1, possibly requiring you to opt in to the x86 profile, but that's a given at this point when using docker. I'm not aware of us using anything that wouldn't be supported under that translation, and benchmarks suggest x86 on M1 is still faster than the previous-gen Macs were natively.
[23:01:11] M1: MediaWiki Userpage - https://phabricator.wikimedia.org/M1
[23:02:05] Anyway, glad the docker setup worked for you :)
[23:02:41] I've just come from a few weeks of working in nodejs exclusively, so it's funny you mention that... :P
[23:03:29] Having a sane amount of dependencies to just "do stuff" is simply not something I've experienced in a while.
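For the phan errors discussed at the top of this log (17:12–17:17): the usual pattern for a MediaWiki extension that soft-depends on another is to add the other extension's directory to both `directory_list` and `exclude_analysis_directory_list` in `.phan/config.php`, so phan can read the FlaggedRevs classes without analysing them. Below is a rough sketch of what that addition could look like in CheckUser's `.phan/config.php`, assuming the standard `mediawiki-phan-config` base config; the exact paths and contents of the real patch may differ.

```php
<?php
// .phan/config.php — sketch only, not the actual CheckUser patch.
$cfg = require __DIR__ . '/../vendor/mediawiki/mediawiki-phan-config/src/config.php';

// Let phan read the FlaggedRevs classes so that hook handlers which
// type-hint FlaggedRevs classes no longer trigger
// PhanUndeclaredTypeParameter / PhanUndeclaredClassMethod...
$cfg['directory_list'] = array_merge(
	$cfg['directory_list'],
	[ '../../extensions/FlaggedRevs' ]
);

// ...but don't report issues *inside* FlaggedRevs itself.
$cfg['exclude_analysis_directory_list'] = array_merge(
	$cfg['exclude_analysis_directory_list'],
	[ '../../extensions/FlaggedRevs' ]
);

return $cfg;
```

This only helps if a FlaggedRevs checkout actually exists next to CheckUser, which is why the zuul parameter_functions.py dependency change mentioned at 17:17 is still needed for the repository to be cloned in CI.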
[23:08:01] i run mediawiki in a virtual machine for development, but i don't use vagrant or docker, just a normal ubuntu installation where i manually installed everything i needed
[23:33:37] Krinkle: nice blog post, but I don't understand why you write "the Internet Archive does not try to index the entire Internet" when later you mention that since 2010 it actually does ("Wide Crawl collections represent traditional crawlers. These starts with a seed list, and go breadth-first to every site it finds")
[23:36:06] Nemo_bis: "within a certain global and per-site depth limit". Generally speaking, I understand that only a relatively small portion of what is in a typical commercial search engine index at any given moment is in the archive, even if it's a page that's been around for a few years. It prioritises based on connection and importance (which I think makes a lot of sense), but it just shows a different emphasis. Afaik the crawling is mostly targeted at getting a shallow gist of the likely-important stuff out there.
[23:37:05] I believe at some point the Internet Archive was simply querying for all external links. It might even have been a semi-manual process where they created the seeds for Heritrix. Such things can also happen with simple scripts which trigger derivation of an archive.org item by the crawlers, AFAIK. I usually add what I know to https://www.mediawiki.org/wiki/Archived_Pages
[23:38:06] Krinkle: that's not my understanding; the Internet Archive tries to archive everything, it just doesn't manage to be fast enough to cover as many URLs as Google or Bing do because of a lack of resources.
[23:39:24] Also, Google increasingly uses privately reported seeds, such as those from Google Webmaster users, so it is able to index many URLs that are simply impossible to find by any open method or by following links.
[23:49:10] legoktm may know more about the specifics of the IA's EventStream usage, but if you're interested in the code, that might be a bit hard, as that part of the Archive tends to be proprietary and sometimes connected to private deals or work for Archive-It customers. I've just noticed a new repo, "Crawl HQ v3", hmm. https://git.archive.org/wb/gocrawlhq
[23:49:28] If you have specific questions I can recommend someone at the IA you can ask. :)
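On the Internet Archive / EventStreams thread (22:27 and 23:33 onward): the IA's own consumer code appears not to be public, but the feed it would be consuming is the public Wikimedia EventStreams service. As a minimal, hypothetical sketch — not the Archive's actual implementation — a "no more 404"-style consumer just reads the `recentchange` SSE stream and collects page URLs for a crawl queue. Something like this in PHP, assuming the public endpoint at stream.wikimedia.org:

```php
<?php
// Sketch of an EventStreams consumer, not the Internet Archive's code.
// Subscribes to Wikimedia's public "recentchange" feed over SSE and prints
// the URL of every changed page — roughly what a follow-up crawler would
// push into its crawl queue.
$streamUrl = 'https://stream.wikimedia.org/v2/stream/recentchange';

$context = stream_context_create( [
	'http' => [
		'header' => "Accept: text/event-stream\r\n",
	],
] );

$handle = fopen( $streamUrl, 'r', false, $context );
if ( $handle === false ) {
	fwrite( STDERR, "Could not open event stream\n" );
	exit( 1 );
}

while ( !feof( $handle ) ) {
	$line = fgets( $handle );
	// SSE data lines are prefixed with "data: " and carry one JSON event each.
	if ( $line === false || strncmp( $line, 'data: ', 6 ) !== 0 ) {
		continue;
	}
	$event = json_decode( substr( $line, 6 ), true );
	// meta.uri is the canonical URL of the changed page in recentchange events.
	if ( is_array( $event ) && isset( $event['meta']['uri'] ) ) {
		echo $event['meta']['uri'], "\n";
	}
}

fclose( $handle );
```

Whatever the IA actually runs on top of this (deduplication, prioritisation, handing off to Heritrix or gocrawlhq) is the proprietary part discussed at 23:49.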