[03:28:45] !log wm-bot updated cloak in admins config
[03:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wm-bot/SAL
[11:14:06] andrewbogott: done, sorry about that
[11:14:16] (next time feel free to just kill my session if I don't respond)
[11:14:58] !log tools reboot tools-sgebastion-10 for T301736
[11:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:15:02] T301736: toolsbeta.automated-toolforge-tests membership causes "groups: cannot find name for group ID 54872" error message - https://phabricator.wikimedia.org/T301736
[11:16:07] !log tools purge debian package `unscd` on tools-sgebastion-10/11 for T301736
[11:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:49:22] !log tools invalidate sssd cache in all bastions to debug T301736
[11:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:49:26] T301736: toolsbeta.automated-toolforge-tests membership causes "groups: cannot find name for group ID 54872" error message - https://phabricator.wikimedia.org/T301736
[13:34:21] !log tools.lexeme-forms deployed 56e69bad1a (l10n updates)
[13:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[14:15:27] !log admin [codfw1dev] added domain-wide 'designateadmin' and 'observer' roles to codfw1dev-proxy-dns-manager service account T295246
[14:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:15:31] T295246: Dynamicproxy API should be useful without the Horizon dashboard - https://phabricator.wikimedia.org/T295246
[14:21:43] !log paws Deploying minesweeper
[14:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[16:04:29] !log paws updating pywikibot 2fc27c99db35515121f5b80c9f3df5e648e56b58
[16:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[16:08:24] !log devtools created devtools.wmcloud.org dns zone for the devtools project T301793
[16:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Devtools/SAL
[16:08:28] T301793: Add wmcloud.org DNS zone to WMCS project devtools - https://phabricator.wikimedia.org/T301793
[17:32:50] hello Cloud Services! Starting around 2022-02-10, Pageviews Analysis (and its bundled apps) has been running rather slow. It seems there's an issue with some requests to the backend timing out or failing entirely. The backend is just cheap PHP scripts that query the replicas and/or tools-db. But as far as I can tell the issue is that it's not even making it that far; the backend never receives the request at all
[17:33:11] task is at https://phabricator.wikimedia.org/T301649 (see also the comments)
[17:33:40] I increased the number of k8s pod replicas; that made some errors go away, but the symptoms are still there as far as I can tell
[17:34:08] is anyone familiar with or heard of an issue like this before? any tips for debugging?
[17:34:51] the last paragraph in this comment I think sums up the issue best: https://phabricator.wikimedia.org/T301649#7709983
[17:36:23] scrollback to crg at Feb 13, 01:17 for someone reporting pageviews slowness. idk if that's helpful.
[17:36:40] (this channel)
[17:37:50] yeah I saw that. They were referring to the Topviews tool, which makes ~10 requests to the backend, so the issue is most evident there: https://pageviews.toolforge.org/topviews
[17:38:27] in short, once you see that page successfully load, you should be able to refresh and things load instantly because it pulls from APCu cache. But if you check the network log you'll see that sometimes some requests just hang
[17:38:50] so it's stateless?
[17:39:11] aside from APC cache
[17:40:24] yes that would be fair to say. There's some clientside caching on some tools, but regardless it's not something wrong with the backend, rather connecting to it
[17:40:41] there were no code changes that caused this, either
[17:41:01] my guess is there's been an increase in traffic and the k8s pod can only handle so much
[17:41:18] what effects are you seeing exactly? just slow requests or are you getting some errors too?
[17:41:51] the clientside abandons the request after ~30 secs I think. I never get a response to check, and there's nothing logged in error.log
[17:43:20] if I check the network log and directly load the URI that timed out, it always loads no problem
[17:44:08] and sometimes the request does successfully make it to the backend, it's just slow. It all seems like a networking issue
[17:45:45] the errors at https://phabricator.wikimedia.org/T301649#7706613 made perfect sense at first, but I am no longer seeing those in error.log after I increased the number of k8s pod replicas
[17:46:24] have you tried querying the db yourself, maybe from the CLI while it's misbehaving? have you found any sequence of requests or time of day that can trigger this condition?
[17:47:05] is there connection pooling?
[17:47:31] yeah if I run it on my local (with an SSH tunnel to the db replicas) it all loads fine, every time. Not seeing any issue with my other tools that talk to the replicas or tools-db, either
[17:47:38] this chart shows the response timings as measured by the proxy that sits directly before the webservice pods:
[17:47:38] https://tools-prometheus.wmflabs.org/tools/graph?g0.range_input=1w&g0.expr=sum(nginx_ingress_controller_request_duration_seconds_sum%7Bnamespace%3D%22tool-pageviews%22%7D)%20by%20(status)%20%2F%20sum(nginx_ingress_controller_request_duration_seconds_count%7Bnamespace%3D%22tool-pageviews%22%7D)%20by%20(status)&g0.tab=0
[17:48:12] 499 is an nginx-specific code for "Used when the client has closed the request before the server could send a response.", so browser timeout I think
[17:48:13] ah! that is very telling
[17:48:56] so I'm wondering if you're suddenly getting hit by some misbehaving crawler
[17:49:00] 499 sounds exactly like what I'm experiencing, and I see this started on the 10th
[17:49:14] yes, I was thinking I could turn on the access.log
[17:49:34] see also https://phabricator.wikimedia.org/T226688 :)
[17:49:43] heh
[17:49:48] I'm grepping the front proxy logs atm
[17:50:42] okay thank you! I have a meeting here in 10 mins, so I will have to depart for about an hour. I feel like we're on to something, though! thank you for the help so far :)
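For readability, the URL-encoded Prometheus query linked above decodes to the following PromQL, which plots the mean request duration per HTTP status (duration sum divided by request count, grouped by status):

```
sum(nginx_ingress_controller_request_duration_seconds_sum{namespace="tool-pageviews"}) by (status)
  / sum(nginx_ingress_controller_request_duration_seconds_count{namespace="tool-pageviews"}) by (status)
```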
[17:54:34] I just turned on the access.log. In my experience with VPS tools that have the XFF header enabled, most web crawlers have human-like UAs, so it may be difficult to infer which traffic is organic and which is not
[17:55:05] nothing obvious that I can see :( even the most active individual IPs are only at a couple thousand requests per day; those are obviously crawlers too, per the IP addresses, but they don't explain the sudden jump in the graphs
[17:55:08] also, I don't think most web crawlers load JavaScript, right? that was my theory as to why Pageviews was immune to the whole web crawler problem, up until now
[17:56:33] one thing you could try to reduce load on the kubernetes pods would be to serve images, localization files, and other static content from tools-static: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Static_file_server
[17:57:57] okay I will try that, seems like that couldn't hurt
[17:59:05] I haven't noticed any static assets failing to load so far, only the AJAX requests. But again it can't hurt to load assets via tools-static
[18:16:26] jesus CHRIST
[18:16:27] background-runner-8479db69c7-ldlnc 0/1 CrashLoopBackOff 12845 124d
[18:16:41] 12845 restarts in that pod 😬
[18:17:03] that reminds me to finish off my thing to automatically kill off things that have been in CrashLoopBackOff for a while
[18:17:17] @lucaswerkmeister I think taavi was working on some logic to deal with that
[18:18:12] !log tools.quickcategories updated venv (includes mwparserfromhell 0.6.4) and recreated background-runner deployment from scratch (it had been stuck in CrashLoopBackOff at 12845 restarts, let’s see if this helps)
[18:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL
[18:19:55] nope, it’s still restarting
[18:20:23] !log paws added psp for minesweeper
[18:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[18:21:20] !log tools delete tools-package-builder-03
[18:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:21:38] !log tools.quickcategories deleted background-runner deployment again, I have a stack trace, let’s see if I can make sense of it
[18:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL
[18:38:17] !log tools Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
[18:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:40:59] !log tools.quickcategories deployed 3db46cd631 (update for cachetools 5.0.0), reset all pending jobs to planned, recreated background runner
[18:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL
[18:43:36] !log tools Enabled puppet on tools-proxy-05
[18:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:04:02] !log tools.quickcategories deployed 6fc4cf0508 (recognize interwiki titles) with background runner restart, seems to be going better now
[19:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL
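On the auto-killer idea mentioned at 18:17, a minimal sketch, assuming the `kubernetes` Python client and cluster credentials, of how pods stuck in CrashLoopBackOff could be found; the restart threshold and the commented-out deletion step are illustrative assumptions, not the actual Toolforge tooling:

```python
# Minimal sketch: list pods stuck in CrashLoopBackOff across all namespaces.
# The threshold of 1000 restarts is an arbitrary assumption for illustration.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    for cs in pod.status.container_statuses or []:
        waiting = cs.state.waiting if cs.state else None
        if waiting and waiting.reason == "CrashLoopBackOff" and cs.restart_count > 1000:
            print(f"stuck: {pod.metadata.namespace}/{pod.metadata.name} "
                  f"({cs.restart_count} restarts)")
            # a real implementation might delete the pod here:
            # v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
```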
[19:05:07] !log tools.quickcategories restarted webservice with `kubectl rollout restart deployment quickcategories`, the `webservice restart` apparently didn’t work? (pod age was still 125d)
[19:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL
[19:06:26] !log tools.quickcategories deployed bb178f29b2 (missing template files)
[19:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL
[19:16:36] !log tools.quickcategories deployed 3627eb9988 (fix query)
[19:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL
[19:19:03] *slaps roof of tool* this bad boy can fit so many bugs in it
[19:20:07] !bash *slaps roof of tool* this bad boy can fit so many bugs in it
[19:20:08] taavi: Stored quip at https://bash.toolforge.org/quip/f7rT_n4Ba_6PSCT96NHu
[19:20:14] lol
[19:21:27] (now the question is, can I be annoyed at my users for not telling me they had issues with the tool, or is it my fault for not looking into the logs)
[20:08:27] !log tools.wd-shex-infer restarted webservice (had been crashing with “too many open files” since 2022-01-23)
[20:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wd-shex-infer/SAL
[20:11:22] hm, are there known issues with gridengine uwsgi-plain tools at the moment? https://wd-shex-infer.toolforge.org/ is showing the standard 503 error page but the webservice claims to be running with no errors in uwsgi.log
[20:11:50] (I would expect the tool to still target the old Stretch grid for now)
[20:12:08] aha, now it’s working
[20:12:16] guess I just forgot how long it takes to come back up 🤷
[20:33:50] !log tools.trusty-tools Kill webgrid-lighttpd job that should not have been running
[20:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.trusty-tools/SAL
[20:54:58] beta seems to be down
[21:01:10] mutante: I see dancy poking at some things that might be related in #wikimedia-releng
[21:01:43] bd808: ACK, just pinged as well a second ago :)
[21:21:36] @andrewbogott do you think the nfs mount changes at wikidumpparse broke https://humaniki.wmcloud.org/ ?
[21:21:58] who's in that horizon project now?
[21:22:29] It might have broken it but it might also be that it just didn't survive a reboot. What in particular is broken? (url works for me)
[21:22:39] anyone know the best way to get ahold of Max Klein?
[21:23:02] oh, I see, something on the api server...
[21:23:18] idk, exactly, waiting on reply about that.
[21:23:19] but at first glance:
[21:23:21] NaN% and Unreachable : "https://humaniki.wmcloud.org/api/v1/gender/gap/latest/gte_one_sitelink/properties?label_lang=en" (re @wmtelegram_bot: It might have broken it but it might also be that it just didn't survive a reboot. What in particular is broken? ...)
[21:23:45] any idea which server serves that api?
[21:24:10] I think it's the same domain as the website?
[21:24:13] idk
[21:26:07] is there a service.template keyword to target the buster grid with a gridengine webservice?
[21:27:10] hmmmm I wonder if I clobbered an existing mount
[21:27:15] (I told the person that reported it to join here, no reply yet) (re @jeremy_b: idk, exactly, waiting on reply about that. [21:27:15] but at first glance: [21:27:16] NaN% and Unreachable : "https://humaniki.wmcloud.org/api/v1/ge...)
[21:27:41] nope, I didn't
[21:28:02] so that api is redirected to a service at /srv/humaniki/humaniki-backend/humaniki.sock
[21:28:12] which is probably not running
[21:28:18] looks like it’s `release`, judging from the webservice code :) (re @lucaswerkmeister: is there a service.template keyword to target the buster grid with a gridengine webservice?)
[21:28:46] does it just not start up all the way on boot?
[21:28:55] no idea, I'm just guessing
[21:29:15] right me too 😃
[21:29:15] i mean, that sock file doesn't exist. So that suggests it did not start up :)
[21:29:23] right
[21:29:32] maybe a clue in bash history?
[21:29:49] it's started by 'gunicorn.sh'
[21:30:25] @lucaswerkmeister: yeah, I think adding a `release: buster` value to service.template should work.
[21:30:26] there, did I fix it?
[21:30:48] maybe we need something like chaos monkey
[21:31:44] jeremy_b, is it working now?
[21:31:47] I asked. looks better to me (re @wmtelegram_bot: there, did I fix it?)
[21:31:53] !log wikidumpparse started 'gunicorn' service on humaniki-backend. Apparently it doesn't start itself on reboot.
[21:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidumpparse/SAL
[21:32:20] ok, I'm withdrawing for now :) I suggest you open a bug or something about getting that service to auto-start.
[21:32:38] I never heard of the service before so idk what it's supposed to look like
[21:32:43] @jeremy_b: mostly we need an easier-to-use automation system for Cloud VPS instances than "make a patch to ops/puppet.git and hope someday it gets merged" so that folks actually automate their instances' config and startup
[21:32:48] 👍 I agree. thanks
[21:33:22] bd808: I think in this case it already has a hand-made service module so it's probably a one-liner to make it autostart
[21:33:39] oooh, pip on the stretch bastion has fancy colorful progress bars ^^
[21:34:31] andrewbogott: *nod* and good for them to be that close to automated
[21:34:49] agreed! I was expecting 'run in screen'
[21:35:09] hah lol
[21:35:10] I fixed one like that a month or so ago. so gross
[21:35:27] nohup > screen?
[21:35:31] tmux???
[21:36:02] * bd808 waves a trout near @jeremy_b
[21:36:30] systemd unit provisioned by Puppet is the correct answer ;)
[21:36:41] farmed or wild? whole, fillet, or smoked? hot or cold smoke?
[21:37:00] hmm, my webservice seems to be running on tools-sgewebgrid-generic-09092, which says Debian 9 / stretch in /etc/os-release
[21:37:15] so I think the release isn’t working right…
[21:37:19] (wd-shex-infer tool)
[21:38:38] the service.manifest, updated just now, has release: stretch
[21:39:22] @lucaswerkmeister: Hmm... so the template did not work?
[21:39:31] as far as I can tell no
[21:39:40] * bd808 goes to look at code
[21:40:02] could webservice be missing a release line in the argparser.set_defaults call around line 328?
[21:40:24] Who wants to write a regex for me? Given a string-with-an-unknown-number-of-dashes I want to truncate at the last dash, producing 'string-with-an-unknown-number-of'
[21:40:32] it adds a --release={} arg to tmpl_argv but then doesn’t get it out of the parsed template_args again, I suspect
[21:41:02] raises hand (re @wmtelegram_bot: Who wants to write a regex for me? Given a string-with-an-unknown-number-of-dashes I want to truncate at the last...)
[21:41:17] @lucaswerkmeister: yeah. you found the bug. Do you want to send a patch or have me work on it?
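The failure mode described at 21:40:32, reduced to a self-contained sketch; the names mirror the discussion (tmpl_argv, template_args), but this is a hypothetical simplification, not the actual tools-webservice source:

```python
# Hypothetical simplification of the bug discussed above: a value from
# service.template is fed into a synthetic argv and parsed, but then never
# read back out of the parsed namespace.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--backend")
parser.add_argument("--release")

template = {"backend": "gridengine", "release": "buster"}

# webservice-style pattern: turn the template's key/value pairs into argv
tmpl_argv = ["--{}={}".format(k, v) for k, v in template.items()]
template_args = parser.parse_args(tmpl_argv)

# BUG: only some parsed attributes are copied into the effective config;
# template_args.release is parsed correctly but never consulted, so the
# job silently lands on the default (stretch) grid as reported above.
effective = {"backend": template_args.backend}
print(template_args.release)  # 'buster' -- parsed fine
print(effective)              # {'backend': 'gridengine'} -- release dropped
```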
[21:41:23] I can send it, sure
[21:41:32] trying to test it on a local webservice copy now
[21:41:36] what's an example before truncation?
[21:42:06] does it always end with a non-dash, before and after truncation?
[21:42:17] does it always have at least one dash to start?
[21:42:36] what language is this?
[21:43:00] ruby, and it will always contain at least one dash
[21:43:04] regex is probably the wrong tool unless the language is awk :)
[21:43:37] the source is a group of hostnames with a count suffix, like 'cluster-member-1' and 'cluster-member-2', and what I want is to isolate the prefix 'cluster-member'
[21:43:57] bd808: if you know how to do it in puppet w/out a regex I'm all ears :)
[21:44:07] perl -lpe 's/-[^-]*$//;'
[21:44:13] idk Ruby
[21:44:17] * andrewbogott tries it
[21:45:48] writing regex on a phone is weird :)
[21:46:02] bd808: https://gerrit.wikimedia.org/r/c/operations/software/tools-webservice/+/762949
[21:46:24] https://www.hudsonvalleyfisheries.com/
[21:46:25] this is farmed. some of the smoked goes for $48/lb at the farmer's market. I don't get how they get people to pay that.
[21:47:22] !log tools.wd-shex-infer deployed ef7416d9c1 (move to Buster grid); I tested one job on this grid manually and it still seemed to work, let’s hope for the best
[21:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wd-shex-infer/SAL
[21:49:33] !log tools.wd-shex-infer deployed 16ff031278 (remove redirect_to_https call)
[21:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wd-shex-infer/SAL
[21:51:19] andrewbogott: In python I would do "-".join("string-with-an-unknown-number-of-dashes".split("-")[:-1]) but yeah, do what you gotta do in puppet/ruby land.
[21:51:57] counting with regex is not a nice thing
[21:52:34] even in python I might just use my regex from above
[21:52:49] it works, btw, thanks jeremy_b!
[21:53:09] sure. I like regexes :)
[21:53:43] I knew someone here would :)
[21:53:51] good bet
[22:47:22] bd808: better to do "string-with-an-unknown-number-of-dashes".rsplit("-", 1)[0] :-)
[22:50:45] !log tools Built new toollabs-webservice 0.81
[22:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:58:06] !log tools `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
[22:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:58:33] @lucaswerkmeister: your fix should be live everywhere now
[22:58:47] cool, thanks! then I’ll remove my local webservice copy again
[22:58:52] * bd808 had not done that build and deploy dance for a long time
[23:06:56] !log tools Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
[23:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:17:37] !log tools Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
[23:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
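For reference, the three dash-truncation approaches traded in the regex thread above, collected into a runnable Python sketch (the Perl one-liner's pattern carries over to re.sub unchanged):

```python
# The three approaches from the discussion, side by side.
import re

s = "string-with-an-unknown-number-of-dashes"
expected = "string-with-an-unknown-number-of"

# jeremy_b's regex (perl -lpe 's/-[^-]*$//;'), translated via re.sub:
assert re.sub(r"-[^-]*$", "", s) == expected

# bd808's split/join version:
assert "-".join(s.split("-")[:-1]) == expected

# the later rsplit refinement (splits at most once, from the right):
assert s.rsplit("-", 1)[0] == expected
```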