[10:37:05] gitlab needs a short maintenance break
[10:43:59] GitLab is back, maintenance finished
[13:30:49] hey, hopefully this is a good place to ask this question (if not, feel free to direct me somewhere else!) -- as someone who occasionally sorts new phabricator tasks into the appropriate projects, should tasks that report 5xx errors returned by Varnish be tagged with #wikimedia-production-error?
[13:30:51] i just want to ask for my own information so i know what's best to do in the future, as i've seen different tasks dealt with in different ways - e.g., T385395 is tagged as a production error, but the tag was removed from T387007.
[13:30:51] T385395: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395
[13:30:52] T387007: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007
[13:45:05] A_smart_kitten: https://phabricator.wikimedia.org/project/profile/1055/ has some overview of what's in scope for the tag, although it does not appear to be followed religiously
[13:52:06] mszabo: thanks for the pointer :) reading what's written there, my first thoughts are that varnish 503 errors (to me) seem to fit the scope of "errors from servers operated by Wikimedia Foundation", but they're then not listed in the bullet points that follow, so (to me reading it) it's a little unclear
[14:12:03] Can someone who actually knows php review https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1122578 ?
[14:15:53] Heads up SRE: had failures during `check_testservers_baremetal-1_of_1` (https://phabricator.wikimedia.org/P73774) when deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123356 - retried 3 times, kept failing. Trying to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaCampaignEvents/+/1123238 now to see how that goes
[14:19:51] TheresNoTime: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1080357
[14:20:43] Amir1: ^
[14:21:07] sigh
[14:21:28] should I revert?
[14:21:30] claime: would you like me to stop deploying?
[14:21:47] TheresNoTime: Nah, it's ok, it's not a major issue I think
[14:21:59] Keep deploying, if that's the only error you get we're fine
[14:22:15] ack, and for what it's worth, trying the other patch has worked so far
[14:25:19] claime: re mesh availability CR, I wouldn't say I know php, but I think I know enough php to +1 that, does that count?
[14:26:12] we have some 500s at the moment in wdqs (got p.a.g.e.d.)
[14:28:21] claime: comment left
[14:28:52] mszabo: ty <3
[14:29:58] vgutierrez ACK, checking now
[14:38:49] vgutierrez still digging, but seeing a lot of errors in blazegraph logs related to MWAPI: https://etherpad.wikimedia.org/p/wdqs-500s
[14:39:42] can someone with root or similar access paste some journal output for me? specifically wmde-analytics-minutely.service on stat1011 (I also asked in -operations but not sure if people will see it there)
[14:39:46] apologies if this isn't the right channel
[14:40:04] Lucas_WMDE np, I can help
[14:40:24] thanks!
[14:43:00] Lucas_WMDE https://etherpad.wikimedia.org/p/wdqs-500s#L13
[14:43:22] hm
[14:43:25] FWIW, do you get anything if you run `journalctl -u wmde-analytics-minutely.service` on stat1011?
[14:43:45] I get no entries, with the usual “you are currently not seeing messages…” hint from journalctl
[14:43:57] pasted it in that etherpad
[14:44:09] ACK, sorry, just making sure
[14:44:11] np
[14:44:16] anyway, I don’t see a lot of useful info in that output :S
[14:44:37] it’s launching a few scripts and there’s no indication which of them is hanging AFAICT…
[14:44:46] I was hoping for an error message or something
[14:44:48] but thanks anyway
[14:44:52] let me see if I can run it manually
[14:45:15] * inflatador wonders if something changed with MWAPI firewall rules or throttling or something?
[14:46:16] I can run sudo -u analytics-wmde cron/minutely.sh /srv/analytics-wmde/graphite/src/scripts at least
[14:46:17] and it ended
[14:46:20] hmm
[14:46:35] yeah it showed up in graphite too
[14:47:26] hm, now graphite shows *two* data points
[14:47:34] even though I only successfully ran the script once
[14:47:55] (I ran it once without sudo, which failed because the config file wasn’t readable, but I would’ve thought this wouldn’t write anything in graphite)
[14:48:03] let’s wait a few minutes and see if it recovers…
[14:48:28] wmde-analytics-minutely.service seems to be looking better again at least
[14:48:53] ratio of failed queries is dropping on WDQS
[14:49:34] at the risk of sounding stupid again: could this be related to the rolled back deployment?
[14:49:52] I know even less than you, so don't worry about saying something stupid ;)
[14:50:07] I confess I didn’t pay any attention to that
[14:50:10] see -operations
[14:50:35] they rolled back and created a ticket
[14:50:59] Only checked one wdqs host so far, but it was spamming MWAPI connection failures up until a minute ago
[14:51:07] Here's what the error rate looked like from our side: https://grafana.wikimedia.org/goto/bnHoE4tNR?orgId=1
[14:51:57] I would be quite surprised if the stack trace in https://phabricator.wikimedia.org/T387461 was related to either the WDQS errors or the failed wmde-analytics-minutely stats
[14:52:20] yes, that... okay, nvm, I did say I might sound stupid '^^
[14:53:00] Does the timeline of the deploy/rollback match the query error graph I posted?
[14:53:45] TheresNoTime: so err, I checked all debug servers, and ran httpbb on prod, and these queries are fine and 200
[14:54:04] The edge cache has a 301 cached for search=foobar but that doesn't really matter
[14:54:22] issue seems to have gone away now :D but was persistent for 3 retries at the time
[14:54:34] I don't understand *why* the test failed on bare metal since that should be the one that actually gets deployed by puppet
[14:54:44] and doesn't need a helmfile apply to change the apache config
[14:54:54] looks like the error rate starts to tick up at ~14:10 UTC and starts dropping around 14:30 UTC
[14:55:18] TheresNoTime: So I'm inclined to file this under general weirdness
[14:56:04] looks like the Wikidata metrics are back to normal FWIW, at least in Grafana (still waiting for one Resolved email)
[14:56:10] no idea what caused it
[15:00:39] I know less about MW than pretty much everyone in the room, so if anyone has any theories LMK. In the meantime, I'll keep investigating/monitoring on the WDQS side
[15:04:23] it sounds like the wdqs issue has stabilized, although we are not yet sure of a cause
[15:05:20] I'd like to move ahead with an increment to external mediawiki traffic enrollment in PHP 8.1 - any objections to doing so while we're still investigating?
[15:08:04] swfrench-wmf agreed re: wdqs... feel free to make the PHP changes
[15:08:59] inflatador: great, thank you
[15:35:32] just a reminder, at 1700 jasmine_ and I will be doing the switchover live test. it'll be noisy but non-disruptive hopefully
[15:40:17] is clicking the "Mark all read" thing on your phab notifications not working for anyone else..? Going to an unread notification task removes that one, but clicking mark all read does nothing o.o - was working earlier today iirc
[15:41:46] TheresNoTime: works for me
[15:42:10] why are computers like this
[17:19:09] We're about to start the live test
[17:25:22] 🍿
[17:59:49] Live test complete. Some odd increases in error rate that we'll look into, but otherwise all looks okay
[18:26:52] Incident report for the WDQS issues earlier is in progress (https://wikitech.wikimedia.org/wiki/Incidents/2025-02-27_wdqs_500_errors). Feel free to add/edit/change anything or just ping me if I missed anything
[23:02:22] no alerts during on-call shift, nothing to report
[23:02:46] and ..afk
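
For anyone retracing the stat1011 debugging from the 14:43-14:46 exchange above, here is a minimal sketch of the two commands discussed; the unit name, user, and paths are copied verbatim from the log, and the relative cron/minutely.sh path assumes the same working directory the operator used (not stated in the log):

    # Inspect recent systemd journal entries for the WMDE analytics unit on stat1011
    # (in the incident above this returned no entries for the affected window)
    journalctl -u wmde-analytics-minutely.service

    # Re-run the minutely stats script by hand as the owning user, as was done in the log;
    # running it without sudo fails because the config file is not readable to the calling user
    sudo -u analytics-wmde cron/minutely.sh /srv/analytics-wmde/graphite/src/scripts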