[01:15:28] Analytics, Analytics-Kanban, Product-Analytics, wmfdata-python: wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (Milimetric) Perfect, thank you for the guidance. I'm reading the docs on the two projects and agree both would work, Impyla seems to ha...
[01:24:16] Analytics, Analytics-Kanban, Product-Analytics, wmfdata-python: wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (nshahquinn-wmf) Sounds good!
[02:05:46] (CR) Neil P. Quinn-WMF: "Congratulations on the adding the first new-style schema! 😊" [analytics/refinery] - https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) (owner: MNeisler)
[02:06:26] (PS2) Andrew Bogott: Added minimal page load test for '/' route [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716558
[02:07:03] (CR) jerkins-bot: [V: -1] Added minimal page load test for '/' route [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716558 (owner: Andrew Bogott)
[02:14:09] (PS3) Andrew Bogott: Added minimal page load test for '/' route [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716558
[02:14:47] (CR) jerkins-bot: [V: -1] Added minimal page load test for '/' route [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716558 (owner: Andrew Bogott)
[02:14:53] (PS4) Andrew Bogott: Added minimal page load test for '/' route [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716558
[02:14:59] (PS1) Andrew Bogott: run test_output.py through Black [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716747
[02:15:39] (CR) jerkins-bot: [V: -1] run test_output.py through Black [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716747 (owner: Andrew Bogott)
[02:15:42] (CR) jerkins-bot: [V: -1] Added minimal page load test for '/' route [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716558 (owner: Andrew Bogott)
[02:17:18] (PS2) Andrew Bogott: run test_output.py through Black [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716747
[02:17:20] (PS5) Andrew Bogott: Added minimal page load test for '/' route [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716558
[02:18:48] (CR) jerkins-bot: [V: -1] Added minimal page load test for '/' route [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716558 (owner: Andrew Bogott)
[02:19:08] (CR) Andrew Bogott: [C: +2] run test_output.py through Black [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716747 (owner: Andrew Bogott)
[02:20:05] (Merged) jenkins-bot: run test_output.py through Black [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716747 (owner: Andrew Bogott)
[02:38:31] (PS6) Andrew Bogott: Added minimal page load test for '/' route [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716558
[02:39:07] (CR) jerkins-bot: [V: -1] Added minimal page load test for '/' route [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716558 (owner: Andrew Bogott)
[02:53:32] Quarry, cloud-services-team (Kanban): Quarry is degraded/partially inaccessible - https://phabricator.wikimedia.org/T290291 (Chlod)
[03:01:03] (PS1) Andrew Bogott: query.py: fix a couple of url_for calls [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716793
[03:02:15] (CR) Andrew Bogott: [C: +2] query.py: fix a couple of url_for calls [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716793 (owner: Andrew Bogott)
[03:02:55] (Merged) jenkins-bot: query.py: fix a couple of url_for calls [analytics/quarry/web] - https://gerrit.wikimedia.org/r/716793 (owner: Andrew Bogott)
[03:05:51] Quarry, cloud-services-team (Kanban): Quarry is degraded/partially inaccessible - https://phabricator.wikimedia.org/T290291 (Chlod) a:Andrew
[03:53:52] Analytics, Data-Engineering, Event-Platform: Discussion of Event Driven Systems - https://phabricator.wikimedia.org/T290203 (Milimetric) >>! In T290203#7329419, @daniel wrote: > I made this doodle of an "event driven mediawiki" architecture a while ago. I had forgotten about this, but listening the "...
[04:18:28] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:31:20] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:33:06] Analytics-Radar, Product-Analytics (Kanban): [REQUEST] Investigate decrease in New Registered Users - https://phabricator.wikimedia.org/T289799 (Tgr) In theory an account gets autocreated for all new users on metawiki and loginwiki (although this is also something that could break in theory, but stewards...
[06:43:32] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (awight) From what I can see, support for producing Schema:EditConflict wa...
[07:29:56] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (awight) I'd like to see some discussion about the `$wgPingback` defaults....
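Aside on the wmfdata-python discussion at [01:15:28] above (T275233): a minimal sketch of what running a Hive query through Impyla might look like, assuming a Kerberized HiveServer2 endpoint; the host name and the query are placeholders, not the actual cluster configuration, and this is not necessarily the approach wmfdata-python will adopt.

```python
from impala.dbapi import connect
from impala.util import as_pandas

# Placeholder HiveServer2 coordinates; auth_mechanism="GSSAPI" assumes Kerberos,
# and kerberos_service_name="hive" targets HiveServer2 rather than Impala.
conn = connect(
    host="hive-server.example.wmnet",
    port=10000,
    auth_mechanism="GSSAPI",
    kerberos_service_name="hive",
)
cursor = conn.cursor()

# Going through a DB-API cursor returns result rows directly, instead of
# scraping CLI output where log lines and query results get mixed together.
cursor.execute("SELECT year, month, day, COUNT(*) AS rows FROM example_db.example_table GROUP BY year, month, day LIMIT 10")
df = as_pandas(cursor)
print(df)
```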
[08:21:01] hnowlan: good morning - I have a request to deploy for aqs new hosts (now that you've merged the scap deploy-list change) - I know it's Friday, and I'd rather have your validation and monitoring - Please :)
[08:29:32] Analytics, Analytics-EventLogging, Event-Platform, Wikidata, and 3 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (Michael)
[08:31:14] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (Michael)
[08:31:48] Analytics, Analytics-EventLogging, Event-Platform, Wikidata, and 3 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (Michael)
[08:36:34] Analytics, Analytics-EventLogging, Event-Platform, Wikidata, and 3 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (Michael)
[08:39:46] Analytics, Analytics-EventLogging, Event-Platform, Wikidata, and 3 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (awight)
[08:40:10] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (awight)
[08:41:06] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (awight)
[08:49:03] I'm running into this new error when running the legacy event schema migration script: > Error: No resolver found for key client_dt.
[08:51:15] joal: if you need a hand I can follow the deployment as well (in theory it should be fine for the new hosts, no impact to the current cluster)
[08:51:39] Hi elukey - thanks a lot for offering :)
[08:52:20] awight_: I'm sorry I have no idea on how to help on this - I think it'd be better to wait for Andrew (he works today I think)
[08:52:41] elukey: Then with your approval, I'll go and deploy aqs code on new hosts
[08:53:15] joal: +1 from me, I guess that they are in a separate scap env right?
[08:53:59] ah no all in the same https://gerrit.wikimedia.org/r/c/analytics/aqs/deploy/+/715995/1/scap/aqs-prod
[08:54:14] it should be safe as long as we're deploying *only* to them :)
[08:54:20] but we can limit the hosts scap looks at
[08:54:27] nope they're not elukey - I need to rely on -limit
[08:54:51] hnowlan: o/ I'll leave it to you sorryyyy I thought you were afk :)
[08:55:19] scap deploy -l aqs1010.eqiad.wmnet aqs1011.eqiad.wmnet aqs1012.eqiad.wmnet ...
[08:55:22] hnowlan: --^
[08:55:29] ?
[08:55:34] didn't know you could pass a list
[08:56:02] but if so then it seems good, maybe start with only one to see if anything explodes
[08:56:09] actually maybe it's better with: scap deploy -l aqs101[012345].eqiad.wmnet
[08:56:51] elukey: aqs1010 being canary, I'll go with the above, check on aqs1010 when ready (if working), and then stop or proceed for the rest
[08:57:01] hnowlan (as well) --^
[08:59:06] joal: does it work even with -l ?
[08:59:17] hm - good question!
[08:59:24] I thought that the canary thing was only for a "Regular" scap deploy
[08:59:29] let's force a manual single deploy to aqs1010 then :)
[08:59:49] I think -l takes a regex
[08:59:54] this is why I was wondering about a separate scap env (like we have in refinery for hadoop-test etc..)
[09:00:26] yeah maybe a separate env makes sense even for the short term
[09:00:47] at the end we drop the old one and that's it
[09:01:14] do you folks prefer me to wait and do it when the new env is ready?
[09:01:46] joal: that'd probably be best, writing the change now
[09:01:53] hnowlan: <3
[09:02:01] ack hnowlan - thanks a lot for following up :)
[09:02:11] thanks elukey as well <3
[09:05:21] (PS1) Joal: Fix mediarequest top cassandra3 loading jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/717174 (https://phabricator.wikimedia.org/T290068)
[09:07:09] (CR) Joal: [V: +2 C: +2] "Self merging for hotfix" [analytics/refinery] - https://gerrit.wikimedia.org/r/717174 (https://phabricator.wikimedia.org/T290068) (owner: Joal)
[09:07:47] !log Deploying refinery to hotfix mediarequest cassandra3 loading jobs
[09:07:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:12:05] (PS1) Hnowlan: Move new aqs hosts to aqs-next for test deploys [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/717181
[09:12:22] !log Rerun mediawiki-history-denormalize-wf-2021-08 after failure
[09:12:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:14:13] (CR) Joal: [C: +1] "LGTM but I don't know scap config syntax, so someone else should review :)" [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/717181 (owner: Hnowlan)
[09:18:21] (PS2) Jgiannelos: Map tile state change event schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771)
[09:19:39] (CR) Jgiannelos: Map tile state change event schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[09:20:36] (CR) Jgiannelos: Map tile state change event schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[09:38:17] (CR) Elukey: [C: +1] "LGTM!" [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/717181 (owner: Hnowlan)
[09:44:15] (CR) Hnowlan: [V: +2 C: +2] Move new aqs hosts to aqs-next for test deploys [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/717181 (owner: Hnowlan)
[09:45:54] !log Kill-restart mediarequest-top cassandra loading jobs after deploy
[09:45:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:53:45] hnowlan: I assume you've merged the patch and that means I can now deploy using the new environment?
[09:56:09] joal: it only needs another git pull on deploy1002 and then you are set
[09:56:14] scap deploy -e aqs-next
[09:56:47] ok I'll test that - I'm always in favor of getting a +1 from an SRE before deploying :)
[09:56:50] thanks elukey
[09:57:23] !log Deploy AQS on new AQS servers
[09:57:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:58:14] oh heh I am deploying it right now :)
[09:58:14] Ah! actually hnowlan is already doing it :) thanks a lot hnowlan :)
[09:59:20] ahh deploy is failing possibly because of empty data
[09:59:22] Check 'endpoints' failed: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 404 (expecting: 200); /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views
[09:59:28] returned the unexpected status 404 (expecting: 200);
[09:59:48] hnowlan: I think we miss the test data in the table - let me fix that
[10:00:37] ahh ok
[10:00:54] hnowlan: done - can you please retry?
[10:01:39] joal: ack, doing it now
[10:02:56] joal: just one failure now (only on non-canary host interestingly): "Check 'endpoints' failed: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 404 (expecting: 200)"
[10:03:48] hm - can be related to non-deterministic results from cassandra
[10:04:22] I've been hitting this every now and then during my tests as the cluster is a bit under pressure (meaning, no result given by table while result exists)
[10:04:30] aha
[10:04:45] trying again
[10:05:13] deploy was successful on the same host but failed on another
[10:05:42] right - I'm gonna stop my test, it should make the cluster a lot more stable
[10:06:25] there still is some read pressure AFAICS, but it's not me this time :)
[10:07:28] heh
[10:08:39] Ah no my bad - aqs1010 is fine - the old cluster receives pressure
[10:19:25] hnowlan: shall I try anew?
[10:20:17] joal: hold for a sec, trying to debug what's going on atm
[10:20:20] deploys are still failing
[10:20:24] ack hnowlan
[10:20:27] but usually for a single host
[10:20:34] hnowlan: let me know if there is anything I can help with
[10:20:41] I assume this isn't an old pattern :)
[10:21:01] hnowlan: it has not happened for us this way before, I don't think
[10:25:00] In a more confusing development, none of the new hosts are logging to logstash
[10:27:15] hnowlan: I found AQS logs in logstash!
[10:29:18] joal: Oh? For the new hosts?
[10:29:34] yessir!
[10:30:05] hnowlan: with filter service.type = aqs
[10:30:23] I see the warn events of the restart after deploy
[10:34:59] hnowlan: it seems that the test data I inserted has not been written - when querying manually for it I get no result
[10:35:16] That's weird, cause I definitely imported the whole lot
[10:35:57] hnowlan: do you wish I insert it anew?
[10:36:17] (PS3) Jgiannelos: Map tile state change event schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771)
[10:37:26] joal: please do
[10:37:43] joal: I have manually deployed the latest aqs on all hosts but ^that sounds very worrying
[10:38:49] done hnowlan - data is readable now
[10:39:08] It's as if my insertion had worked for most but not all rows :(
[10:41:55] hnowlan: I confirm your deploy has worked on aqs1010 - my problem there is fixed
[10:43:40] joal: cool - all hosts should be running the same version now
[10:44:07] thanks a lot for making this happen - do you wish I try a deploy to see if my data insertion has fixed the deploy issue?
[10:44:20] what consistency are you writing the example data with?
[10:44:30] default, meaning local I assume
[10:44:39] joal: couldn't hurt to try again
[10:44:47] hnowlan: trying now
[10:46:13] wow - interesting - when there is an -e specified in scap but the env doesn't exist, scap defaults to the default one - I think it should error
[10:46:24] eeek
[10:46:25] yeah
[10:47:18] hnowlan: deploy successful - man - what a mess - sorry for that :S
[10:47:57] Well that's at least some comfort :)
[10:48:21] but the fact the data write might not have worked the first time is real worrying
[10:50:02] agreed hnowlan - particularly without any type of error
[10:50:31] hnowlan: maybe asking for consistency quorum would be safer when writing
[10:50:39] It'd cost more, but it'd be safer
[10:51:55] I'm worried if the lack of error means some larger hidden issue with the new cluster - it seems a bit overly paranoid but those writes *should* have propagated quite quickly
[10:52:02] with local consistency
[10:52:16] yes - agreed
[10:52:32] not like that cluster is busy
[10:53:09] hnowlan: it was at the time I wrote the data - I was also doing some querying at the same time (nothing huge though)
[10:53:28] even so, it should be well able to handle both...
[10:53:36] I'll see if eric has any insight on potential risks/verifications
[10:54:07] ack
[10:54:15] thank you for the help hnowlan :)
[10:54:17] is there an easy way to sample stuff from the hourly jobs so that we could maybe check every instance soon after import for the presence of the data?
[10:54:23] no worries, hope we can get this right! :)
[10:54:37] hnowlan: I have some script doing exactly that
[10:55:53] ah nice!
[10:58:53] has that script seen any inconsistency so far?
[11:01:37] hnowlan: I test 10M rows on pageview_per_articles and mediarequest_per_file --> ~10 inconsistencies from old cluster (no res in old) with pageviews, none in mediarequest
[11:02:11] And I triple checked manually on old cluster the inconsistent rows --> manual query was returning data
[11:02:21] I think we're ok
[11:03:16] Now I'm experiencing an interesting one - I have reloaded some error-data, and cassandra still sees the old one
[11:07:48] I wonder how long a write will take when consistency is set to ALL
[11:12:50] I don't know
[11:29:32] actually my last concern was not true - there was yet another issue at load time, cassandra was doing its job correctly - trying to fix
[11:39:02] I finally nailed it :)
[11:42:22] (PS1) Joal: Fix mediarequest top cassandra3 loading jobs fix [analytics/refinery] - https://gerrit.wikimedia.org/r/717291 (https://phabricator.wikimedia.org/T290068)
[11:42:39] (CR) Joal: [V: +2 C: +2] "Merging for hotfix" [analytics/refinery] - https://gerrit.wikimedia.org/r/717291 (https://phabricator.wikimedia.org/T290068) (owner: Joal)
[11:43:18] !log Deploying refinery to hotfix mediarequest cassandra3 loading jobs (second)
[11:43:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:14:18] !log Kill-restart mediarequest-top cassandra loading jobs after deploy (bis)
[13:19:46] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (Ottomata)
[13:20:29] Analytics, Analytics-EventLogging, Analytics-Kanban, Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (Ottomata)
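Aside on the write-consistency question discussed above (around [10:44:20]–[10:52:02]): with the DataStax Python driver, the consistency level can be set per statement, so test rows could be written at QUORUM and read back before the endpoint checks run, instead of relying on the default. A minimal sketch; the contact points are the new AQS hosts mentioned above, but the keyspace, table, and columns are invented for illustration and do not match the real AQS schema or auth settings.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Keyspace and table are placeholders, not the real AQS schema.
cluster = Cluster(["aqs1010.eqiad.wmnet", "aqs1011.eqiad.wmnet"])
session = cluster.connect("aqs_test")

# Write the test row at QUORUM so a majority of replicas must acknowledge it,
# rather than relying on the driver default (LOCAL_ONE unless overridden).
# ConsistencyLevel.ALL is stricter still, at the cost of latency and availability.
insert = SimpleStatement(
    "INSERT INTO pageviews_test (project, article, dt, views) VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, ("en.wikipedia", "Test_article", "2021090100", 1))

# Read the row back, also at QUORUM, to confirm the write actually propagated.
select = SimpleStatement(
    "SELECT views FROM pageviews_test WHERE project = %s AND article = %s AND dt = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(select, ("en.wikipedia", "Test_article", "2021090100")).one()
print("test row present:", row is not None)
```

The same read-back pattern, pointed at both the old and new clusters, is roughly what the sampling script described above ("I test 10M rows on pageview_per_articles and mediarequest_per_file") would do at larger scale.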
[13:21:45] Quarry: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (mdipietro) This may be related to fawiki_p, seems to leave jobs queue or running even when they are short. If the job is queued the stop function will likely fail, as it won't find a job to stop. We...
[13:37:08] (CR) Ottomata: "Great, a couple of nits, but LGTM otherwise!" [schemas/event/primary] - https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[13:43:04] Quarry, cloud-services-team (Kanban): Quarry is degraded/partially inaccessible - https://phabricator.wikimedia.org/T290291 (Andrew) This is probably fixed by https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/716793 but it might be only partial -- can you retest?
[13:55:36] (CR) Mholloway: Map tile state change event schema (2 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[13:58:36] Quarry: quarry giving deprecated notices - https://phabricator.wikimedia.org/T289871 (mdipietro) Open→Resolved
[13:59:22] Quarry: quarry giving deprecated notices - https://phabricator.wikimedia.org/T289871 (mdipietro) The notice appears to have vanished with a separate commit. Closing. Reopen if notice reappears.
[13:59:34] Quarry: quarry giving deprecated notices - https://phabricator.wikimedia.org/T289871 (mdipietro) a:mdipietro
[14:04:17] (CR) Mholloway: Map tile state change event schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[14:08:26] (CR) Mholloway: Map tile state change event schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[14:10:16] Quarry, cloud-services-team (Kanban): Quarry is degraded/partially inaccessible - https://phabricator.wikimedia.org/T290291 (Chlod) I tried executing four queries to test the changes ([[ https://quarry.wmcloud.org/query/56472 | 56472 ]], [[ https://quarry.wmcloud.org/query/48083 | 48083 ]], [[ https://qu...
[14:54:32] Quarry, cloud-services-team (Kanban): Quarry is degraded/partially inaccessible - https://phabricator.wikimedia.org/T290291 (Andrew) I'm sorry this is misbehaving. I just tried re-running one of your queries and it worked: https://quarry.wmcloud.org/query/58317 This has me still thinking that this is...
[15:02:44] Quarry: celery version six preparation - https://phabricator.wikimedia.org/T290328 (mdipietro)
[15:22:05] Quarry, cloud-services-team (Kanban): Quarry is degraded/partially inaccessible - https://phabricator.wikimedia.org/T290291 (Andrew) Current theory is that this is related to database timeouts, which we adjust shortly
[15:22:41] (PS1) Michael DiPietro: update config to match for celery 6 [analytics/quarry/web] - https://gerrit.wikimedia.org/r/717443 (https://phabricator.wikimedia.org/T290328)
[16:30:10] Hi ottomata: I have not rerun the failed refine job :)
[17:15:11] joal: huh the data looks present and fine
[17:15:24] weird ottomata
[17:15:37] i reran but it said no data needed refinement
[17:15:49] mforns: see email i just sent about monitor refine sanitize
[17:16:03] the reason for the alerts is that delayed is backfilling https://gerrit.wikimedia.org/r/c/analytics/refinery/+/713570
[17:35:17] Quarry: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (Huji) I did not understand about half of what you said! You are clearly the expert, so I defer to you on how to handle this.
[17:43:05] Quarry, cloud-services-team (Kanban): Quarry is degraded/partially inaccessible - https://phabricator.wikimedia.org/T290291 (Andrew) Hello again @Chlod . We've adjusted some timeouts which were probably the cause of the queued/running-forever issue. Since those are orphaned queries now they will probabl...
[17:56:52] ottomata: thanks for launching the backfilling!
[17:57:13] ottomata: still it doesn't make sense to me that it was failing, though...
[17:57:59] ottomata and joal: it was me who re-ran the refine job, sorry for not remembering to respond to the email
[17:58:36] looking into the mediawiki-history-denormalized error
[18:45:03] oof, IRC had signed me out and I didn't notice
[18:45:17] did anyone look at mw history? It failed again after jo reran it
[18:45:50] (CR) Mforns: "@Neil, thanks for your privacy analysis! :]" [analytics/refinery] - https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) (owner: MNeisler)
[18:49:27] mforns: oh! you just reran it (mw history). Jo had rerun it this morning, did you find out what was wrong?
[18:49:54] oh! milimetric... no, I thought it was the first time...
[18:50:02] sorry
[18:50:12] I was just searching and it said "JA018" means "output directory exists" so I was thinking maybe he forgot to delete the failed output and it failed when rerunning
[18:50:17] np at all, glad I caught you
[18:50:27] ok, so do you have the logs from this last failure?
[18:50:53] not yet
[18:51:39] bc real quick?
[18:52:49] omw milimetric
[19:00:49] Quarry: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (MarioGom) I'm experiencing the same issue with enwiki_p. I have one job stuck in "running", another stuck in "queued", and stop button gives error 500 for both of them.
[19:22:12] Quarry, cloud-services-team (Kanban): Quarry is degraded/partially inaccessible - https://phabricator.wikimedia.org/T290291 (Bstorm)
[19:39:22] Analytics, Event-Platform, Metrics-Platform, Patch-For-Review: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Mholloway) Here's a pleasant PHP surprise: associative arrays in PHP //[[ https://www.php.net/manual/en/language.types.a...
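Aside on the EventStreamConfig ordering point just above (T277193): PHP associative arrays preserve insertion order, which is what makes "later, more specific entries override earlier ones" workable for per-wiki overrides. The same idea illustrated in Python, whose dicts are also insertion-ordered; the stream names, the `@wiki` key convention, and the settings shown are invented for illustration and are not the real wgEventStreams shape.

```python
# Hypothetical stream config: a default entry first, then a per-wiki override.
stream_config = {
    "mediawiki.edit_attempt": {"sample_rate": 0.01},        # default for all wikis
    "mediawiki.edit_attempt@cswiki": {"sample_rate": 1.0},  # override for cswiki only
}

def settings_for(stream: str, wiki: str) -> dict:
    """Merge matching entries in declaration order, so later entries win."""
    merged: dict = {}
    for key, settings in stream_config.items():  # iterates in insertion order
        name, _, wiki_suffix = key.partition("@")
        if name == stream and wiki_suffix in ("", wiki):
            merged.update(settings)
    return merged

print(settings_for("mediawiki.edit_attempt", "cswiki"))  # {'sample_rate': 1.0}
print(settings_for("mediawiki.edit_attempt", "enwiki"))  # {'sample_rate': 0.01}
```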
[20:33:32] Quarry: Close quarry db autocompletion on tab - https://phabricator.wikimedia.org/T289872 (mdipietro) a:mdipietro
[20:37:30] Analytics, Event-Platform, Metrics-Platform, Patch-For-Review: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Mholloway) To expand on my dislike of the reliance on ordering: Before I realized that PHP associative arrays were order...
[20:39:55] (PS1) Michael DiPietro: close quarry db dropdown on tab [analytics/quarry/web] - https://gerrit.wikimedia.org/r/717592 (https://phabricator.wikimedia.org/T289872)
[20:42:21] Analytics, Event-Platform, Metrics-Platform, Patch-For-Review: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Ottomata) Nice!
[20:46:57] Quarry: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (mdipietro) I suspect this is a combination of a new problem and an old problem. The new problem is that the stop function doesn't consider a job in the "queued" status, it needs different logic than...
[20:54:06] (CR) Andrew Bogott: [C: +1] close quarry db dropdown on tab [analytics/quarry/web] - https://gerrit.wikimedia.org/r/717592 (https://phabricator.wikimedia.org/T289872) (owner: Michael DiPietro)
[20:59:31] (CR) Bstorm: [C: +1] "Seems to work great locally." [analytics/quarry/web] - https://gerrit.wikimedia.org/r/717443 (https://phabricator.wikimedia.org/T290328) (owner: Michael DiPietro)
[21:09:36] Analytics, Data-Engineering, FR-Tech-Analytics, Privacy Engineering: event.WikipediaPortal referer modification - https://phabricator.wikimedia.org/T279952 (sguebo_WMF) >>! In T279952#7329494, @mforns wrote: > @sguebo_WMF & @EYener, we discussed this task and will go ahead and implement this feat...
[21:40:16] (PS1) Clare Ming: POC: add new stream for VectorPrefDiffInstrumentation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/717622 (https://phabricator.wikimedia.org/T289622)
[21:41:22] (CR) jerkins-bot: [V: -1] POC: add new stream for VectorPrefDiffInstrumentation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/717622 (https://phabricator.wikimedia.org/T289622) (owner: Clare Ming)
[22:08:10] (PS2) Clare Ming: POC: add new stream for VectorPrefDiffInstrumentation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/717622 (https://phabricator.wikimedia.org/T289622)
[22:08:19] (CR) jerkins-bot: [V: -1] POC: add new stream for VectorPrefDiffInstrumentation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/717622 (https://phabricator.wikimedia.org/T289622) (owner: Clare Ming)
[22:13:22] (PS3) Clare Ming: POC: add new stream for VectorPrefDiffInstrumentation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/717622 (https://phabricator.wikimedia.org/T289622)
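Aside on the stop-button discussion above (T290146, [20:46:57]): one way to handle both states is to branch on the recorded status: a job that is still queued can be revoked before it ever runs, while a running job needs `revoke(terminate=True)` so the worker process is actually signalled. A rough sketch using Celery's control API with invented names (the broker URL, status constants, and `query_run` object are hypothetical); this is not the actual Quarry code.

```python
from celery import Celery

app = Celery("quarry", broker="redis://localhost:6379/0")  # placeholder broker URL

# Hypothetical status values; Quarry's real model and constants differ.
STATUS_QUEUED, STATUS_RUNNING = "queued", "running"

def stop_query_run(query_run) -> str:
    """Stop a run whether it is still waiting in the queue or already executing."""
    if query_run.status == STATUS_QUEUED:
        # Revoking a queued task tells workers to discard it when they pick it up;
        # there is no live process to terminate yet.
        app.control.revoke(query_run.task_id)
        return "revoked before start"
    if query_run.status == STATUS_RUNNING:
        # terminate=True additionally signals the worker process currently
        # executing the task.
        app.control.revoke(query_run.task_id, terminate=True)
        return "terminated"
    # Anything else (complete, failed, already stopped) should not 500 — just no-op.
    return "nothing to stop"
```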