[04:55:04] Analytics-Radar, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install an-presto10[06-15] - https://phabricator.wikimedia.org/T290987 (wiki_willy) Open→Declined Resolving this racking task, since the project has been pushed back to Q3.
[05:03:52] (PS4) Ladsgroup: Add script to get some data out of wb_changes [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/723613 (https://phabricator.wikimedia.org/T291276)
[05:08:32] (PS5) Ladsgroup: Add script to get some data out of wb_changes [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/723613 (https://phabricator.wikimedia.org/T291276)
[05:08:35] (CR) Ladsgroup: Add script to get some data out of wb_changes (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/723613 (https://phabricator.wikimedia.org/T291276) (owner: Ladsgroup)
[05:08:46] (CR) Ladsgroup: "tested in stat1007. Works fine." [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/723613 (https://phabricator.wikimedia.org/T291276) (owner: Ladsgroup)
[06:28:31] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:36:55] Analytics, Data-Engineering: Analytics-test-hadoop Spark3 package upgrade - https://phabricator.wikimedia.org/T291465 (JAllemandou) Question on this decision: This seems to force the upgrade to spark3 to be part of the bigger upgrade to BigTop3, that is altogether a quite big beast (hadoop2->3, most noti...
[06:38:42] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (JAllemandou) >>! In T291472#7391810, @BTullis wrote: > The first attempt at loading failed with an out-of-memory error. Mwa...
[06:39:37] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:55:41] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) `/usr/bin/sstableloader` is a shell script and these are the last 10 lines. ` btullis@aqs1010:~$ tail /usr/bin/sst...
[09:11:38] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) That has worked, so loading is now under way. The `org.apache.cassandra.tools.BulkLoader` task is using 4.5 GB of R...
[10:50:12] Analytics, Analytics-Kanban, Data-Engineering: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) I've been exploring how to verify categorically from the logs that the repair is successful. We can see here that the number of times `ful...
[10:54:17] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (JAllemandou) \o/ Thanks a lot for this finding @BTullis :)
[12:02:48] Analytics-Radar, Event-Platform, Wikibase change dispatching scripts to jobs, Wikimedia-JobQueue, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (jcrespo) I've noticed that since around 16:02-16:06 yesterday, there is a large amount (50% increase in logge...
[12:04:26] Analytics-Radar, Event-Platform, Wikibase change dispatching scripts to jobs, Wikimedia-JobQueue, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (Ladsgroup) We haven't deployed anything related to this ticket yet but it's probably something with Elastic....
[12:05:56] Analytics-Radar, Event-Platform, Wikibase change dispatching scripts to jobs, Wikimedia-JobQueue, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (jcrespo) p:Triage→High Setting to high because almost half a million extra errors per hour. I will re...
[12:14:50] Analytics-Radar, Event-Platform, Wikibase change dispatching scripts to jobs, Wikimedia-JobQueue, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (Ladsgroup) I'm not sure if these are related. It's a problem for sure but we haven't deployed anything relate...
[12:19:42] Analytics-Radar, Event-Platform, Wikibase change dispatching scripts to jobs, Wikimedia-JobQueue, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (jcrespo) > I suggest creating a dedicated ticket and tagging search platform team. Ok.
[12:19:57] Analytics-Radar, Event-Platform, Wikibase change dispatching scripts to jobs, Wikimedia-JobQueue, and 2 others: Queuing jobs is extremely slow - https://phabricator.wikimedia.org/T292048 (jcrespo) p:High→Triage
[13:10:27] team: slack is down for me today (it seems to be for a specific ISP in France, obviously the one I use)
[13:44:42] (CR) Michael Große: [C: +2] Add script to get some data out of wb_changes (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/723613 (https://phabricator.wikimedia.org/T291276) (owner: Ladsgroup)
[13:46:51] (Merged) jenkins-bot: Add script to get some data out of wb_changes [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/723613 (https://phabricator.wikimedia.org/T291276) (owner: Ladsgroup)
[13:57:45] btullis: quick heads-up - have you checked the refine-failure alert email received today around 1am UTC?
[14:01:00] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Push Gobblin import metrics to Prometheus and add alerts on some critical imports - https://phabricator.wikimedia.org/T286503 (JAllemandou) I have added some findings about how Gobblin generates/defines...
[15:03:23] Hi joal. No, not yet. Sorry.
[15:04:11] I will look at it now.
[15:10:27] Checking the destination directory on HDFS before re-running the job.
[15:10:31] https://www.irccloud.com/pastebin/0jfEQIuq/
[15:11:44] Re-running the job with the parameters given in the email.
[15:11:50] !log sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='editoractivation' --since='2021-09-29T22:00:00.000Z' --until='2021-09-30T23:00:00.000Z'
[15:11:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:16:04] Hmm. Job is shown as succeeded, but I still have a _REFINE_FAILED flag in that directory.
https://yarn.wikimedia.org/cluster/app/application_1632476005296_28611/
[15:16:18] https://www.irccloud.com/pastebin/Wl59HWiu/
[15:19:29] I have the same error that was originally reported in the email: `Original exception: java.lang.RuntimeException: Could not extract /$schema field from event, field does not exist` - Not sure how to proceed with this.
[15:21:36] there should be an option to run the script dropping the events that don't validate, but it is better to quickly check via pyspark the content of the hour that fails to visualize the data and see if it is due to a few records or something more horrible :D
[15:22:14] I added a note in https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Spark about how to read raw files
[15:24:13] Thanks elukey - Actually, now I look at it, the error is not precisely the same.
[15:24:24] I'll check out how to read it via spark.
[15:24:47] my code is probably a horrible way to do it but it worked IIRC :D
[15:27:49] (PS1) Bearloga: Movement Metrics: update main.sh [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/725334 (https://phabricator.wikimedia.org/T291958)
[15:28:44] (CR) Bearloga: [V: +2 C: +2] Movement Metrics: update main.sh [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/725334 (https://phabricator.wikimedia.org/T291958) (owner: Bearloga)
[15:30:13] I'm fumbling around in the spark2-shell command with that code, but I don't really know what I'm doing. Sorry.
[15:31:35] I tried this, based on the raw source directory: `val rdd = sc.sequenceFile[org.apache.hadoop.io.LongWritable,String]("/wmf/data/raw/eventlogging_legacy/eventlogging_EditorActivation/year=2021/month=09/day=30/hour=21")`
[15:32:02] But then `rdd.map(x => x._2).take(100)` told me that none of the files was a sequence file.
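Editorial sketch, not part of the original conversation: the refine error above complains that the `/$schema` field cannot be extracted, and the `sequenceFile` read fails because the files are not SequenceFiles. If the raw files turn out to be gzip-compressed text with one JSON event per line (an assumption; the log does not show the file contents), a quick local check on a copied file could look roughly like the following. The function name `find_events_missing_schema`, the sample events, and the schema URI in them are all hypothetical and only illustrate the idea of counting records that lack `$schema`.

```python
import gzip
import io
import json


def find_events_missing_schema(gz_bytes):
    """Return (line_number, raw_line) pairs for lines without a "$schema" key.

    Lines that are not valid JSON objects are reported as well, since they
    could not carry a "$schema" field either.
    """
    missing = []
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                missing.append((line_number, line))
                continue
            if not isinstance(event, dict) or "$schema" not in event:
                missing.append((line_number, line))
    return missing


# Hypothetical sample standing in for one raw .txt.gz file (assumed layout).
sample_lines = [
    '{"$schema": "/hypothetical/schema/1.0.0", "event": {}}',
    '{"event": {"note": "no schema field"}}',
    "not json at all",
]
sample_gz = gzip.compress("\n".join(sample_lines).encode("utf-8"))

bad = find_events_missing_schema(sample_gz)
print(len(bad), "of", len(sample_lines), "lines lack $schema")
for line_number, raw in bad:
    print(line_number, raw)
```

For a real file, the bytes would come from a local copy (for example `open(path, "rb").read()`); whether only a handful of records or the whole hour is affected then decides between dropping the bad events and digging further.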
[15:33:10] (CR) Bearloga: ETL test notebook (1 comment) [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/724469 (https://phabricator.wikimedia.org/T291958) (owner: Bearloga)
[15:35:31] btullis: ah yes inside "hour=21" there are multiple files
[15:35:50] mmmm
[15:35:53] weird though lemme check
[15:36:49] Apologies, but I have to be afk for a while now.
[15:38:29] part.task_eventlogging_legacy_1633036214387_86_1.txt.gz
[15:38:32] this is weird :D
[15:42:42] or better I don't recall .txt.gz files
[15:47:04] ah perfect it can be read via zless, nothing special
[16:23:43] Analytics, Product-Analytics, Editing-team (Tracking): Add MariaDB replicas to Superset - https://phabricator.wikimedia.org/T291195 (ppelberg) >>! In T291195#7380492, @Milimetric wrote: >>>! In T291195#7363517, @ppelberg wrote: >> @Milimetric: do you have a sense for whether it would be realistic for...
[16:46:58] Analytics-Clusters, SRE, ops-eqiad: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (Cmjohnson) Replaced the cable but still don't have access, this server will require me to power it off and drain flea power. That has been the standard fix for...
[17:13:10] I'm back now, in case there is anything more that I can do to help with this refine issue.
[17:39:49] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (Cmjohnson) @robh confirmed, they only have 2 disks. I'm not sure what the next step is for them
[17:46:42] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (RobH) So I'm now reviewing the entire purchase history of this request. T286517 was filed, for config C-1G which is only 2...
[17:49:21] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (Cmjohnson) thanks! @RobH
[17:49:43] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (RobH)
[17:52:18] (PS1) Nettrom: Update documentation for anonymous_user_token [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725351 (https://phabricator.wikimedia.org/T292209)
[18:06:06] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, and 2 others: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by robh@cumin1001 for host an-db1001.eqiad.wmnet
[18:06:43] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, and 2 others: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (RobH) Open→In progress
[18:07:20] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, and 2 others: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by robh@cumin1001 for host an-db1001.eqiad.wmnet exe...
[18:08:12] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, and 2 others: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-db1001.eqiad.wmn...
[18:10:11] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (RobH)
[18:40:37] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-db1001.eqiad.wm...
[19:07:43] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-db1001.eqiad.wmnet', 'an-db1002.eqiad.wmnet'] ` and were **ALL...
[19:47:58] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (RobH)
[19:48:21] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (RobH) In progress→Resolved These are now ready for use!
[21:09:44] Analytics-Radar, Product-Analytics (Kanban): [REQUEST] Investigate decrease in New Registered Users - https://phabricator.wikimedia.org/T289799 (mpopov) > New editors who have some amount of edits (say 10+ in total) and aren't blocked would be a good noise-free signal for human activity. I think it coul...
[22:10:47] (CR) Bartosz Dziewoński: [C: +2] "I assume we don't need to bump the version for a documentation-only change."
[schemas/event/secondary] - https://gerrit.wikimedia.org/r/725018 (https://phabricator.wikimedia.org/T290931) (owner: DLynch)
[22:11:56] (Merged) jenkins-bot: EditAttemptStep: update init_timing documentation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725018 (https://phabricator.wikimedia.org/T290931) (owner: DLynch)
[22:18:58] (PS1) Gergő Tisza: Add structured_task/article/image_suggestion_interaction/1.0.0 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725383
[22:26:01] (CR) DLynch: EditAttemptStep: update init_timing documentation (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725018 (https://phabricator.wikimedia.org/T290931) (owner: DLynch)