[02:23:23] !log deployed refinery with regular train
[02:23:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[02:23:48] AndyRussG: yes, that is how I always understood dt (and ts)
[02:24:10] milimetric: hi! cool beans, thanks much!
[02:24:19] (and it's how I make sense of time_firstbyte as well, basically the response starts after dt + time_firstbyte
[02:24:21] )
[02:24:26] hi :)
[02:24:43] cool yeah I felt that was the most coherent explanation
[02:24:52] ehhh how's it going?
[02:30:17] good :) I was going to bed and hopped out just to deploy. Now I'm on some youtube rabbit hole watching electric car reviews. You?
[02:37:58] heh also generally good, thanks! just about to head out for an errand... then do a bit more worky work, and do some housework in preparation for my kids arriving tomorrow, thus starting the kid-ful half of the week :)
[02:39:18] milimetric: really nice to hear from u!! thanks for the help, hope to talk again soon... have fun with the car vids :)
[02:40:16] tl;dr; get a Tesla Model 3 even if you hate Musk :P
[04:26:56] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:28:40] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:25:49] Good morning
[08:46:55] dcausse: hello :)
[08:47:06] bonjour!
[08:47:28] dcausse: Do you know who I should talk to in regard of the test_search_satisfaction_hourly in druid?
[08:47:51] dcausse: I can give more context :)
[08:48:13] joal: I guess it's us :)
[08:48:27] dcausse: I have found this dataset in our druid-public cluster instance, where it probably shouldn't belong, and I was wondering how it ended up there :)
[08:50:02] joal: I think it mainly contains aggregated data but I don't think we ever planned to make that data public
[08:50:36] dcausse: druid-public is not public directly, it is behind the AQS frontend, so no data leak in any case
[08:51:20] looking at the code base but I remember that Erik made few experiments with druid but not sure if it's actually used or not
[08:51:44] dcausse: shall I wait for Erik and talk with him?
[08:52:03] joal: Erik will have a lot more context for sure
[08:52:12] ok let's wait dcausse :)
[08:52:14] thank you :)
[08:52:17] np!
[08:55:48] joal: the related job is failing on our side for more than one month :/
[08:56:01] dcausse: mwarf!
[08:56:03] HTTPConnectionPool(host='druid1002.eqiad.wmnet', port=8090): Max retries exceeded with url: /druid/indexer/v1/task (Caused by NewConnectionError(': Failed to establish a new connection: [Errno -2] Name or service not known'))
[08:56:58] Ah! I think i know
[08:58:19] dcausse: the datasource on public is from this morning from what I can see
[08:59:58] joal: does it have data?
[09:00:17] dcausse: our druid instances in the internal cluster are named an-druid100X IIRC
[09:01:50] yeah - we have an-druid100[12345] and druid100[45678] for the public one
[09:02:04] ok that explains I think
[09:02:06] dcausse: small data, but some data!
[09:02:47] so switch from druid1* to an-druid1* should be enough?
[09:03:06] dcausse: IIRC we renamed our druid internal instances to mimic our other workers when the cluster got bigger (or maybe another reason I can't recall)
[09:03:11] dcausse: I think it would yes!
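Editor's sketch of the hostname failure discussed above: the job was still posting to druid1002.eqiad.wmnet, which no longer resolves after the internal hosts were renamed to an-druid100X. This is a minimal illustration, not the actual job code. The function names, the candidate list, and the `.eqiad.wmnet` suffix on the an-druid host are assumptions; only the port and the task path come from the error message.

```python
import socket

# Hostnames from the discussion above. The ".eqiad.wmnet" suffix on the
# an-druid host and the order of the list are assumptions.
CANDIDATE_HOSTS = ["druid1002.eqiad.wmnet", "an-druid1004.eqiad.wmnet"]
OVERLORD_PORT = 8090  # port seen in the failing URL


def first_resolvable(hosts, port=OVERLORD_PORT, resolver=socket.getaddrinfo):
    """Return the first hostname that resolves in DNS, or None if none do."""
    for host in hosts:
        try:
            resolver(host, port)
        except socket.gaierror:
            # Same failure class as "[Errno -2] Name or service not known".
            continue
        return host
    return None


def task_url(host, port=OVERLORD_PORT):
    """Build the indexing task URL in the shape seen in the error message."""
    return f"http://{host}:{port}/druid/indexer/v1/task"
```

Checking resolution before submitting turns a long "Max retries exceeded" failure into an immediate, readable one. The `resolver` parameter exists only so the lookup can be replaced in tests.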
[09:03:53] dcausse: I'll wait for a chat with Erik on dropping the public cluster data
[09:04:34] btullis: hello :)
[09:04:56] btullis: Would you have a minute for me, I wanted to show you something on the cassandra front
[09:05:17] btullis: Oh and by the way - thank you a lot for the fast turnaround on cassandra2 snapshots deletion!
[09:06:35] joal: sure, thanks for the ping!
[09:38:08] ah I think I know what happened, Erik must have fixed the druid host "druid1002 -> druid1004", perhaps druid1002 was private previously?
[09:38:29] it seems it was done 2 days ago
[09:52:29] joal: Hiya. You're welcome. Care to chat in the batcave?
[09:52:40] hi btullis - sure, joining
[09:53:05] dcausse: indeed druid1002 was private, now is an-druid100X :(
[09:53:11] dcausse: sorry for the problem
[09:53:50] ok makes sense, I updated the job with an-druid1004 and reran it for the last 2 days, I think you can delete the data
[10:12:29] ack dcausse - doing - thanks for the fix!
[12:07:17] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 3 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (BTullis) I am making significant progress on this now and have several more CRs to merge. Al HDFS and Ya...
[12:35:10] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) Marking this task as paused, because we are going to wait for T291472 to be completed first.
[13:07:59] (CR) Kosta Harlan: [C: +1] Add an image: update schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736070 (https://phabricator.wikimedia.org/T294659) (owner: MewOphaswongse)
[13:21:47] Analytics, Event-Platform, Readers-Web-Backlog, Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (EYener)
[13:33:02] Analytics, Event-Platform, Readers-Web-Backlog, Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Ottomata)
[13:38:49] (CR) DCausse: [C: +1] Add performer field to sparql/query [schemas/event/secondary] - https://gerrit.wikimedia.org/r/735445 (https://phabricator.wikimedia.org/T293462) (owner: Ebernhardson)
[13:39:39] Analytics-Clusters, DC-Ops, Data-Engineering, SRE, ops-eqiad: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (Ottomata) Thank you! Just in time! :)
[13:40:05] elukey: do we need a new rule in the VLAN firewall to allow druid public?
[13:40:08] do an-db*
[13:40:10] to*
[13:54:00] ottomata: o/
[13:54:20] sorry I didn't get the use case
[13:54:29] druid-public -> an-db?
[13:55:54] yes... druid public will need to access the druid_public_eqiad db
[13:55:55] right?
[13:56:59] ah yes yes, but druid public is outside the analytics vlan and an-db100x has an ip inside it, so in theory it should already be allowed by the current ferm rules
[13:58:44] OH right
[13:58:45] in allowed
[13:58:47] out not.
[13:59:02] exactly yes, and ferm should already be ok
[14:35:31] ottomata: Good morning - would you have some time to talk about gobblin-metrics with me?
[14:36:10] ottomata: Now that we've defined a metric-integration plan, I'm hitting implementation problems
[15:01:58] Analytics, DBA, Event-Platform, WMF-Architecture-Team: Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) Huh, very interesting paper! @Nuria did you read? I mostly understand, but had some questions as I worked...
[15:02:14] joal: hello!
[15:02:23] Hi ottomata :)
[15:02:24] sorry just missed this, was reading https://arxiv.org/pdf/2010.12597v1.pdf
[15:02:33] ottomata: In tech-meeting now - later maybe?
[15:02:41] errrr tech meeting starting, and i have meetings/interviews for the next 2.5 hours
[15:02:52] actually 3 really
[15:02:52] later, or tomorrow ottomata
[15:02:56] either ya!
[15:55:59] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (razzi)
[15:56:01] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Setup Presto UI in production - https://phabricator.wikimedia.org/T292087 (razzi) Open→In progress
[15:56:12] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Setup Presto UI in production - https://phabricator.wikimedia.org/T292087 (razzi)
[15:57:14] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Upgrade Superset to 1.3.1 or higher - https://phabricator.wikimedia.org/T288115 (razzi) I haven't encountered any issues and nobody has reported any, so after one last pass through the staging instance,...
[16:00:34] (PS2) MewOphaswongse: Add an image: update schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736070 (https://phabricator.wikimedia.org/T294659)
[16:00:48] (CR) MewOphaswongse: Add an image: update schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736070 (https://phabricator.wikimedia.org/T294659) (owner: MewOphaswongse)
[16:04:55] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Cmjohnson) I am still not sure where this server is, I cannot find it. @Jclark-ctr is out this week.
[16:15:31] Analytics, Analytics-Kanban, Data-Engineering-Kanban, wmfdata-python, Product-Analytics (Kanban): wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (razzi) a:nshahquinn-wmf→Milimetric I hear Dan is workin on this Dan do you need reviewers...
[16:45:39] (CR) Razzi: [V: +2 C: +2] "Manually tested this on staging, no regressions found" [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/724783 (owner: Razzi)
[16:47:42] (CR) Razzi: "Going straight to 1.3.1" [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/721340 (owner: Razzi)
[16:48:00] (Abandoned) Razzi: Update superset package to 1.3 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/721340 (owner: Razzi)
[16:52:51] (PS1) Razzi: Update superset package to 1.3.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/736535
[16:54:34] (CR) Razzi: [V: +2 C: +2] "The change is just a 1-liner in the requirements and we've tested on staging, so I'm going forward with this!" [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/736535 (owner: Razzi)
[16:55:06] (Abandoned) Razzi: Update superset package to 1.3.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/724783 (owner: Razzi)
[16:57:38] !log dump mysql in preparation for superset upgrade
[16:57:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:03:08] Ok superset is down, unexpected database error when upgrading
[17:03:13] Ben and I are looking in to it right now
[17:07:05] Turning off the superset ui as we debug the database
[17:07:56] !log razzi@an-tool1010:~$ sudo systemctl stop superset
[17:07:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:29:55] ok, who wants to deploy the mw history snapshot? code review up at https://gerrit.wikimedia.org/r/c/operations/puppet/+/736542 and instructions at https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend
[17:41:37] (PS3) MewOphaswongse: Add an image: update schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736070 (https://phabricator.wikimedia.org/T294659)
[17:47:59] (PS4) MewOphaswongse: Add an image: update schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736070 (https://phabricator.wikimedia.org/T294659)
[17:55:01] milimetric: just sent a CR for druid datasource bump
[17:55:18] joal: oh was mine wrong?
[17:55:33] oh sorry milimetric - I meant I sent a comment
[17:55:37] oh I see
[17:55:40] ok, doing
[17:55:45] <3
[17:57:21] milimetric: if you have time I'd also like to talk about the sqoop addition for discussiontools_subscription
[17:57:39] joal: I had no idea we had a separate yaml... I'll update the docs. Also I feel like this is even more of a candidate for automation now
[17:58:19] joal: I think we both have 1/1s with Olja coming up, I'm free after yours if you're around
[17:58:27] I'll read the patch in the meantime
[17:58:32] yeah let's do that milimetric :)
[18:06:51] milimetric: in case it's not on purpose: on the AQS config patch you updated the commit message but not the second file
[18:10:03] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:20:15] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:27:27] Analytics, Data-Engineering, Data-Engineering-Kanban: Data structuring guidance request - https://phabricator.wikimedia.org/T287402 (Bumeh-ctr) Hi @Ottomata, thanks for the clarification . Sorry it took me this long to make this comment. You are right that our reports are annual and I also think we s...
[18:30:37] heh, doh, forgot -a in my commit, thanks joal
[18:34:14] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Upgrade Superset to 1.3.1 or higher - https://phabricator.wikimedia.org/T288115 (razzi) The upgrade finished successfully, there was about 30 minutes of downtime as an unexpected database error blocked...
[18:37:35] Analytics: Superset annotation text overlaps illegibly - https://phabricator.wikimedia.org/T279738 (razzi) This has been fixed in superset 1.3.1!
[18:41:47] Analytics, Data-Engineering, Data-Engineering-Kanban: Data structuring guidance request - https://phabricator.wikimedia.org/T287402 (Ottomata) Hello! > we have metrics for countries, sub-continents and continents which were computed from, for instance, Editors and Pageviews data that already exist i...
[18:45:43] razzi: this is ready and I think you're the best person for the job :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/736542
[18:48:09] (PS5) MewOphaswongse: Add an image: update schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736070 (https://phabricator.wikimedia.org/T294659)
[18:48:15] milimetric: want to hop on a call to talk about the patch?
[18:49:08] omw
[18:49:18] in the cave razzi
[18:59:49] mforns: got 30 mins!
[19:00:00] :] bc?
[19:00:09] ottomata: ^
[19:00:23] bc busy!
[19:00:39] https://meet.google.com/kti-iybt-ekv
[19:03:26] joal: I'll be in the cave when you're ready
[19:06:15] arg! forgot about daylight confusion, a meeting is earlier than usual
[19:16:39] Analytics: Superset annotation text overlaps illegibly - https://phabricator.wikimedia.org/T279738 (razzi) Open→Resolved
[20:24:48] Analytics, Analytics-Kanban, Data-Engineering-Kanban, wmfdata-python, Product-Analytics (Kanban): wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (nshahquinn-wmf) a:Milimetric→nshahquinn-wmf Yes, that's why I had assigned it to myself 😊
[20:30:11] Analytics: Superset annotation text overlaps illegibly - https://phabricator.wikimedia.org/T279738 (nettrom_WMF) Resolved→Open Still broken for me in 1.3.1, both when looking at a previously broken chart, or trying to add annotations to a new chart. Let me know if [[ https://superset.wikimedia.org/r/...
[21:00:05] (PS6) MewOphaswongse: Add an image: update schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736070 (https://phabricator.wikimedia.org/T294659)
[21:22:14] (PS1) Bearloga: movement_metrics: Migrate pageviews_corrected [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/736583 (https://phabricator.wikimedia.org/T291956)
[21:26:21] (PS2) Bearloga: movement_metrics: Migrate pageviews_corrected [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/736583 (https://phabricator.wikimedia.org/T291956)
[21:28:24] (PS7) MewOphaswongse: Add an image: update schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736070 (https://phabricator.wikimedia.org/T294659)
[22:27:32] (CR) Cicalese: [C: +2] Update PHP and version queries [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/730389 (owner: Tim Starling)
[22:30:14] (CR) Cicalese: [C: +2] "I tested the reports on stat1007, and they ran successfully." [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/730389 (owner: Tim Starling)