[00:26:06] PROBLEM - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:32:09] Analytics: General Usage statistics for AQS - https://phabricator.wikimedia.org/T284610 (Milimetric)
[12:21:02] Analytics, Event-Platform, Performance-Team, Patch-For-Review: EditorActivation Event Platform Migration - https://phabricator.wikimedia.org/T284679 (Gilles) I honestly have no idea, I've never looked at this data.
[13:06:27] Analytics, Event-Platform, Performance-Team, Patch-For-Review: EditorActivation Event Platform Migration - https://phabricator.wikimedia.org/T284679 (Ottomata) @gilles, you updated {T282131} noting that this schema should be migrated. Who uses this data if not you? Should we decom it instead?
[13:20:11] Analytics, Event-Platform, Performance-Team, Patch-For-Review: EditorActivation Event Platform Migration - https://phabricator.wikimedia.org/T284679 (Gilles) The code writing to that schema is still active in MediaWiki. I don't know if anyone is still consuming that data and in what form. It just...
[13:27:20] Analytics, Event-Platform, Performance-Team, Patch-For-Review: EditorActivation Event Platform Migration - https://phabricator.wikimedia.org/T284679 (Ottomata) Ah I see ok. Hm. We are trying to turn off streams that don't have owners; We do have [[ https://stats.wikimedia.org/#/en.wikipedia.org/...
[13:31:39] Analytics: Address jackson version security vulnerabilities in refinery-source - https://phabricator.wikimedia.org/T272058 (hashar)
[13:42:06] Analytics, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: Automate regular WDQS query parsing and data-extraction - https://phabricator.wikimedia.org/T273854 (MPhamWMF)
[14:02:42] Analytics, Analytics-Kanban, WMDE-Analytics-Engineering, WMDE-New-Editors-Banner-Campaigns: Drop old WMDEBanner events from Hive - https://phabricator.wikimedia.org/T281300 (Ottomata) FYI, WMDEBannerEvent had a badly interpreted field type in Hive from years ago. Its `event.eventRate` had been in...
[14:06:39] hey team!
[14:06:59] ottomata: I just saw your email, awesome you found the thing
[14:07:20] !log altered event.wmdebannerevent event.eventRate field to change type from BIGINT to DOUBLE - T282562
[14:07:22] oops wrong ticket
[14:07:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:07:24] T282562: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562
[14:08:10] mforns: yeah! so
[14:08:18] Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 (Ottomata) FYI, WMDEBannerEvent had a badly interpreted field type in Hive from years ago. Its `event.eventRate` had been interpreted as an integer inste...
[14:08:32] it was working for us yesterday because of a difference in the configuration of refine_eventlogging_analytics and the default way refine jobs behave
[14:08:40] oh...
[14:08:44] namely the merge_with_hive_schema_before_read
[14:08:48] aha
[14:08:48] parameter to refine
[14:09:02] that would have taken me ages to find out..
[14:09:35] yeah i started going up the code, was going to manually execute the RefineTarget.find function to get the RefineTarget, rather than instantiating it directly
[14:09:43] and then realized that was a parameter and remembered
[14:28:29] Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 (kai.nissen) Huh, I'm pretty sure that we saw non-integer values in there. Anyway, probably that also applies to the two other events.
[15:35:19] Analytics-Radar, Event-Platform, MW-1.36-notes (1.36.0-wmf.37; 2021-03-30), MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), Technical-Debt (Deprecation process): extensions/EventBus - Use UserGroupManager instead of User group methods - https://phabricator.wikimedia.org/T281825 (Vlad.shapik) Ope...
[15:36:06] Analytics-Radar, Event-Platform, MW-1.36-notes (1.36.0-wmf.37; 2021-03-30), MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), Technical-Debt (Deprecation process): extensions/EventBus - Use UserGroupManager instead of User group methods - https://phabricator.wikimedia.org/T281825 (Vlad.shapik) a:V...
[16:04:28] milimetric: standup?
[16:05:21] no internet ottomata, I msged in slack, should've done so here
[16:05:54] (cell service is like 1 bar right now, whole town lost power)
[16:25:33] !log rolling restart hadoop masters to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194
[16:25:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:31:07] razzi: o/ remember to wait a few minutes between restarts, the GC metrics need to return to a healthy state first
[16:39:49] elukey: that's built into the cookbook, right?
[16:41:39] Analytics, Analytics-Wikistats: Top edited pages list on enwiktionary contains nonexistent pages with titles made up of question marks - https://phabricator.wikimedia.org/T284623 (odimitrijevic) p:Triage→High
[16:45:26] Analytics, Analytics-Wikistats: Top edited pages list on enwiktionary contains nonexistent pages with titles made up of question marks - https://phabricator.wikimedia.org/T284623 (odimitrijevic) a:Milimetric Related to https://phabricator.wikimedia.org/T230915
[16:46:02] Analytics, Analytics-Data-Quality, Analytics-Kanban: Import of MediaWiki tables into the Data Lakes mangles usernames - https://phabricator.wikimedia.org/T230915 (Milimetric) a:Milimetric seems related to T230915, so I'm going to look at both.
[16:46:25] razzi: the cookbook waits for some minutes IIRC, but it is always better to keep an eye on the graphs
[16:46:28] lemme check
[16:46:54] Analytics: General Usage statistics for AQS - https://phabricator.wikimedia.org/T284610 (odimitrijevic) Open→Resolved a:odimitrijevic
[16:47:25] ah it waits 10 mins by default
[16:47:27] should be ok
[16:47:48] Analytics, Analytics-Kanban, Analytics-Wikistats: Top edited pages list on enwiktionary contains nonexistent pages with titles made up of question marks - https://phabricator.wikimedia.org/T284623 (Milimetric)
[16:49:18] Analytics: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (odimitrijevic) p:Triage→High a:mforns
[16:53:30] Analytics, Internet-Archive, Platform Engineering: Replace Airflow's HDFS client (snakebite) with pyarrow - https://phabricator.wikimedia.org/T284566 (odimitrijevic) p:Triage→Medium a:razzi
[16:53:45] * elukey afk!
[16:54:44] ah weird, razzi did the cookbook fail?
[16:54:55] elukey: yeah I was going to ask about that, I somehow sent the stdin an eof
[16:55:38] elukey: https://phabricator.wikimedia.org/P16421
[16:55:47] so I plan to do the rest of the steps manually
[16:56:21] razzi: ahhh okok so thing weird happening, good :)
[16:56:50] Analytics, SRE, Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (odimitrijevic) a:odimitrijevic
[16:59:02] Analytics, Analytics-Kanban, Patch-For-Review: Move WikimediaEventUtilities logging to Slf4j - https://phabricator.wikimedia.org/T284537 (odimitrijevic) p:Triage→High
[16:59:06] Analytics, Analytics-Kanban: Missing data in virtualpageview_hourly table since April 15, 2021 - https://phabricator.wikimedia.org/T282710 (mforns) Open→Resolved
[16:59:17] Analytics: Refactor analytics-meta MariaDB layout to multi instance with failover - https://phabricator.wikimedia.org/T284150 (odimitrijevic) p:Triage→Medium
[17:04:55] (PS3) Mforns: Add safety limits to refinery-drop-older-than [analytics/refinery] - https://gerrit.wikimedia.org/r/694547 (https://phabricator.wikimedia.org/T270433)
[17:12:32] The cookbook roll-restart-masters errored out at line 154, so I'm going to run the last few commands manually and log them all here
[17:12:49] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
[17:12:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:24:10] !log sudo systemctl restart hadoop-hdfs-zkfc on an-master1002
[17:24:33] !log sudo systemctl restart hadoop-hdfs-namenode on an-master1002
[17:24:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:24:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:58:00] milimetric: got a sec to bb airflow thing?
[17:58:09] trying to figure out how to parameterize a dag
[17:58:12] to the batcave!
[17:58:37] ottomata: ^
[18:17:51] !log sudo systemctl restart hadoop-mapreduce-historyserver
[18:17:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:33:18] milimetric: its worrrkkking
[20:36:33] ottomata: what? output of an operator defining dynamic operators?
[20:36:37] nono
[20:36:44] just the canary events via java thing
[20:36:50] dynamic tasks, but using event stream config api
[20:37:15] very cool
[20:37:27] that jpy thing is crazy
[20:37:33] its really slick
[20:37:37] i can't believe how easy that was
[21:10:49] I'm looking into this systemd failure monitor_refine_eventlogging_analytics and I'm seeing `unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.` On an-launcher1002 there's no /home/analytics directory, so I bet creating that directory belonging to analytics will allow dconf to work. Should I go for it?
[21:14:43] hmmm
[21:14:51] analytics is a system user so it shouldn't really have a home....
[21:15:16] hmmm
[21:15:36] razzi: we might want to ask moritzm about that
[21:15:46] razzi: btw, i'm pretty sure that is just a warning
[21:15:51] that systemd timer should succeed if you run it now
[21:15:53] hm ok
[21:16:05] i think its in failure mode from the wmde banner events failures we had yesterday
[21:17:05] !log sudo systemctl restart monitor_refine_eventlogging_analytics
[21:17:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:17:24] Analytics: [SPIKE] analytics-airflow jobs development - https://phabricator.wikimedia.org/T284172 (Ottomata) I got ProduceCanaryEvents running nicely in Airflow using [[ https://jpype.readthedocs.io/en/latest/userguide.html | JPype ]] and wikimedia-event-utilities; so I didn't have to write an EventStreamCon...
[21:19:29] RECOVERY - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:21:31] Ok cool ottomata that fixed it ^
[21:21:37] great thanks razzi
[21:21:40] i should have done that earlier
[21:21:51] milimetric: https://github.com/ottomata/analytics-airflow-spike
[21:21:53] :)
[21:21:59] ok yall gotta run for the eve, ttyt
[23:08:10] Analytics-Clusters, Analytics-Kanban, observability, Patch-For-Review, and 2 others: Setup Analytics team in VO/splunk oncall - https://phabricator.wikimedia.org/T273064 (razzi)
[23:20:29] Analytics-Clusters, Product-Analytics: Configure superset cache - https://phabricator.wikimedia.org/T268784 (razzi) Update here: we had to roll back the caching because our data access permissions weren't used by caching. Architecturally, this is a real problem: the permissions are checked when the query...
[23:20:33] Analytics-Clusters, Product-Analytics: Configure superset cache - https://phabricator.wikimedia.org/T268784 (razzi) Open→Resolved
[23:20:35] Analytics-Clusters, Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (razzi)
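
Annotation (not part of the channel log): the manual recovery that followed the roll-restart-masters cookbook failure is scattered across the !log entries between 17:12:49 and 18:17:51. As a rough sketch, those commands can be collected into one ordered, dry-run checklist. The command strings are copied from the log; the `Step` type, the `render_plan` helper, and the "host not stated" placeholder are hypothetical names introduced here for illustration, and nothing below executes a command or reflects how the cookbook itself is implemented.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    host: Optional[str]  # None where the log entry does not name a host
    command: str         # copied verbatim from the !log entries


# Order follows the log timestamps (17:12:49, 17:24:10, 17:24:33, 18:17:51).
MANUAL_STEPS: List[Step] = [
    Step(None, "sudo -u hdfs /usr/bin/hdfs haadmin -failover "
               "an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet"),
    Step("an-master1002", "sudo systemctl restart hadoop-hdfs-zkfc"),
    Step("an-master1002", "sudo systemctl restart hadoop-hdfs-namenode"),
    Step(None, "sudo systemctl restart hadoop-mapreduce-historyserver"),
]


def render_plan(steps: List[Step]) -> List[str]:
    """Return one numbered, human-readable line per step; nothing is executed."""
    lines = []
    for number, step in enumerate(steps, start=1):
        where = step.host or "host not stated in log"
        lines.append(f"{number}. [{where}] {step.command}")
    return lines


if __name__ == "__main__":
    print("\n".join(render_plan(MANUAL_STEPS)))
```

Keeping the list as data rather than as a script leaves room for the pause between restarts that elukey recommends (the cookbook default mentioned at 16:47:25 is 10 minutes) and for checking the GC graphs by hand before each next step.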