[00:00:03] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal_daily on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:05:47] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal-sanitization_daily on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal-sanitization_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:30:51] PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal-sanitization_daily on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal-sanitization_daily https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:56:12] Data-Engineering, Data-Engineering-Kanban: Implement one golang AQS microservice - https://phabricator.wikimedia.org/T299729 (Milimetric)
[01:03:34] !log rerunning the eventlogging_to_druid_network_flows_internal-sanitization_daily timer that failed to get logs
[01:03:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[01:06:46] Data-Engineering, Data-Engineering-Kanban: Implement one golang AQS microservice - https://phabricator.wikimedia.org/T299729 (Milimetric)
[01:07:43] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: Implement aggregate endpoint of the pageviews API - https://phabricator.wikimedia.org/T299731 (Eevans)
[01:09:15] Data-Engineering, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Implement pageviews endpoints - https://phabricator.wikimedia.org/T288296 (Eevans)
[01:09:17] Analytics, Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (Eevans)
[01:09:20] Analytics, 
Code-Health-Objective, Epic, Platform Engineering Roadmap, and 2 others: Implement aggregate endpoint of the pageviews API - https://phabricator.wikimedia.org/T299731 (Eevans)
[01:10:59] Data-Engineering, Platform Engineering Roadmap, User-Eevans: Implement top endpoint of the pageviews API - https://phabricator.wikimedia.org/T299732 (Eevans)
[01:12:37] Data-Engineering, Platform Engineering Roadmap, User-Eevans: Implement top-by-country endpoint of the pageviews API - https://phabricator.wikimedia.org/T299733 (Eevans)
[01:14:06] Data-Engineering, Platform Engineering Roadmap, User-Eevans: Implement top-per-country endpoint of the pageviews API - https://phabricator.wikimedia.org/T299734 (Eevans)
[01:20:43] Data-Engineering, Platform Engineering Roadmap, User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (Eevans)
[03:34:23] Data-Engineering, Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (jwang) @BTullis, thank you very much for working on this issue. I tried the query. It failed with a different error. (please see log below). Log of error: ` -----...
[07:27:15] good morning folks!
[07:27:23] there is something interesting about matomo
[07:28:06] after the reboot, there is a unit that fails to execute, that is the one that periodically restarts the prometheus mysql exporter if there are upgrades etc..
[07:28:18] I am not sure if we have done something manually before, but afaics:
[07:28:57] - wmf_auto_restart_prometheus-mysqld-exporter@matomo.service fails due to prometheus-mysqld-exporter@matomo
[07:29:14] - in turn prometheus-mysqld-exporter@matomo fails since it can't find mariadb@matomo.service
[07:30:00] We have a plain old mariadb.service, not a specific instance, and its exporter is masked (prometheus-mysqld-exporter)
[07:30:32] now usually this is the correct config (since we have mostly mariadb instances everywhere), but matomo is special, so there may be some puppet change to do
[07:31:25] yeah we don't have metrics
[07:31:26] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&from=now-10d&to=now&var-job=All&var-server=matomo1002&var-port=13306
[07:32:43] class is profile::piwik::database
[09:42:03] joal: Hey! About superset access for Andrea, do you know what I should request?
[09:42:28] gehel: Hi - do we expect her to wish to use more than superset? (CLI or notebooks)
[09:42:38] I'm not sure yet
[09:43:10] she seems to have experience with pyhive, so that goes in the direction of giving her a lower level access than superset
[09:43:27] gehel: I guess you'll find answers here, but I'm glad to help as needed: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request?
[09:43:50] thanks! I'll have a look
[09:44:59] I think I'll start with the superset access
[09:45:09] ack :)
[09:57:17] elukey: ack. Thanks for looking at it. I can't think of any recent config changes to matomo that would have caused this.
[09:58:21] yeah I think it was a manual fix
[09:59:30] Oh, I see. So someone must have unmasked the mariadb@matomo.service and this worked for a long time, but puppet put it back and the next reboot broke it?
[10:04:45] I am wondering if we masked the prometheus mariadb matomo exporter and unmasked the prometheus mariadb exporter (that is usually masked)
[10:05:05] and with "we" I blame myself, I could be the root cause :D
[10:05:37] OK, no worries. I'll try to sort it now.
[10:20:38] Hi btullis - I'm looking quickly at cassandra heap used dashboards - would you have a minute to talk with me?
[10:29:02] joal: Yes. Shall we bc?
[10:29:10] yes! on my way :)
[10:53:34] Unmasking and starting the non-matomo prometheus exporter didn't work. `msg="failed reading ini file: open /var/lib/prometheus/.my.cnf: no such file or directory"`
[10:57:06] interesting
[11:01:37] It just looks like mysqld_exporter_instance doesn't like not having a named mariadb instance to work with: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/prometheus/mysqld_exporter_instance.pp#L2
[11:01:57] Excuse the double-negative.
[11:05:28] the metrics stopped right after the reboot, so something changed
[11:07:18] Agreed.
[11:33:42] I think it was caused by this commit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/716306
[11:35:33] When we create the `profile::prometheus::mysqld_exporter_instance {'matomo':` we override $socket with the mysql default: but since this commit, the $title is used in the systemd unit file for the
[11:35:33] prometheus mariadb exporter that is created.
[11:36:40] nice find!
[11:36:42] I'll speak to data persistence about it.
[12:42:38] heya teammm
[12:43:03] Wotcha mforns. :-)
[12:43:15] :]
[12:43:31] ottomata: just read the thing about skein keytabs...
[12:47:53] Data-Engineering, Data-Engineering-Kanban: Matomo mariadb metrics are not being scraped by prometheus - https://phabricator.wikimedia.org/T299762 (BTullis)
[13:48:20] Hi ottomata - do you wish we talk about the network_flows_internal-sanitization error?
[13:48:57] sure joal
[13:49:06] mforns: yeah will be testing today
[13:49:17] mforns: just pushed code i was working on yesterday, still wip esp now with keytabs
[13:49:26] also working on adapting SparkSqlOperator to use the new SparkSubmitOperator
[13:49:29] ottomata: OK, will look!
[13:49:36] awesome
[13:49:46] thought: if we DO use skein...we might not need the SparkSQLNoCLIDriver!
[13:49:51] we might still want to use it anyway
[13:49:56] ok joal lets see
[13:50:16] how can I help? i don't see anything in logs, and dan reran and didn't get much info.
[13:50:22] makes sense
[13:50:44] My guess ottomata is that the job runs but there is no data to sanitize, and that makes it fail
[13:51:09] hm ok
[13:51:25] but that shouldn't make it fail then, right? issue with job exitcode?
[13:51:59] And actually by looking at the logs, it seems that things are not right in the job config - looking
[13:53:24] ohk
[13:53:49] nope, was wrong - things are ok - will continue to investigate
[13:54:38] ok that was what I expected: the druid indexation job failed
[13:54:58] sanitization jobs for druid are re-indexation jobs that use druid as source as well as target
[13:55:26] and my guess is that the druid indexation job fails because there is no data for the time we ask it to process
[13:55:35] will check in druid
[13:56:37] ohhh
[13:56:44] so not even in hdfs, interesting
[13:59:19] joal, but the immediate job failed as well no?
[14:02:39] heya team, do you use the same ssh keys for Gerrit and GitLab? Or have separate ones?
[14:03:25] mforns: nope, the hourly job succeeded and the daily one as well (they failed a few days back)
[14:03:33] only the sanitization failed
[14:03:35] ah! ok ok
[14:03:42] mforns: i think i use the same keys
[14:03:52] ok ottomata I do too
[14:04:33] interesting - looks like latest logs on an-druid1003 are from 2021-11 :(
[14:04:46] ouch
[14:06:20] eh?
[14:07:10] ottomata: on an-druid1003, in /var/log/druid, overlord.log for instance
[14:07:40] that one has data up to 2022-01-11
[14:11:33] joal: meaning... that druid stuff hasn't been running there?
[14:13:24] ottomata: I don't know! I found logs for middlemanager and historical - but not for overlord :(
[14:13:47] ottomata: I found confirmation of my ideas in middlemanager logs: SegmentDescriptorInfo is not found usually when indexing process did not produce any segments meaning either there was no input data to process or all the input events were discarded due to some error
[14:14:17] So the druid indexation job failed due to no input - I would have expected the thing to be more robust :(
[14:14:20] okay
[14:14:31] can we make it more robust? :)
[14:14:34] I'm gonna re-absent the sanitization job
[14:14:36] mforns: ^ ?
[14:14:38] okay
[14:15:13] ottomata: what is the question?
[14:15:29] whether we can make it more robust?
[14:15:41] ah, I see
[14:16:08] yea, I would have expected the thing to be more robust as well :]
[14:16:17] let's do it
[14:17:04] I think it's gonna be complicated to make it more robust on our side - the failure is on the druid indexation side
[14:19:08] joal: we can just not call druid indexation if the DataFrame is empty no?
[14:19:26] that could work mforns!
[14:19:29] are we adding an extra data read to do so?
[14:19:44] isn't there any "free" way of checking that?
[14:20:00] yeah we're adding some compute, but it shouldn't be very expensive
[14:20:19] or at least I think
[14:21:23] oh! joal, we can check the existence of the temp json file (after read)
[14:21:40] if it does not exist, then we don't call indexation
[14:21:49] mforns: I think it'll exist even if empty
[14:21:56] oh :(
[14:21:59] why?
[14:22:07] mforns: we should check on file-size
[14:22:18] ok, good idea
[14:22:22] mforns: spark will create a file even if empty
[14:22:41] ah, is it a directory?
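A minimal sketch of the check being discussed above: skip Druid indexation when the Spark output holds no data, judged by file size rather than by existence. It assumes the temp output is a directory of part files on a local filesystem; the real job writes to HDFS, so the listing and size calls would differ there. The names `output_has_data`, `index_if_not_empty` and `run_indexation` are hypothetical and are not taken from the actual job code.

```python
from pathlib import Path
from typing import Callable


def output_has_data(output_dir: Path) -> bool:
    """Return True if any data file directly under output_dir is non-empty."""
    if not output_dir.is_dir():
        return False
    for path in output_dir.iterdir():
        # Skip marker and checksum files such as _SUCCESS or .part-00000.crc.
        if path.name.startswith(("_", ".")):
            continue
        if path.is_file() and path.stat().st_size > 0:
            return True
    return False


def index_if_not_empty(output_dir: Path, run_indexation: Callable[[Path], None]) -> bool:
    """Call run_indexation (a hypothetical callback) only when there is data."""
    if not output_has_data(output_dir):
        return False
    run_indexation(output_dir)
    return True
```

Checking sizes this way avoids an extra read of the data itself, which matches the concern about adding compute.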
[14:25:13] when using spark.write, it creates a directory and writes inside - I think it'll write an empty file if dataset is empty
[14:26:51] I see joal, I was looking if there are temp files to check, but the job deletes everything at cleanup
[14:30:29] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Matomo mariadb metrics are not being scraped by prometheus - https://phabricator.wikimedia.org/T299762 (BTullis) p:Triage→Medium The reason that this failed recently, is that we had a combination of single-instance mariadb configura...
[14:40:06] ottomata: just submitted a patch
[14:40:58] k
[14:51:26] joal: 09:40:15 Line 6: Bug: value must be a single phabricator task ID
[14:51:35] missing T i think
[14:51:41] MEh :(
[14:51:43] correcting
[15:08:37] merged and applied joal
[15:20:27] wow i think i got it
[15:20:34] re skein and keytabs and spark!
[15:20:42] need to cleanup code
[15:20:43] but
[15:21:49] i convert e.g. the --keytab spark submit opt into a skein files param, which uploads the keytab to the worker, then I change spark's --keytab opt to use the worker's relative path to the uploaded file!
[15:21:57] just like with archives or whatever!
[15:56:55] joal: Have you got another minute to discuss the cassandra graphs again? I have a theory.
[15:59:31] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) @JAllemandou noticed several really interesting thing about the Java heap graphs, which migh...
[16:00:41] joal: any particular reason the SparkSQLNoCLIDriver doesn't log or print the query result to the driver, like SparkSQLCLIDriver does?
[16:06:23] mforns: i got keytabs to work, woohoo!
[16:09:25] ottomata: yeyeyeyeyeye!!!
[16:09:33] just pushed my change
[16:09:47] ok, will review
[16:10:19] h i thought i did
[16:10:40] ?
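A rough sketch of the keytab rewrite described at 15:21:49: take the local path given to `--keytab`, register it as a file to upload, and point `--keytab` at the uploaded file's relative name instead. This is an illustration only, not the code in the merge request; `rewrite_keytab_arg` is a hypothetical name, and the returned mapping (destination name to local source path) is just an assumed shape for whatever gets handed to skein as files.

```python
import os
from typing import Dict, List, Tuple


def rewrite_keytab_arg(spark_args: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Return (rewritten args, files to upload) for a spark-submit arg list."""
    new_args = list(spark_args)
    files: Dict[str, str] = {}
    for i, arg in enumerate(new_args[:-1]):
        if arg == "--keytab":
            local_path = new_args[i + 1]
            # On the worker the uploaded file is referenced by a relative name.
            remote_name = os.path.basename(local_path)
            files[remote_name] = local_path
            new_args[i + 1] = remote_name
    return new_args, files
```

For example, `["--keytab", "/etc/security/keytabs/analytics.keytab"]` (a made-up path) would come back as `["--keytab", "analytics.keytab"]` together with a one-entry upload mapping.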
[16:11:25] oh i did
[16:11:29] sorry gitlab wasn't refreshing right
[16:11:49] mostly it is all in the skeinize() (name TBD?) method of SparkSubmitHook
[16:11:58] see line 230-250
[16:18:44] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) Now I look at it more closely, the heap exhaustion seems to correlate much more closely with...
[16:36:27] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) If you agree, then I'm ready at any time to do this: ` btullis@puppetmaster1001:~$ sudo -i c...
[16:36:46] hm got some weirdness now with archive files, but i think i can make it work
[16:37:40] heya folks - I was away for kids
[16:37:47] btullis: wanna chat?
[16:38:20] Yeah, sure. It's about this: https://phabricator.wikimedia.org/T298516#7640651
[16:40:33] Analytics-Radar, Event-Platform: Eventgate validation error: '.event.connectEnd' should be >= 0, '.event.connectStart' should be >= 0, '.event.fetchStart' should be >= 0, '.event.requestStart' should be >= 0, '.event.responseEnd' should be >= 0, '.event.responseSta... - https://phabricator.wikimedia.org/T299670
[16:41:18] super nice btullis
[16:41:35] I have a comment and a question :)
[16:43:06] comment: the reason for which every instance exhibits the same behavior is because, as you mentioned, reads are distributed - queries go to whatever node is pooled, and that node then queries another node (if needed) to get the result - leading every node to exhibit read-related problems if any
[16:43:47] question: there was heap exhaustion just before the read-related pattern - I assume we would tie this one to the loading?
[16:44:09] Analytics-Radar, Event-Platform: Eventgate validation error: '.event.connectEnd' should be >= 0, '.event.connectStart' should be >= 0, '.event.fetchStart' should be >= 0, '.event.requestStart' should be >= 0, '.event.responseEnd' should be >= 0, '.event.responseSta... - https://phabricator.wikimedia.org/T299670
[16:44:50] ottomata: about keytab and skein - do we feel safe in uploading keytabs to workers?
[16:45:34] ottomata: heya - wanna chat about SparkSQLCLI?
[16:46:00] joal: Yes, the previous comment on the ticket is what I wrote after our conversation. That highlights the time of the loading of the /big/ table. And yes, this does definitely exhibit high heap usage, which wasn't seen during the loading of the /huge/ table. Still can't explain that bit adequately.
[16:46:33] btullis: makes sense
[16:46:33] But we didn't enable any reads until December 17th, which is when aqs1010 was pooled for the first time.
[16:46:41] ack
[16:47:23] btullis: let's wait until mid-next week to check that aqs1010 is happy, and then we could repool, and watch if your theory is correct :)
[16:48:02] Analytics-Radar, Event-Platform: eventlogging_VisualEditorTemplateDialogUse: '.event.template_names[0]' should be string - https://phabricator.wikimedia.org/T299779 (cjming)
[16:49:35] joal: ack. How does Tuesday morning sound to you? 
:-)
[16:51:01] Analytics-Radar, Event-Platform, VisualEditor-MediaWiki-Templates: eventlogging_VisualEditorTemplateDialogUse: '.event.template_names[0]' should be string - https://phabricator.wikimedia.org/T299779 (cjming)
[16:51:49] sounds great btullis :)
[16:56:19] Data-Engineering, Data-Engineering-Kanban: Implement one golang AQS microservice - https://phabricator.wikimedia.org/T299729 (Eevans)
[16:59:12] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) Suggestion from @JAllemandou is to wait until mid-week (maybe Tuesday morning 2022-01-25) be...
[17:04:40] Data-Engineering, Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (BTullis) I'm afraid I don't know what the cause of this might be. The fact that you get this message: `AttemptID:attempt_1637058075222_385913_m_000174_3 Timed out aft...
[17:16:54] Analytics-Radar, Event-Platform, VisualEditor-MediaWiki-Templates: eventlogging_VisualEditorTemplateDialogUse: '.event.template_names[0]' should be string - https://phabricator.wikimedia.org/T299779 (cjming)
[17:18:10] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (Eevans) >>! In T298516#7640651, @BTullis wrote: > Scratch that previous theory... > > Now I look at...
[17:53:01] Analytics, Event-Platform, Wikimedia-production-error: Eventgate validation error: '.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string - https://phabricator.wikimedia.org/T299669 (Aklapper) (Please add code project tags when possible, so...
[17:54:02] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) As mentioned in this issue: https://github.com/linkedin/datahub/issues/3504 > ...Datahub does not officially support a non-docker based installation but I would r...
[17:55:07] Analytics, Event-Platform, Wikimedia-production-error: Eventgate validation error: '.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string - https://phabricator.wikimedia.org/T299669 (Ottomata) And of {T261665}
[17:55:39] joal: your ping! sorry
[17:55:40] yes lets chat
[17:56:41] I'm off now folks. Have brilliant weekends!
[17:59:01] byeeeee!
[18:13:35] Data-Engineering, Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (mforns) @jwang Agree with @BTullis, this dataset is partitioned hourly. This means that looking at 1 year of data it will process 24*365=8760 partitions. Since the qu...
[19:03:00] mforns: ok fixed the local --archives thing too.... i think it all works
[19:03:45] it's pretty un-obvious and maybe a little hacky, but i think necessary
[19:03:50] tried to document a lot
[19:04:00] see hooks/spark.py line 241 and below: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/7/diffs#e4e006fa6f9b08ddc6e65dffbb2ebd764ccfcbec
[19:04:06] lemme know if you are still around and want to talk about it
[19:35:33] Data-Engineering, Data-Engineering-Kanban: Hive query failure in Jupyter notebook on stat1005 - https://phabricator.wikimedia.org/T297734 (jwang) Open→Resolved @mforns @BTullis, thank you both for explaining the details. I consider this issue fixed and am marking it as resolved.
[19:48:07] Data-Engineering, Stewards-and-global-tools: Collect information about users affected by blocks - https://phabricator.wikimedia.org/T297051 (Tgr) Differential privacy seems like a tricky issue here, unless queries are limited to large ranges.
[23:11:28] Data-Engineering, Superset: Document and share Superset Hive Date Filter Guidance - https://phabricator.wikimedia.org/T299681 (odimitrijevic)
[23:25:32] Data-Engineering, Data-Engineering-Kanban, Superset: Document and share Superset Hive Date Filter Guidance - https://phabricator.wikimedia.org/T299681 (odimitrijevic)
[23:26:12] Data-Engineering, Data-Engineering-Kanban, Superset: Document and share Superset Hive Date Filter Guidance - https://phabricator.wikimedia.org/T299681 (odimitrijevic) p:Triage→High