[07:30:26] !log kill remaining processes of rhuang-ctr on stat1004 and an-test-client1001 (user offboarded, but still holding jupyter notebooks etc..). Puppet was broken trying to remove the user.
[07:30:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:36:57] hi folks, is it ok to reset-fail mediawiki-history-drop-snapshot on an-launcher1002?
[07:38:07] !log systemctl reset-failed mediawiki-history-drop-snapshot on an-launcher1002 (opened since a week ago)
[07:38:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:40:15] also an-launcher1002 still runs with puppet disabled, there is a timer that may need to be absented for a few days (netflow internal, no data going to the kafka topic due to the Marseille dc being set up)
[07:47:31] RECOVERY - Check unit status of mediawiki-history-drop-snapshot on an-launcher1002 is OK: OK: Status of the systemd unit mediawiki-history-drop-snapshot https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:30:31] !log Deploying analytics/refinery on hadoop-test only.
[09:30:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:08:26] Data-Engineering: The network_internal druid load job fails if data is not present - https://phabricator.wikimedia.org/T302263 (BTullis)
[10:13:30] elukey: I've made a CR to disable that timer and a task to fix the job - I presume that we'd want to fix it so that a lack of input data doesn't cause the job to fail, right?
[10:13:48] * btullis I'll also look into the mediawiki-history-drop-snapshot job.
[10:13:57] I'll also look into the mediawiki-history-drop-snapshot job.
[10:16:28] hello! could someone please confirm that T302069 and T302230 are ok from your side?
[10:16:29] T302069: Request membership in Analytics group for Aqu - https://phabricator.wikimedia.org/T302069
[10:16:29] T302230: Request gerrit membership in analytics repos for ntsako - https://phabricator.wikimedia.org/T302230
[10:17:44] btullis: thanks! IIUC from Joseph the job fails on purpose to highlight the fact that no data is present (since normally it shouldn't happen). So in this case we can absent and sync with netops about re-enabling when the Marseille DC is up (lemme know if it makes sense)
[10:19:30] taavi: Yes, I can corroborate both of those requests. I'll make a note on the tickets.
[10:20:43] elukey: Yes, makes perfect sense. I just didn't know that the 'fail-safe' option should be to exit with an error on no data available from the source.
[10:25:27] Analytics, Gerrit-Privilege-Requests: Request membership in Analytics group for Aqu - https://phabricator.wikimedia.org/T302069 (BTullis) I can confirm that @Antoine_Quhen is on the analytics team and I fully endorse his request for these privileges. I'll mention @odimitrijevic as his (and my) manager a...
[10:26:31] Analytics, Gerrit-Privilege-Requests: Request membership in Analytics group for Aqu - https://phabricator.wikimedia.org/T302069 (Majavah) Open→Resolved a: Majavah Done.
[10:30:22] btullis: both of those are done now.
[10:30:54] Thanks!
[10:31:26] Excellent, many thanks taavi.
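The "fail loudly on missing input" design discussed above (the druid load job erroring out when no netflow data is present, so that silence never masks a broken feed) can be sketched as follows. This is a hypothetical illustration, not the actual refinery job code; the function and names are invented.

```python
# Hypothetical sketch of the fail-on-missing-data pattern discussed above:
# under normal conditions the routers always emit netflow records, so an
# empty input is treated as an error rather than an empty-but-successful run.

def load_to_druid(input_files):
    """Stand-in loader: refuses to proceed when there is no input data."""
    if not input_files:
        # Exiting with an error makes the systemd timer alert, which is the
        # desired behaviour when data is unexpectedly absent.
        raise RuntimeError("no input data found; refusing to load an empty dataset")
    return len(input_files)  # placeholder for the real ingestion step
```

When the feed is known to be off (as while the Marseille DC is being set up), the timer is absented instead, rather than letting every run alarm.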
[10:37:52] !log re-enabled puppet on an-launcher1002, having absented the network_internal druid load job
[10:37:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:01:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[11:06:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[12:17:23] Data-Engineering: The network_internal druid load job fails if data is not present - https://phabricator.wikimedia.org/T302263 (BTullis) Perhaps this behaviour is desired. Under normal circumstances the routers would always be generating data, so perhaps it would be correct for the job to exit with an error...
[12:17:52] Data-Engineering: The network_internal druid load job fails if data is not present - https://phabricator.wikimedia.org/T302263 (BTullis)
[12:18:08] Data-Engineering: The network_internal druid load job fails if data is not present - https://phabricator.wikimedia.org/T302263 (BTullis)
[12:20:35] Data-Engineering, Data-Engineering-Kanban: Matomo mariadb metrics are not being scraped by prometheus - https://phabricator.wikimedia.org/T299762 (BTullis) Open→Resolved
[12:23:17] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Stop ingesting data to the old AQS cluster - https://phabricator.wikimedia.org/T302276 (BTullis)
[12:24:59] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Decommission old AQS cluster nodes - https://phabricator.wikimedia.org/T302277 (BTullis)
[12:25:10] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Decommission old AQS cluster nodes - https://phabricator.wikimedia.org/T302277 (BTullis)
[12:25:14] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Stop ingesting data to the old AQS cluster - https://phabricator.wikimedia.org/T302276 (BTullis)
[12:27:08] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Final cleanup tasks related to the AQS cluster migration - https://phabricator.wikimedia.org/T302278 (BTullis)
[12:28:36] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Switch over the Cassandra AQS cluster to the new hosts - https://phabricator.wikimedia.org/T297803 (BTullis) This is now completed. I have created three further tasks within the epic to cover the decommisisoni...
[12:56:50] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Define the Kubernetes Deployments for Datahub - https://phabricator.wikimedia.org/T301454 (BTullis) I have a blocker on this and I can't seem to work out the right way to get past it. Currently when I run the following: `h...
[13:39:06] Data-Engineering, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Log_param is redacted in wiki replica when only comment and/or user should be - https://phabricator.wikimedia.org/T301943 (nskaggs)
[13:45:15] Data-Engineering, DBA, Data-Services, MediaWiki-extensions-FlaggedRevs: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (Kormat) -cloud-services, +data-engineering, as apparently the responsibility has moved...
[15:33:21] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Define the Kubernetes Deployments for Datahub - https://phabricator.wikimedia.org/T301454 (JMeybohm) I now see that it might have been smart to answer here as this is a pretty good problem description, sorry for that. >>...
[15:58:01] Quarry, cloud-services-team (Kanban): quarry-nfs-1 went down; quarry is offline - https://phabricator.wikimedia.org/T302154 (RhinosF1)
[16:31:53] razzi: I forgot to add reviewers to my annotations patch. It's been broken for a while and I found another mobile layout bug when I was clicking through, so I'll submit that as well and try and deploy them both next week maybe
[16:36:58] Data-Engineering-Kanban, Data-Catalog: [[wikitech:Data Catalog Application Evaluation Rubric]] links to some non-public Google Doc "execution plan" - https://phabricator.wikimedia.org/T299900 (Milimetric) Thanks for flagging, pointed to the published version (that google doc was just a draft for the on-w...
[16:50:45] hello, I'm trying to use mariadb.run() from the wmfdata python package, and since recently, it appears to error out with `AttributeError: 'str' object has no attribute 'decode'` for some queries, see https://phabricator.wikimedia.org/P21317 as an example of a failing query
[16:51:47] urbanecm: it's a known issue that we're trying to fix in https://github.com/wikimedia/wmfdata-python/pull/29
[16:52:16] thanks Nettrom! Are there any known workarounds?
[16:53:54] urbanecm: maybe, let me try something
[16:56:44] urbanecm: the mentorship weight is always an integer? casting it to that (or to varchar) gets around it, e.g. SELECT up_user, CAST(up_value AS INTEGER) AS mentorship_weight
[16:58:52] thanks!
[17:01:34] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Define the Kubernetes Deployments for Datahub - https://phabricator.wikimedia.org/T301454 (BTullis) Many thanks @JMeybohm for those clues. I have now updated my WIP patch and `helm lint` is happy at all levels, although th...
[18:07:20] Data-Engineering, Product-Analytics: wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (ldelench_wmf) p: Triage→Low
[18:07:53] Data-Engineering, Airflow, Platform Engineering: Replace Airflow's HDFS client (snakebite) with pyarrow - https://phabricator.wikimedia.org/T284566 (EChetty) a: EChetty
[18:19:52] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Kerberos identity for bmansurov - https://phabricator.wikimedia.org/T300450 (razzi) Hi @bmansurov, check your email and follow the instructions there. Apologies for the 3-week delay!
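The `AttributeError: 'str' object has no attribute 'decode'` above is the usual Python 3 symptom of calling `.decode()` on a value that is already a `str` (only `bytes` has `.decode()`); wmfdata's decode path presumably trips on columns the driver returns as `str` rather than `bytes`. A minimal, self-contained reproduction (not wmfdata's actual code path):

```python
# In Python 3, only bytes has .decode(); calling it on str raises
# AttributeError. CASTing the column in SQL (as suggested above, e.g.
# CAST(up_value AS INTEGER)) changes the type the driver hands back,
# so the failing decode step is skipped.

value = "42"  # a column value that already arrived decoded, as str

try:
    value.decode("utf-8")
    error_message = None
except AttributeError as err:
    error_message = str(err)

# bytes, by contrast, decodes fine:
assert b"42".decode("utf-8") == "42"
```

This is why the CAST workaround helps even though nothing about the data itself is wrong: it only changes the returned column type.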
[18:29:31] (PS3) Joal: [WIP] Add flink job reporting webrequest patterns [analytics/refinery/source] - https://gerrit.wikimedia.org/r/763610
[18:35:32] Data-Engineering, Data-Catalog: Connect MVP to Hive metastore [Mile Stone 4] - https://phabricator.wikimedia.org/T299897 (EChetty) a: Milimetric
[18:37:21] (CR) jerkins-bot: [V: -1] [WIP] Add flink job reporting webrequest patterns [analytics/refinery/source] - https://gerrit.wikimedia.org/r/763610 (owner: Joal)
[18:42:25] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Configure MariaDB database for DataHub on an-coord1001 - https://phabricator.wikimedia.org/T301459 (Milimetric)
[19:11:03] (PS4) Joal: [WIP] Add flink job reporting webrequest patterns [analytics/refinery/source] - https://gerrit.wikimedia.org/r/763610
[19:17:54] (PS5) Joal: [WIP] Add flink job reporting webrequest patterns [analytics/refinery/source] - https://gerrit.wikimedia.org/r/763610
[19:45:28] Data-Engineering, Data-Catalog: Connect MVP to Hive metastore [Mile Stone 4] - https://phabricator.wikimedia.org/T299897 (Milimetric) Note to self mostly: I have opened a few threads in DataHub slack about push-based ingestion. It looks like we have to write it ourselves, but I'm following up with a few...
[20:29:28] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Structured-Data-Backlog, and 3 others: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834 (Gehel)
[22:36:17] Data-Engineering, Data-Engineering-Kanban: Kerberos identity for bmansurov - https://phabricator.wikimedia.org/T300450 (bmansurov) Hi @razzi, thanks for the instructions. I got my access.
[23:58:11] (PS36) AGueyte: Basic ipinfo instrument setup [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415)