[07:22:03] Analytics: Review use of realloc in varnishkafka - https://phabricator.wikimedia.org/T287561 (elukey)
[08:14:58] https://blogs.apache.org/foundation/entry/the-apache-cassandra-project-releases - cassandra 4 released!
[09:05:10] Analytics, Dumps-Generation, Pageviews-Anomaly: Monthly Wikimedia pageviews dumps cann't be decompressed - https://phabricator.wikimedia.org/T287565 (ArielGlenn) The monthly files are not in bz2 format; they are in https://en.wikipedia.org/wiki/Apache_Parquet. That's always been the format for the mo...
[09:12:01] Analytics-Clusters, Analytics-Kanban: Disk filling up on `/` on an-coord1001 - https://phabricator.wikimedia.org/T279304 (BTullis) Great catch @elukey. I'll try that test today if I can get agreement to restart the daemons.
[09:19:01] Hi a-team - I'd like to restart the hive daemons on an-test-coord1001 at some point soon. Should I schedule a maintenance window for this, or should I simply go ahead? Are there any other users of the test cluster whose permission I should seek as well? For context: https://phabricator.wikimedia.org/T279304#7240595
[10:16:10] btullis: nono you can go ahead any time, it is the test cluster
[10:16:23] (sorry just seen the msg)
[10:17:01] Cool. Will do then, thanks.
[10:46:13] !log btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl stop hive-server2.service hive-metastore.service
[10:46:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:46:21] !log btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl start hive-metastore.service hive-server2.service
[10:46:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:58:47] Analytics-Clusters, Analytics-Kanban: Disk filling up on `/` on an-coord1001 - https://phabricator.wikimedia.org/T279304 (BTullis) As suggested, I have taken a manual copy of the configuration file: ` btullis@an-test-coord1001:/etc/hive/conf$ sudo cp hive-log4j.properties hive-log4j2.properties ` I res...
[11:08:00] Analytics-Clusters, Analytics-Kanban: Disk filling up on `/` on an-coord1001 - https://phabricator.wikimedia.org/T279304 (BTullis) Then again, maybe it's fine. I think that we have most of the error levels set to WARN apart from our custom logging levels, which are all set to INFO. Some excerpts from the...
[12:57:11] Analytics-Clusters, Analytics-Kanban: Disk filling up on `/` on an-coord1001 - https://phabricator.wikimedia.org/T279304 (BTullis) I haven't managed to get anything to appear in those logs yet, although I've only tried a couple of test queries from the hive cli. I'm wondering if we should potentially lo...
[13:03:06] Analytics-Clusters, Analytics-Kanban: Disk filling up on `/` on an-coord1001 - https://phabricator.wikimedia.org/T279304 (Ottomata) Those improvements mostly ended up in a work around. Maybe switching to a log4j2 config would avoid them in the first place? Thanks Ben!
[13:13:38] jbond: i'd like to move forward with my admin user patch, i'm going to make a patch on your homedir one and see if you like it
[13:13:40] and try to get things merged
[13:13:42] s'ok?
[13:14:44] Analytics-Clusters, Analytics-Kanban: Disk filling up on `/` on an-coord1001 - https://phabricator.wikimedia.org/T279304 (BTullis) I have found the templates that seem to have been supplied with the original bigtop distro and copied both the `hive-log4j2.properties` and `hive-exec-log4j2.properties` temp...
[13:15:26] ottomata: ack let me ping mori.tzm
[13:16:10] ottomata: fyi i just sent a response regarding using undef
[13:16:15] ohk
[13:16:34] oh you can't pass undef?
[13:16:39] REaLLY!@??@?
[13:16:57] crazy i guess we could use false but at that point 'none' is just as good?
[13:17:14] yes it's a real pain, there are hacks to do it via hiera sort of but it's not pretty
[13:17:40] is moritz's comment suggesting to use '/root' instead of '/nonexistent'?
[13:18:16] ottomata: we chatted about that offline and decided to stick with /nonexistent
[13:18:25] ok
[13:18:34] ok gr8 no patch from me then just a big ol' +1
[13:19:05] ack morit.zm has also given +1 so will merge now
[13:19:08] ty
[13:19:59] merged
[13:20:48] gr8
[13:32:31] Analytics-Clusters, Analytics-Kanban: Disk filling up on `/` on an-coord1001 - https://phabricator.wikimedia.org/T279304 (BTullis) That manual restart seems to be working well with these vanilla log4j2 configurations. I'll prepare a patch to deploy these with suitable tweaks, instead of the existing log4...
[13:39:41] hi teammmm
[13:41:21] Hello mforns :-)
[13:41:42] :]
[13:44:26] Analytics, Analytics-Kanban, Patch-For-Review: Refactor profile::analytics::cluster::users - https://phabricator.wikimedia.org/T287063 (Ottomata)
[13:47:14] elukey: o/ we should probably set krb: present on these analytics system users, right?
[13:47:20] including 'analytics'
[13:47:29] actually, i don't know what setting krb: true in admin data.yaml even does
[13:47:36] is it just for accounting or does it actually do something?
[13:47:38] hi mforns !
[13:47:50] krb: present*
[13:47:59] heyy
[13:48:12] ottomata: o/ the krb: present is only used by the offboard python script
[13:49:45] it should be used in the future as part of the account consistency script
[13:49:48] so that we can validate Kerberos accounts against data.yaml
[13:50:41] ah ok
[13:50:51] gr8 so, should we put it on the system users?
[13:51:02] they do have krb principals
[13:51:09] mforns: btw, re hive connection
[13:51:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/698808
[13:51:17] will let us define connections like that in puppet
[13:51:31] that way we don't have to add them via the airflow UI for every instance
[13:51:41] need to follow up with that
[13:52:55] moritzm: should we set krb: present for the system users that have krb principals?
[13:53:53] yeah, let's do that, so that we can also account for them in the consistency check later
[13:55:14] ok gr8
[14:02:26] btullis: got a quick sec for a user/group/airflow naming brain bounce?
[14:02:32] these names are gonna stick so I want to see what you think
[14:02:44] re https://gerrit.wikimedia.org/r/c/operations/puppet/+/708159
[14:09:58] Analytics, Analytics-Kanban, Patch-For-Review: Refactor profile::analytics::cluster::users - https://phabricator.wikimedia.org/T287063 (Ottomata) p:Triage→High
[14:14:20] gmodena: fab, yt?
[14:16:10] yes
[14:16:39] working on your airflow instances :)
[14:16:52] i have to name them and make new user groups
[14:17:00] in https://gerrit.wikimedia.org/r/c/operations/puppet/+/708159/3/modules/admin/data/data.yaml
[14:17:06] i'm making two new groups
[14:17:13] analytics-research-admins and analytics-platform-eng-admins
[14:17:37] with system users analytics-research and analytics-platform-eng that will run the airflow instances
[14:17:56] the names of the airflow instances will be just 'research' and 'platform'eng
[14:18:03] so airflow-research, etc.
[14:18:09] these names are going to stick
[14:18:14] for a long time
[14:18:20] so wanted to see if you all had prefs or thoughts on them
[14:37:38] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (BTullis) Having discussed this in #wikimedia-serviceops I have decided to revert back to the CNAME option for the time being. The...
[14:44:50] fab: ^^ ? :)
[14:45:26] ottomata: Sorry I missed this message. Checking now.
[14:45:44] btullis: s'ok, lemme know if you wanna jump in bc to dsiuss
[14:45:45] discuss
[14:46:27] Analytics, Wikipedia-Android-App-Backlog (Android Release FY2021-22): android image_recommendation_interaction error - https://phabricator.wikimedia.org/T284620 (Ottomata) Let's leave the schema for now, you can leave the patch open and we'll see about it later. The stream config patch should be ok to g...
[14:55:36] ottomata: Yes, let's head to the bc if you don't mind. I think I get it, but it's the notion of instances that you might be able to make clearer for me.
[14:55:40] k
[15:01:41] ottomata ack
[15:02:04] ^ clarakosi
[15:05:42] ottomata names sounds good to me, or at least I have no other alternatives in mind :). Q: if a dag runs as analytics-platform-eng and stores data to HDFS, will the data be owned by group analytics-platform-eng-admins ?
[15:11:28] no, it will be owned by analytics-platform-eng:analytics-platform-eng
[15:12:03] hmm hang on
[15:12:06] trying to remember how this works now
[15:12:43] oh yes
[15:12:50] you'll have to sudo -u analytics-platform-eng to work with the data
[15:13:06] it should be readable though by analytics-platform-eng-admins
[15:13:07] ...
[15:13:09] making sure
[15:15:05] gmodena: i think that the default will be that analytics-platform-eng-admins cannot read the data, but what we do, is e.g.
hdfs dfs -chgrp analytics-platform-eng-admins any parent directories that your jobs will create data in
[15:15:17] newly created dirs and files in those hdfs dirs will then be group owned by analytics-platform-eng-admins
[15:15:20] and will be readable by them
[15:15:33] so that might be a bit of a manual step for new generated datasets
[15:15:46] but only needed once (perhaps it could be done in the airflow job?)
[15:16:09] ok, merging that and continuing
[15:16:10] :)
[15:16:11] ty
[15:16:56] ottomata thanks for clarifying!
[15:18:45] elukey: qq
[15:18:46] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_keytab_for_a_service
[15:18:54] i don't understand where the username is specified in that
[15:20:37] Analytics, Analytics-Kanban: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914 (BTullis) a:BTullis @Ottomata drew my attention to this task and think I may have an idea why it's happening, so I'm going to try to fix it at the same time as T279304. Ultimatel...
[15:21:06] Analytics, Dumps-Generation: Monthly Wikimedia pageviews dumps cann't be decompressed - https://phabricator.wikimedia.org/T287565 (MusikAnimal) #pageviews-anomaly is intended for anomalies with the pageviews data itself, such unusual spikes in traffic. This task describes an issue with dumps generation.
[15:28:24] elukey: in that example, should 'host' be the username?
[15:28:25] sretest1001.eqiad.wmnet,create_princ,host
[15:28:27] e.g.
[15:28:44] an-airflow1002.eqiad.wmnet,create_princ/analytics-research
[15:28:44] ?
[15:29:31] hmm i think so!
[15:29:33] the example is not right
[15:29:33] an-airflow1001.eqiad.wmnet,create_princ,analytics-search
[15:29:34] an-airflow1001.eqiad.wmnet,create_keytab,analytics-search
[15:29:36] for example
[15:29:42] righhht! ok
[15:29:48] editing
[15:29:54] perfect :)
[15:29:56] ottomata could you also tag clarakosi for airflow related work/pings? ty!
[15:30:04] gmodena: def will do!
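Editor's sketch of the keytab list format agreed on above: each line appears to be `<host>,<action>,<username>`, with one `create_princ` line and one `create_keytab` line per service user. The helper name `keytab_list_lines` is hypothetical and is not part of any Wikimedia tooling. The two action names and the host/user pairing are taken from the example lines quoted in the chat; anything beyond that is an assumption.

```python
from typing import Iterable, List, Tuple

# Actions exactly as they appear in the example lines quoted in the chat.
ACTIONS = ("create_princ", "create_keytab")


def keytab_list_lines(services: Iterable[Tuple[str, str]]) -> List[str]:
    """Build '<host>,<action>,<username>' lines for each (host, username) pair.

    Hypothetical helper for illustration only; the real list file is
    presumably written by hand following the wikitech page linked above.
    """
    lines = []
    for host, user in services:
        for action in ACTIONS:
            lines.append(f"{host},{action},{user}")
    return lines


if __name__ == "__main__":
    # Host and user taken from the question asked in the chat.
    pairs = [("an-airflow1002.eqiad.wmnet", "analytics-research")]
    for line in keytab_list_lines(pairs):
        print(line)
```

The point of the exchange is that the third field is the service username rather than the literal string `host`, and that fields are separated by commas, not a slash.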
[15:32:32] * clarakosi reads scrollback
[15:33:42] Analytics, Analytics-Kanban, Platform Engineering, Research, Patch-For-Review: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (Ottomata) Created kerberos principals and keytabs: ` [@krb1001:/home/otto] $ cat airflow-keytabs.list an-...
[15:37:05] thanks for getting this going ottomat
[15:46:12] Analytics, Analytics-Kanban: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914 (ssingh) Hi @BTullis: It's been a while since I ran a query on Hive but I am happy to help test this, so please let me know. Thanks!
[15:49:56] Analytics, Analytics-Kanban, Platform Engineering, Research, Patch-For-Review: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (Ottomata) Created airflow databases on an-coord1001: `lang=sql CREATE DATABASE airflow_research; CREATE U...
[15:56:05] Analytics, Analytics-Kanban: jupyter notebook causing syslog/etc.. to fill up with error messages - https://phabricator.wikimedia.org/T287339 (BTullis) @ChristineDeKock - Could you please ping me the next time you run this (or a similar) notebook please? I'm interested to catch it while it's running so t...
[16:00:33] yes indeed thanks ottomata, these names sound good to me too.
[16:03:20] ottomata: looks like your change is having issues rolling out
[16:03:41] https://phabricator.wikimedia.org/P16922
[16:03:56] jbond: yeah jynus alerted me too it looked like things were slowly going
[16:04:03] oh
[16:04:07] yeah
[16:04:13] which host is that?
[16:04:32] there are processes running as that user
[16:04:34] that is an-worker1135 but there are a few other analytics servers showing up in icinga
[16:04:36] i had to manually fix some of them
[16:04:40] yeah that makes sense
[16:04:41] i'll follow up
[16:04:45] in standup now
[16:04:45] thank you
[16:04:49] no probs
[17:50:38] Analytics, Wikipedia-Android-App-Backlog (Android Release FY2021-22): android image_recommendation_interaction error - https://phabricator.wikimedia.org/T284620 (Sharvaniharan) Thank you @Ottomata
[18:08:08] ottomata: I'm writing the docs about the Airflow POC, and I realized part of the POC has also been setting Airflow up via puppet no? Do you want me to add any comments regarding that to the docs?
[18:13:08] mforns like this? https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[18:13:55] ottomata: that is great as a reference, I will add the link
[18:14:12] ottomata: is there any observation, conclusion or caveat you want to mention?
[18:14:20] those could go into the design doc
[18:14:27] to start conversations
[18:15:46] hm
[18:16:02] mforns: i'm not sure, maybe something about the system users and groups?
[18:16:18] how we have to create a new set of system users and admin groups for each instance?
[18:16:18] yes, that'd be nice
[18:16:54] ottomata: I will add that, and then let you vet and modify if you want
[18:17:31] ok
[18:23:15] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-razzi: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (razzi) Open→Resolved
[18:23:18] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (razzi)
[18:24:19] Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (razzi)
[18:24:35] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (razzi) Open→Resolved a:razzi
[18:57:32] * razzi out for lunch
[20:28:11] razzi: do you know what I need to do to get PCC to work with a newly puppetized host?
[20:28:20] i just ran puppet for the first time on an-airflow1002
[20:28:27] works fine
[20:28:30] but PCC gives me
[20:28:37] v
[20:28:38] Unable to find fact file for: an-airflow1002.eqiad.wmnet
[20:29:17] oh perhaps i found it
[20:29:18] https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Updating_nodes
[20:29:37] i don't think i have membership in that cloud vps project
[20:39:09] Analytics-Dashiki, Analytics-Radar, CX-analytics, Language-analytics: The language-reportcard.wmflabs.org/cx2 chart is stuck at 2018-10-21 - https://phabricator.wikimedia.org/T208324 (nshahquinn-wmf) Open→Resolved The dashboard is currently working fine, so obviously this has been resolved.
[21:32:19] ottomata: I remember mucking around with that compiler-update-facts script, but I don't remember anything more than that
[21:33:10] yeah, I remember it was really slow and I had to rerun parts of it, so I ended up looking at the script source and running some commands manually
[22:04:40] (PS1) Sharvaniharan: Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597
[22:05:35] (CR) jerkins-bot: [V: -1] Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[22:11:56] (PS2) Sharvaniharan: Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597
[22:12:38] (CR) jerkins-bot: [V: -1] Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[22:20:24] (PS3) Sharvaniharan: Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597
[22:21:03] (CR) jerkins-bot: [V: -1] Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[22:45:13] (PS4) Sharvaniharan: Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597
[23:53:49] thanks razzi, brooke ran it for me! i've done the first puppet runs on the new airflow nodes you set up, thanks for that!