[06:07:37] good morning :) [06:08:37] razzi,mforns - puppet was broken on an-launcher1002, I am fixing the code for RU (nothing major) but let's be more careful when merging next time :) [06:30:18] ok puppet running on an-launcher1002, I moved the hdfs rsync logs for RU to /tmp/reportupdater-logs [06:30:38] the dir has r+x for other as expected, but the subdirs don't [06:30:54] that is our dear umask [06:33:48] so --chmod D755,F644 is needed for hdfs-rsync [06:33:49] D755: Add visual error handling to dashboard and detail - https://phabricator.wikimedia.org/D755 [06:33:52] lol [06:41:13] PROBLEM - Check unit status of reportupdater-published_cx2_translations_mysql on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:43:09] temporarily stopped the RU timers, something is off [06:43:27] reportupdater-published_cx2_translations_mysql[12772]: /usr/bin/python3: can't open file '/update_reports.py': [Errno 2] No suc [06:47:45] it may be https://gerrit.wikimedia.org/r/c/operations/puppet/+/692909/5/modules/reportupdater/manifests/init.pp [06:48:02] now that I see when I unblocked puppet on an-launcher1002 some RU timers changed as well [06:50:04] o/ [06:50:27] dcausse: o/ [06:50:48] what was the purpose of the maintenance yesterday? did this affect hadoop jars or not at all? [06:51:34] we tried to reimage an-master1002 to Buster but we failed in the pre-steps (saving the HDFS fsimage etc..) so the maintenance was aborted [06:51:52] are you seeing weird issues? 
[06:52:31] elukey: yes but certainly be totally unrelated [06:53:15] the flink app is having hard times to send messages to some kafka-main topic, the stacktrace is not helpful sadly, I'll have to debug a bit more [06:53:28] super [07:02:37] ok I think I've fixed it with https://gerrit.wikimedia.org/r/c/operations/puppet/+/695078 [07:05:41] 10Analytics, 10Analytics-Kanban, 10WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (10awight) @mforns Fantastic! I may be peeking too early, but found some strange HDFS file permissions. > hdfs dfs -ls /tmp/reportupdater/ > ls: Permission denied:... [07:08:19] 10Analytics, 10Analytics-Kanban, 10WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (10elukey) @awight yes work in progress, I am fixing the code since it led to some unwanted side effects. The final dir will be /tmp/reportupdater-logs, but there a... [07:13:27] RECOVERY - Check unit status of reportupdater-published_cx2_translations_mysql on an-launcher1002 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations_mysql https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:13:34] gooood [07:25:20] mmmm one thing that I noticed the code review https://gerrit.wikimedia.org/r/c/operations/puppet/+/692909 is that the hdfs-rsync in refinery was used, not the one in hdfs-tools [07:25:27] I wasn't aware of a python version of it [07:33:17] ok going to write in the task [07:41:24] 10Analytics, 10Analytics-Kanban, 10WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (10elukey) @mforns I did some follow ups to https://gerrit.wikimedia.org/r/692909: * moved the log dir to `/tmp/reportupdater-logs`, since the `bigtop::hadoop::di... 
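The umask effect mentioned at 06:30:54 can be reproduced on a local filesystem. What follows is a minimal Python sketch, assuming a umask of 027 (the log does not state the actual value) and using the hypothetical helper names `make_tree`, `mode` and `force_modes`. It only mimics the result of `D755,F644` on a local directory tree; it is not how hdfs-rsync implements `--chmod`.

```python
import os
import stat


def make_tree(root, umask):
    """Create root/subdir/report.log under the given umask (assumed value)."""
    old = os.umask(umask)
    try:
        sub = os.path.join(root, "subdir")
        os.mkdir(sub)  # requested mode 0o777, reduced by the umask
        path = os.path.join(sub, "report.log")
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
        os.close(fd)
    finally:
        os.umask(old)  # always restore the previous umask
    return sub, path


def mode(path):
    """Return only the permission bits of path."""
    return stat.S_IMODE(os.stat(path).st_mode)


def force_modes(root, dir_mode=0o755, file_mode=0o644):
    """Apply explicit modes to every directory and file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        os.chmod(dirpath, dir_mode)
        for name in filenames:
            os.chmod(os.path.join(dirpath, name), file_mode)
```

With a restrictive umask the subdirectory loses r+x for "other", which matches the symptom described above; forcing the modes afterwards restores it regardless of the umask in effect when the files were written.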
[08:51:37] I found a discrepancy in analytics -> kafka-main connectivity that perhaps explain the problems I see [08:52:04] ahhhhh wait the new hosts! [08:52:08] I can't access kafka-main1004 nor kafka-main1005 [08:52:19] yes yes Keith added them yesterday [08:52:28] it just came to mind when you mentioned it [08:52:31] :) [08:52:59] there was a code review out for the firewall but probably it wasn't merged [08:53:02] checking [09:04:20] dcausse: we are missing https://gerrit.wikimedia.org/r/c/operations/homer/public/+/695192/, going to ask for a review + merge in a bit [09:04:35] elukey: thanks! [09:15:56] dcausse: done! [09:16:12] elukey: thanks for the quick fix! testing [09:23:14] works great now :) [09:26:39] nice [09:26:46] \o/ [09:57:27] (03CR) 10Silvan Heintze: [C: 03+2] Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [09:59:17] (03Merged) 10jenkins-bot: Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [10:21:55] (03PS1) 10Ladsgroup: Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/695040 (https://phabricator.wikimedia.org/T281356) [10:22:02] (03CR) 10Ladsgroup: [C: 03+2] Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/695040 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [10:23:11] (03Merged) 10jenkins-bot: Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/695040 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [10:34:25] 10Analytics, 10Analytics-EventLogging: Replace Content::getNativeData() calls with TextContent::getText() in EventLogging - 
https://phabricator.wikimedia.org/T283671 (10Aklapper) [11:28:59] 10Analytics: Requesting Kerberos password for bumeh-ctr - https://phabricator.wikimedia.org/T283710 (10Bumeh-ctr) [12:03:55] 10Analytics, 10Analytics-SWAP: Notebook machine to double as RStudio Server? - https://phabricator.wikimedia.org/T190769 (10yuvipanda) PAWS now supports rstudio: https://hub.paws.wmcloud.org/hub/user-redirect/rstudio Code at https://github.com/toolforge/paws/blob/6719585734beab915db727338850597ba2842060/image... [13:06:05] mornin! [13:22:37] 10Analytics: Refine: Use Spark SQL instead of Hive JDBC - https://phabricator.wikimedia.org/T209453 (10Ottomata) a:05Ottomata→03None [13:23:15] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Goal, 10Services (watching): Modern Event Platform: Stream Connectors - https://phabricator.wikimedia.org/T214430 (10Ottomata) a:05Ottomata→03None [13:27:30] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Ottomata) Approved. [13:38:31] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Request for access to analytics-privatedata-users - https://phabricator.wikimedia.org/T283190 (10Ottomata) Hello! Instructions weren't clear before, but we'd like to have all access request follow the same template. I just updated https://wikitech.wikime... [13:41:25] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Ottomata) [13:41:52] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Ottomata) @schoenbaechler I edited the task description, but there are still a few fields blank. Can you edit and fill out the rest? 
[13:42:46] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Ottomata) [13:44:43] (03PS1) 10Ottomata: Revert "Update cassandra jobs for double loading" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/695308 [13:47:05] (03PS1) 10Ottomata: Update changelog for 0.1.12 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/695300 [13:47:55] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Update changelog for 0.1.12 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/695300 (owner: 10Ottomata) [13:48:10] Starting build #89 for job analytics-refinery-maven-release-docker [13:54:28] hello, oops [13:54:48] (03CR) 10Hnowlan: [C: 03+1] Revert "Update cassandra jobs for double loading" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/695308 (owner: 10Ottomata) [14:01:16] Project analytics-refinery-maven-release-docker build #89: 09SUCCESS in 13 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/89/ [14:09:33] Starting build #47 for job analytics-refinery-update-jars-docker [14:10:01] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.1.12 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/695332 [14:10:02] Project analytics-refinery-update-jars-docker build #47: 09SUCCESS in 29 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/47/ [14:16:33] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add refinery-source jars for v0.1.12 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/695332 (owner: 10Maven-release-user) [14:16:43] 10Analytics, 10Analytics-Kanban, 10WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (10mforns) @elukey Thanks a lot for your fixes! We can switch to the scala one, that was just me being uninformed. Will prepare the patch. 
[14:17:06] !log Deployed refinery-source using jenkins [14:17:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:18:08] !log deploying refinery... [14:18:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:19:00] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Revert "Update cassandra jobs for double loading" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/695308 (owner: 10Ottomata) [14:31:49] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10schoenbaechler) [14:32:09] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10schoenbaechler) Thanks a lot @Ottomata — I filled it out! 👍 [14:54:38] (03PS1) 10Phuedx: universalLanguageSelector: Add skin and skinVersion properties [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/695350 [14:57:30] joal: o/ yt? [14:57:36] going to restart daily cassandra jobs [15:20:01] (03PS7) 10Martaannaj: Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152 [15:25:33] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:33:17] looking into this ^ [15:37:43] razzi: o/ did you open a task for yesterday's saveNamespace problem? 
If not I can create one, I think I have a high level idea of what happened [15:38:06] not yet, you can go ahead and create one elukey [15:46:55] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10JAnstee_WMF) [15:51:14] ack! [15:52:50] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10JAnstee_WMF) This is approved on GDI end however, I am not sure if you also need our Director (sbodington) to also sign off Also regarding NDA, @KFrancis Benjamin still... [16:00:23] 10Analytics: hdfs dfsadmin saveNamespace fails on an-master1001 - https://phabricator.wikimedia.org/T283733 (10elukey) p:05Triage→03High [16:01:30] razzi, joal --^ [16:01:51] thanks a lot elukey [16:02:23] 10Analytics: hdfs dfsadmin saveNamespace fails on an-master1001 - https://phabricator.wikimedia.org/T283733 (10elukey) [16:03:17] a-team make sure you join the meeting with the cto hangout, not standup [16:07:53] 10Analytics: hdfs dfsadmin saveNamespace fails on an-master1001 - https://phabricator.wikimedia.org/T283733 (10elukey) [16:08:39] joal: it seems that the default timeout is 45 seconds and we cross it when saving the fsimage :( [16:08:49] (timeout for health checks from zkfc -> namenode) [16:08:52] /facepalm :( [16:11:38] but all the logs make now sense, it was a matter of connecting the dots [16:11:55] razzi: it is a very "nice" use case to study, let me know later on if you have questions [16:12:14] not really sure what the solution could be [16:12:42] bumping the timeout may be a quick win but not sure about the consequences [16:21:34] https://apachecon.com/acasia2021/tracks/bigdata.html [16:22:48] :) [16:22:49] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10KFrancis) @JAnstee_WMF Hi Jaime, Benjamin is on the current WMF 
contractors list, so his NDA is covered under the contractor employee agreement. Please proceed with the a... [16:25:41] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Apache Beam go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (10Nuria) My thoughts on the proposals: - Anything related to fingerprinting is a slippery slope, today we toss it, and in the future we... [16:29:11] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) [16:30:59] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) @KFrancis or @JAnstee_WMF I would need the expiration date for the contract so I can prepare the patch with that info. [16:33:11] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) @Ottomata does this user need to have kerberos? [16:34:08] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Ottomata) Yes, thank you. [16:34:24] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Ottomata) Basically, if ssh + analytics-privatedata-users, kerberos is needed. [16:38:16] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) thanks! [17:01:20] ottomata: just checking, but if i send a message to eventgate asking for a topic that doesn't exist yet, it should create the topic in kafka suitable for low volume traffic? [17:01:29] * ebernhardson didn't try yet [17:06:45] elukey: heya - do you remember if presto impersonates users, or if it's a single user with read writes? 
[17:07:28] joal: it does impersonate users yes (I have to verify but I am 99% sure) [17:08:07] 10Analytics-Radar, 10Reading Depth: Publish aggregated reading time dataset - https://phabricator.wikimedia.org/T230642 (10Groceryheist) 05Open→03Resolved [17:10:00] 10Analytics, 10Product-Analytics, 10Epic: Provide feature parity between the wiki replicas and the Analytics Data Lake - https://phabricator.wikimedia.org/T212172 (10Groceryheist) [17:10:13] 10Analytics, 10Product-Analytics, 10Epic: Add wikidata ids to data lake tables - https://phabricator.wikimedia.org/T221890 (10Groceryheist) 05Open→03Stalled I'm not available to work on this, @JAllemandou's data served my purpose but it seems like there was some interest in maintaining a table like this. [17:11:56] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10KFrancis) @Marostegui Hello, It's 8/31/2021 [17:21:57] joal: nice job with all these deep dives!!!!!! <3 [17:22:11] thanks ottomata :) <3 [17:22:29] * joal likes to try to explain stuff I try hard to understand :) [17:23:25] joal: gonna restart the oozie jobs [17:23:45] are these [17:23:46] coord_unique_devices_daily [17:23:46] and [17:24:22] coord_pageview_top_percountry_daily [17:24:23] ? [17:29:31] !log killing and restarting oozie cassandra loader jobs coord_unique_devices_daily and coord_pageview_top_percountry_daily after revert of oozie job to load to cassandra 3 [17:29:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:29:44] yessir - excuse me I took a gentle break :) [17:30:04] ottomata: Thank you for the restart :) [17:31:49] joal: are the queue names different now? 
[17:31:55] -Dqueue_name='production' \ [17:32:18] ottomata: it should be :) [17:32:28] that's correct ^^ [17:32:39] joal: about to start with commands on train etherpad [17:32:40] https://etherpad.wikimedia.org/p/analytics-weekly-train [17:32:43] lines 33 - 46 [17:32:47] can you double check for me? [17:33:58] this looks correct ottomata [17:34:12] ty [17:35:17] ok done [17:39:18] joal: cool about alluxio fuse [17:39:25] i wonder if it would be more stable than the hdfs fuse mount we use [17:39:37] I wish! [17:41:26] anything is more stable than the hdfs fuse mount :P [17:52:42] * elukey afk! [18:04:39] 10Analytics, 10Product-Analytics, 10Epic: Add wikidata ids to data lake tables - https://phabricator.wikimedia.org/T221890 (10JAllemandou) Actually this table is now production-style on the cluster, at path `hdfs:///wmf/data/wmf/wikidata/item_page_link`, or hive table `wmf.wikidata_item_page_link`. It is rel... [18:18:17] (03PS1) 10Ottomata: Fix for ProduceCanaryEvents exit val [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/695443 (https://phabricator.wikimedia.org/T270138) [18:25:08] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10JAnstee_WMF) @KFrancis thanks for confirming that! @Ottomata Yes, Benjamin will need kerberos, he has also filed a separate [[ https://phabricator.wikimedia.org/T283710... 
[18:30:39] (03CR) 10Ottomata: [C: 03+2] Fix for ProduceCanaryEvents exit val [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/695443 (https://phabricator.wikimedia.org/T270138) (owner: 10Ottomata) [18:32:17] (03PS1) 10Ottomata: Update changelog for 0.1.13 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/695444 [18:32:51] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Update changelog for 0.1.13 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/695444 (owner: 10Ottomata) [18:39:06] (03Merged) 10jenkins-bot: Fix for ProduceCanaryEvents exit val [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/695443 (https://phabricator.wikimedia.org/T270138) (owner: 10Ottomata) [18:41:42] (03Merged) 10jenkins-bot: Update changelog for 0.1.13 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/695444 (owner: 10Ottomata) [18:42:25] Starting build #90 for job analytics-refinery-maven-release-docker [18:56:37] Project analytics-refinery-maven-release-docker build #90: 09SUCCESS in 14 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/90/ [19:12:03] Starting build #48 for job analytics-refinery-update-jars-docker [19:12:34] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.1.13 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/695458 [19:12:39] Project analytics-refinery-update-jars-docker build #48: 09SUCCESS in 35 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/48/ [19:14:02] !log deploying refinery and refinery source 0.1.13 [19:14:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:16:52] oops ^ didn't actually merge the refinery commit [19:18:28] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add refinery-source jars for v0.1.13 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/695458 (owner: 10Maven-release-user) [19:59:17] 10Analytics, 10SRE, 10SRE-Access-Requests: Requesting 
access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) @Bumeh-ctr can you post your ssh key on wikitech with your bumeh-ctr account on your, user page for instance, so it can be verified? Alternatively, it can also... [20:30:18] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:42:24] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:53:10] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:33:26] 10Analytics, 10Analytics-SWAP: Notebook machine to double as RStudio Server? - https://phabricator.wikimedia.org/T190769 (10mpopov) This is awesome, thank you @yuvipanda! Couple of issues encountered so far. There's an issue with the latest stable release of RStudio and R 4.1 (https://twitter.com/kevin_ushey/... [21:42:44] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (10Ottomata) @elukey @razzi Airflow [[ https://gerrit.wikimedia.org/r/c/operations/debs/airflow/+/693222 | deb ]] and [[... [22:48:18] 10Quarry, 10Cloud-Services: Quarry links to IRC are broken - https://phabricator.wikimedia.org/T283773 (10Ladsgroup) [23:58:28] 10Analytics, 10Data-release, 10Privacy Engineering, 10Research, 10Privacy: Apache Beam go prototype code for DP evaluation - https://phabricator.wikimedia.org/T280385 (10Isaac) Thanks @nuria for weighing in! 
> Anything related to fingerprinting is a slippery slope, today we toss it, and in the future we...