[00:02:19] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:48:03] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) @elukey @razzi I have merged the above patch, which allows dbstore1007 to be reimaged without formatting its /srv. It needs to have their mysql in...
[05:33:17] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) I have actually stopped mysql, as they are not replicating anyways. So we can reimage this host anytime.
[06:08:51] !log restart yarn nodemanager on analytics1075 to clear the un-healthy state after some days of downtime (one-off issue but let's keep an eye on it)
[06:08:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:11:20] on dmesg I found:
[06:11:21] cgroup: fork rejected by pids controller in /system.slice/hadoop-yarn-nodemanager.service
[06:12:05] but this was May 5th, so it survived that event
[06:13:02] I think it was a one off, let's keep an eye on it
[06:13:18] (didn't investigate exactly when it happened but it was days ago)
[06:54:11] Analytics: Requesting Kerberos password - https://phabricator.wikimedia.org/T284022 (Cervisiarius) It works, thanks!
[07:01:02] Good morning
[07:07:43] bonjour
[07:42:22] Analytics, Analytics-Kanban: Move WikimediaEventUtilities logging to Slf4j - https://phabricator.wikimedia.org/T284537 (JAllemandou) a:JAllemandou
[07:55:42] Analytics, Analytics-Kanban, Product-Analytics, SRE, and 2 others: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (lucyblackwell) Approved!
[07:56:44] Analytics, Analytics-Kanban, Product-Analytics, SRE, and 2 others: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Volans)
[07:58:06] Analytics-Radar, Product-Analytics, Product-Data-Infrastructure, Language-Team (Language-2021-April-June): All events in the contenttranslationabusefilter data stream failing validation - https://phabricator.wikimedia.org/T283872 (Pginer-WMF) p:Triage→Medium
[08:24:04] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Volans) Manual changes to Netbox have been done by me and @elukey, namely: - deleted the IPv4 and IPv6 from dbstore1007 - selected the VLAN for private1-d-eq...
[08:28:22] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (gmodena) Hey @Ottomata > [x] DAG dir and distribution > We'll need to set a directory in which airflow scheduler will...
[08:55:04] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (elukey) @Marostegui the new IPs seem to work as expected, if you want to kick off a reimage please go ahead but it should run fine from now on in theory (just...
[08:58:13] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) >>! In T283125#7141412, @elukey wrote: > @Marostegui the new IPs seem to work as expected, if you want to kick off a reimage please go ahead but i...
[08:59:06] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (elukey) I don't think it is needed, we can proceed with what we have :)
[09:21:22] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) MySQL started and catching up! Thank you all!! (including @volans!) I expect it to be in sync with the master by tomorrow - once done, I will enab...
[10:59:40] ottomata is it correct to assume the ganeti vms won't have outbound ssh to stat nodes?
[11:12:52] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Volans) The patch has been merged and deployed, it will be effective within ~30 minutes from now. @schoen...
[11:52:46] Analytics-Radar, Event-Platform, MW-1.36-notes (1.36.0-wmf.37; 2021-03-30), MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), Technical-Debt (Deprecation process): extensions/EventBus - Use UserGroupManager instead of User group methods - https://phabricator.wikimedia.org/T281825 (daniel)
[13:11:42] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) :) > Do you plan on setting up a log shipper to ELK? I had not planned on it, but I suppose we could! All lo...
[13:17:17] gmodena: no ssh that's right
[13:17:22] they are just like any node in prod
[13:17:49] gmodena: we could probably enable an rsync module like we do between the stat boxes, if you are thinking about copying files?
[13:18:19] elukey: mornin!
[13:18:23] about to do the roll restart of presto nodes
[13:18:33] anything I need to know other than just run the cookbook?
[13:21:47] ottomata: morning! Nope it should be very smooth
[13:21:54] gr8
[13:22:37] !log roll restarting analytics presto-servers - T283067
[13:22:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:24:40] elukey: we were looking into airflow hdfs stuff yesterday
[13:24:47] i guess this https://github.com/internetarchive/snakebite-py3/issues/8 is the main blocker for all that eh?
[13:25:47] yes it is a mess, I tried a lot to make it work but I gave up.. there is no traction from upstream, and every time the RPC format changes some adjustment will be needed..
[13:25:59] IIRC Airflow upstream told me that they would have moved to the hdfs client in pyarrow
[13:26:21] https://issues.apache.org/jira/browse/AIRFLOW-2697
[13:26:22] (that uses the c bindings provided by hadoop, so way more flexible)
[13:26:24] but no real work on it
[13:26:55] there is a stale PR but it uh...doesn't use pyarrow..it uses hdfs3 which the JIRA description says it shouldn't use.. ? oh well
[13:27:10] i think we are going to have to do some work to make that JIRA happen.
[13:27:45] yep I agree, but it shouldn't be horrible, and there is plenty of support from upstream
[13:28:16] yeah
[13:30:10] ottomata: for the roll restarts, I learned from Moritz that a good last step is to run 'lsof -Xd DEL' to see if any openjdk-related file/lib/etc.. is still held by some old process
[13:30:14] very handy
[13:30:46] (Moritz runs it anyway before calling a roll restart done so there is a safety net if we forget :D)
[13:31:16] it helps catching also manual processes that we forgot, issues with the cookbooks, etc..
[13:34:32] oh that is cool
[13:34:39] https://www.irccloud.com/pastebin/oZaKpt3g/
[13:34:41] oops
[13:34:44] -Xd DEL
[13:35:12] Analytics, Internet-Archive, Platform Engineering: Replace Airflow's HDFS client (snakebite) with pyarrow - https://phabricator.wikimedia.org/T284566 (Ottomata)
[13:35:20] elukey: does ^ sound like an accurate summary ?
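The 'lsof -Xd DEL' check described at 13:30:10 could be scripted roughly as below. This is only a minimal sketch, not how the roll-restart cookbook works: the function names and the default `openjdk` search string are assumptions made for illustration, and the filter is a plain substring match on each output line.

```python
import subprocess
from typing import List


def stale_library_users(lsof_output: str, needle: str = "openjdk") -> List[str]:
    """Return the lsof output lines that mention `needle`.

    The header row (which starts with COMMAND) is skipped.
    """
    return [
        line
        for line in lsof_output.splitlines()
        if needle in line and not line.startswith("COMMAND")
    ]


def check_host(needle: str = "openjdk") -> List[str]:
    """Run the command quoted in the chat and filter its output.

    Assumption: lsof is installed and the caller has enough privileges
    to see other users' processes (usually root).
    """
    result = subprocess.run(
        ["lsof", "-Xd", "DEL"],
        capture_output=True,
        text=True,
        check=False,  # treat a non-zero exit as "nothing to report" here
    )
    return stale_library_users(result.stdout, needle)
```

An empty list from `check_host()` would suggest no process still maps a deleted openjdk file; any returned line names a process worth restarting by hand.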
[13:35:39] Analytics, Internet-Archive, Platform Engineering: Replace Airflow's HDFS client (snakebite) with pyarrow - https://phabricator.wikimedia.org/T284566 (Ottomata)
[13:35:55] ottomata: yep!
[13:36:21] Analytics, Internet-Archive, Platform Engineering: Replace Airflow's HDFS client (snakebite) with pyarrow - https://phabricator.wikimedia.org/T284566 (Ottomata)
[13:40:31] elukey: same for zookeeper, ok to just do it?
[13:41:19] ottomata: you can run the cookbook, there is a pre-step that asks to check the output of some commands (namely the current state of the cluster)
[13:41:26] gr8
[13:41:33] if the status is good (one leader two followers) then it is safe
[13:42:39] !log roll restart an-conf zookeepers - T283067
[13:42:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:53:12] ottomata: the same cookbook can also run on all druid nodes
[13:54:10] Analytics, Analytics-Kanban, Patch-For-Review: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata) Oh, another TODO: I think we'll need to puppetize a [[ https://airflow.apache.org/docs/apache-airflow/stable/s...
[13:54:22] aye
[14:21:14] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (schoenbaechler) Hey @Volans - just tried out some dashboards on Superset, works as expected - thanks a lot! 👏
[14:22:19] Analytics, Analytics-Kanban, Product-Analytics, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (Volans) Stalled→Resolved Great, resolving.
[14:25:22] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) dbstore1007: s2, s3 and s4 are now up-to-date. GTID is in place. @razzi @elukey anything else to be done from our side before we can stop and rec...
[14:39:01] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (elukey) Yep we need to merge https://gerrit.wikimedia.org/r/698729 and verify that everything works as expected :)
[14:41:12] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) Excellent - let me know when we can proceed with the future plans for dbstore1004. Feel free to close this task once you are done from your side....
[14:48:17] (PS1) Mforns: Add wmdebannerevents schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/698798 (https://phabricator.wikimedia.org/T282562)
[14:49:24] (CR) Mforns: [V: +2 C: +2] Add wmdebannerevents schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/698798 (https://phabricator.wikimedia.org/T282562) (owner: Mforns)
[14:49:56] (Merged) jenkins-bot: Add wmdebannerevents schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/698798 (https://phabricator.wikimedia.org/T282562) (owner: Mforns)
[14:56:20] (PS1) Mforns: Add wmdebannerimpressions schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/698802 (https://phabricator.wikimedia.org/T282562)
[14:58:00] (CR) Mforns: [V: +2 C: +2] Add wmdebannerimpressions schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/698802 (https://phabricator.wikimedia.org/T282562) (owner: Mforns)
[14:58:29] (Merged) jenkins-bot: Add wmdebannerimpressions schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/698802 (https://phabricator.wikimedia.org/T282562) (owner: Mforns)
[15:04:02] (PS1) Mforns: Add wmdebannersizeissue schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/698804 (https://phabricator.wikimedia.org/T282562)
[15:05:10] (CR) Mforns: [V: +2 C: +2] Add wmdebannersizeissue schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/698804 (https://phabricator.wikimedia.org/T282562) (owner: Mforns)
[15:05:47] (Merged) jenkins-bot: Add wmdebannersizeissue schema to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/698804 (https://phabricator.wikimedia.org/T282562) (owner: Mforns)
[15:25:11] ottomata: Hey! Do you know if there are any specific requirements to add someone to a group for access in the analytics cluster? Specifically: https://gerrit.wikimedia.org/r/c/operations/puppet/+/698546
[15:26:15] I was looking at wikitech, but the best I can find seems to be https://wikitech.wikimedia.org/wiki/Analytics_Engineering
[15:26:22] hmm
[15:26:22] https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Team_specific_(they_do_not_grant_access_to_PII_data_on_Hadoop,_for_that_see_analytics-privatedata-users)
[15:26:27] i don't think there are any specific requirements
[15:26:37] if they already have shell access, etc. i think that's all?
[15:27:54] yep
[15:27:55] i'll merge
[15:28:06] ottomata: cool! Thanks!
[15:28:10] tanny411: ^^
[15:30:37] gehel: thanks :)
[15:30:59] tanny411: ottomata is the one who needs the thanks :)
[15:31:36] ottomata: double thanks!! :D
[15:31:45] yw!
[15:34:56] Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 (mforns)
[15:37:22] Analytics, SRE, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (JAnstee_WMF) @Dzahn While I have been able to access through Jupyter, I haven't been able to get the kerberos login. When I type in kinit it just hangs - Do I ne...
[16:38:15] Analytics, SRE, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Ottomata) @JAnstee_WMF no you shouldn't. Where are you typing 'kinit'? Into an ssh terminal or into a Jupyter shell terminal? Also, how are you accessing Jup...
[16:45:48] Analytics, Analytics-Kanban, Analytics-Wikistats: New Wikivoyages are only partially included in Stats - https://phabricator.wikimedia.org/T279564 (razzi) Open→Resolved
[16:47:49] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/698325 (https://phabricator.wikimedia.org/T284389) (owner: Gerrit maintenance bot)
[16:54:09] hello folks
[16:54:14] if you are ok I'd deploy https://gerrit.wikimedia.org/r/c/operations/dns/+/698729
[16:54:36] that will move analytics-mysql and other tools to dbstore1007
[16:55:41] ok I take it as yes :D
[16:56:57] !log move away from dbstore1004 in favor of dbstore1007 in analytics CNAME/SRV records (will affect analytics-mysql and sqoop)
[16:56:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:57:19] elukey: we're in retro sorry
[16:58:37] ah right of course I have to change the firewall rules as well, new ip
[17:05:17] yep now everything works :)
[17:05:33] just tested analytics-mysql itwiki (--print-target shows dbstore1007)
[17:06:04] razzi, ottomata - I think that we are ready to decom dbstore1004, but if you want to double check and confirm in the task it would be great :)
[17:06:10] (so Manuel can proceed with decom)
[17:08:02] tthanks luc!
[17:08:03] a
[17:17:31] Has anybody seen my dagbag?
[17:17:59] hahaha
[17:18:18] dagbagit you lost it?!
[17:18:48] (just wanted to write that, the word dagbag makes me smile)
[17:59:13] Analytics, SRE, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (JAnstee_WMF) @Ottomata Entered kinit in terminal. Accessed Jupyter hub via localhost:8880
[18:00:33] Analytics, SRE, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Ottomata) > Entered kinit in terminal Which terminal? In your browser in Jupyter or via ssh? Might be hard to troubleshoot this async, wanna ping me on IRC in #...
[18:02:18] Analytics, SRE, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Ottomata) > Accessed Jupyter hub via localhost:8880 Which stat box?
[18:25:19] ottomata we have a dependency (mysql data ingestion) on stat1004 https://gerrit.wikimedia.org/r/c/operations/homer/public/+/649706
[18:26:20] but I don't want to add more hacky solution on top of it. I hope the airflow work gives us some momentum to refactor that bit.
[19:00:40] gmodena: mayyyybe....the data can be written to mysql from a hadoop job as part of the pipeline?
[19:00:59] kinda like we do for druid and for cassandra?
[19:01:24] ebernhardson: maybe I've asked this before
[19:01:44] but do you use hdfs + airflow?
[19:01:47] and if so, how?
[19:02:27] asking because https://phabricator.wikimedia.org/T284566
[19:17:44] ottomata for now we don't do much with airflow + hdfs. We have two test jobs, and both write to HDFS via spark
[19:18:15] ottomata re writing to mysql from Hadoop, if that works for you we could propose the change to DBAs
[19:18:54] we have a soft SLO in terms of write throughput that mysql can accept for that specific job
[19:20:00] as long as we satisfy that - and don't send contention through the roof - hopefully it could be acceptable
[19:22:07] gmodena: we mostly would use the airflow hdfs integration for sensors
[19:22:17] ottomata: hmm, we don't really do much with hdfs+airflow, we have a tiny thing written that uses the cli hdfs client as a sensor
[19:22:34] ottomata we'd want that :)
[19:22:41] ebernhardson: do you mostly just schedule based on time schedule?
[19:22:55] ottomata: hive partitions
[19:23:04] aye right
[19:23:18] once there are partitions i guess you don't need hdfs as much
[19:23:18] hm
[19:23:39] ottomata currently our pipelines are local fs based (historic reasons), but we want to deprecate that in favour of keeping everything in HDFS
[19:24:19] i suppose i had a bit of an ancillary goal to organize everything into hive where possible, seems easier to share between things. We do have a couple things with the hdfs cli command, using this: https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/airflow/plugins/wmf_airflow/hdfs_cli.py
[19:25:00] interesting!
[19:25:06] something like an rm becomes PythonOperator to invoke HdfsCliHook.rm(...)
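The "wrap the hdfs CLI" approach mentioned at 19:24:19 and 19:25:06 can be sketched as follows. This is a hypothetical stand-in, not the code in the linked hdfs_cli.py: the function names and parameters are assumptions for illustration, and only the standard `hdfs dfs -rm` flags (`-r`, `-f`) are used.

```python
import subprocess
from typing import List


def hdfs_rm_command(path: str, recursive: bool = False, force: bool = False) -> List[str]:
    """Build the argv for `hdfs dfs -rm` without running it."""
    cmd = ["hdfs", "dfs", "-rm"]
    if recursive:
        cmd.append("-r")  # delete directories and their contents
    if force:
        cmd.append("-f")  # do not fail if the path does not exist
    cmd.append(path)
    return cmd


def hdfs_rm(path: str, recursive: bool = False, force: bool = False) -> None:
    """Run the removal; raises CalledProcessError on a non-zero exit.

    Assumption: the `hdfs` binary is on PATH and the caller already holds
    whatever credentials the cluster requires.
    """
    subprocess.run(hdfs_rm_command(path, recursive, force), check=True)
```

In a DAG, a function like `hdfs_rm` could be passed as the `python_callable` of a `PythonOperator`, with the path supplied through `op_kwargs`. Keeping the command builder separate from the subprocess call makes the argument handling easy to check without a Hadoop client installed.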
[19:25:28] Analytics, Internet-Archive, Platform Engineering: Replace Airflow's HDFS client (snakebite) with pyarrow - https://phabricator.wikimedia.org/T284566 (Ottomata) For reference: https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/airflow/plugins/wmf_airflow/hdfs_cli.py
[19:25:56] using actual pyarrow would be nice, but it wasn't super easy when i looked at it and this was fairly easy to write and already worked :)
[19:27:51] Analytics, SRE, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (JAnstee_WMF) I deduced from your question that one terminal location was right and the other one wrong and have now been able to authenticate in jupyter - thanks...
[19:39:56] Analytics, SRE, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Dzahn) Open→Resolved Tentatively closing since this sounds like issues are resolved. Feel free to reopen it if there is anything else missing.
[19:46:37] Analytics: [SPIKE] analytics-airflow jobs development - https://phabricator.wikimedia.org/T284172 (razzi) Following up from our airflow hang yesterday, here's a working plugin import (based off https://stackoverflow.com/a/66479399/1636613) From $AIRFLOW_HOME: plugins/hdfs_plugin/__init__.py: ` from airflow....
[19:50:15] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (razzi) @Marostegui we're ready to migrate over, so I'll mark this as done on our end and close it. Thanks for your help!
[19:50:49] Analytics, SRE, SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (Ottomata) :)
[19:53:58] razzi: mforns wanna continue the hdfs test in 7 mins (razzi and I usually have a sync at that time anyway)
[19:55:56] ottomata: sgtm
[20:13:48] Analytics: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (cchen)
[20:45:32] ottomata: are you still on it_
[20:45:34] ?
[20:45:43] ya
[20:45:47] need help
[20:45:48] bc
[20:45:49] ok
[21:03:49] Analytics: General Usage statistics for AQS - https://phabricator.wikimedia.org/T284610 (Milimetric)
[21:04:37] Analytics: General Usage statistics for AQS - https://phabricator.wikimedia.org/T284610 (Milimetric) If you run this query on presto: ` select month, if(split_part(uri_path, '/', 5) = 'unique-devices', split_part(uri_path, '/', 5), concat(split_part(uri_path, '/', 5), ' ',...
[21:49:00] Analytics: [SPIKE] analytics-airflow jobs development - https://phabricator.wikimedia.org/T284172 (Ottomata) @mforns @razzi and I were able to get a working test of LocalExecutor + pyarrow.fs.HadoopFileSystem in a DAG working: `lang=python """ Create 10 tasks that use HadoopFileSystem that should run in par...
[21:52:02] Analytics, Internet-Archive, Platform Engineering: Replace Airflow's HDFS client (snakebite) with pyarrow - https://phabricator.wikimedia.org/T284566 (Ottomata) See https://phabricator.wikimedia.org/T284172#7144227 for an example of how to use pyarrow.fs.HadoopFileSystem to connect to HDFS (I did not...
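For readers without access to the truncated Phabricator example, an existence check with pyarrow.fs.HadoopFileSystem might look roughly like the sketch below. It is not the DAG from T284172: the helper names are hypothetical, the `_SUCCESS` flag file is the common Hadoop convention assumed here for a sensor-style check, and the code assumes libhdfs, a Hadoop CLASSPATH and (on a Kerberized cluster) a valid ticket are available at call time.

```python
import posixpath


def success_flag_path(dataset_dir: str) -> str:
    """Return the path of the `_SUCCESS` flag inside `dataset_dir`.

    Assumption: the producing job writes a `_SUCCESS` file when it finishes.
    """
    return posixpath.join(dataset_dir, "_SUCCESS")


def hdfs_path_exists(path: str, host: str = "default") -> bool:
    """Check whether `path` exists on HDFS using pyarrow.

    pyarrow is imported lazily so this module can be loaded on machines
    that do not have it installed. With host="default", pyarrow reads
    fs.defaultFS from the local Hadoop configuration.
    """
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(host)
    info = hdfs.get_file_info(path)
    return info.type != fs.FileType.NotFound
```

A sensor-style task could then poll `hdfs_path_exists(success_flag_path(some_dir))` until it returns True, which is the kind of use the 19:22:07 message describes for the Airflow HDFS integration.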