[00:24:33] (CR) Ottomata: "One nit but looks great to me!" (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[00:48:33] Analytics, Analytics-Kanban, Platform Engineering, Research, Patch-For-Review: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (Ottomata) Ok! @fkaelin @gmodena @Clarakosi I've set up airflow instances for you all! Consider them stil...
[03:25:11] Hi, I got an email from a research collaborator in EPFL saying I am using up most cores in stat1008 with a spark job. I wasn't actually running any job during that time. I usually run jobs from jupyter notebooks; is it possible something kept running in the background or something? Can those be killed?
[03:25:12] Also, where can I check if my jobs are eating up too much compute?
[04:08:03] So I killed it from htop, not sure if this is how it should be done and also not sure what caused it. A bit confused.
[05:58:18] (PS5) Sharvaniharan: Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597
[06:02:06] (CR) Sharvaniharan: "@Ottomata I have a minor question. The action_icon field can be an empty string sometimes and selection_token could be null. How do I spec" (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[06:35:48] tanny411: hi! So yes, if you run a job from jupyter notebooks there is usually a spark driver running on the node. What I suspect happened is that you ran spark in local mode (not using yarn or similar), is it possible? In that case most of the compute is done locally
[06:36:57] I'm sure I used yarn since I use wmfdata to start sessions in my notebooks.
[06:38:06] tanny411: mmm then maybe the driver needed to collect a lot of data on the node or similar?
[06:39:26] the host was used but not completely https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=stat1008&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&from=now-6h&to=now
[06:39:42] (you can see cpu usage and load)
[06:41:03] elukey: could be collect, but I don't recall collecting huge data. My kernels were also terminated, so I believe the jobs got ghosted somehow (probably when connection to stat was lost?)
[06:43:05] tanny411: in theory the notebook can keep running without any problem, but it was definitely doing something in the background, maybe it was stuck somehow.. next time if it happens ping us so we can check (if anybody is around)
[06:44:18] elukey: will ping, thanks!
[09:50:50] Analytics, Analytics-Wikistats: Translations? - https://phabricator.wikimedia.org/T287661 (Sabeloga)
[09:58:01] Analytics-Radar, ChangeProp, Event-Platform, Platform Engineering, and 4 others: Run EventBus tests in MediaWiki core CI - https://phabricator.wikimedia.org/T257583 (daniel)
[09:58:59] Analytics-Radar, ChangeProp, Event-Platform, Platform Engineering, and 4 others: Run EventBus tests in MediaWiki core CI - https://phabricator.wikimedia.org/T257583 (daniel) Tagging this as a Code Jam candidate
[11:40:38] milimetric: continuing our Gerrit thread, I checked and my team can't CR+2 in reportupdater-queries yet. I think it's because we need "Label Code-Review -2 2" like on https://gerrit.wikimedia.org/r/admin/repos/mediawiki,access
[11:41:57] (PS1) Awight: Review access change [analytics/reportupdater-queries] (refs/meta/config) - https://gerrit.wikimedia.org/r/708633
[11:42:16] ^ that might fix
[11:46:41] morning!
[11:53:49] (CR) Milimetric: [V: +2 C: +2] Review access change [analytics/reportupdater-queries] (refs/meta/config) - https://gerrit.wikimedia.org/r/708633 (owner: Awight)
[11:55:02] thx awight! Where did you find what permission that was? I was trying to copy the permissions we had and I guess this one must've been inherited from somewhere I didn't look
[11:56:27] mediawiki repo makes a lot more sense to look at, should've done that
[12:26:54] Analytics-Radar, ChangeProp, Event-Platform, Platform Engineering, and 5 others: Run EventBus tests in MediaWiki core CI - https://phabricator.wikimedia.org/T257583 (daniel)
[12:43:05] milimetric: Basically throwing knives in the dark :-)
[12:43:44] These are reasonable docs though: https://gerrit-review.googlesource.com/Documentation/access-control.html
[13:38:54] Analytics-Radar, ChangeProp, Event-Platform, Platform Engineering, and 5 others: Run EventBus tests in MediaWiki core CI - https://phabricator.wikimedia.org/T257583 (Ottomata) +1
[13:51:33] Analytics, Analytics-Wikistats: wikistats: montly pageview dumps are not bz2 files - https://phabricator.wikimedia.org/T287684 (Radim.kubacki)
[13:55:15] (CR) Ottomata: "> @Ottomata I have a minor question. The action_icon field can be an empty string sometimes and selection_token could be null. How do I sp" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[13:55:53] (CR) Ottomata: "> What you want is an optional value" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[14:11:54] (PS6) Fdans: Adapt wiki selector to allow more than one wiki [analytics/wikistats2] - https://gerrit.wikimedia.org/r/700098 (https://phabricator.wikimedia.org/T285050)
[14:12:27] (CR) jerkins-bot: [V: -1] Adapt wiki selector to allow more than one wiki [analytics/wikistats2] - https://gerrit.wikimedia.org/r/700098 (https://phabricator.wikimedia.org/T285050) (owner: Fdans)
[14:19:32] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (BTullis)
[14:20:18] (PS7) Fdans: Adapt wiki selector to allow more than one wiki [analytics/wikistats2] - https://gerrit.wikimedia.org/r/700098 (https://phabricator.wikimedia.org/T285050)
[14:21:14] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (BTullis) I have deployed the CNAME patch and checked that it has been correctly applied. ` btullis@marlin:~/wmf/dns$ for i in 0...
[14:28:21] (CR) Awight: "Yes this patch is still recommended, it's lingering just cos I'd like the data owner to merge so they can monitor." [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/682747 (https://phabricator.wikimedia.org/T193169) (owner: Awight)
[14:51:13] Analytics, Analytics-Kanban, Platform Engineering, Research, Patch-For-Review: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (Ottomata) I need to follow up and make sure database backups and replication work properly, but I think I...
[14:57:51] Analytics: Delete HDFS raw *_camus directories 60 days after July 12 (after 2021-09-10) - https://phabricator.wikimedia.org/T287685 (Ottomata)
[15:05:08] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui)
[15:26:09] mforns: o/ yt?
[15:26:17] i'm doing some puppet cleanup and am wondering about
[15:26:37] https://gerrit.wikimedia.org/r/c/operations/puppet/+/552082/4/modules/profile/manifests/analytics/refinery/job/druid_load.pp#47
[15:26:41] https://phabricator.wikimedia.org/T229674
[15:26:52] and if I should remove the declaration of the absented netflow-sanitization job
[15:35:52] Analytics, Analytics-Kanban, Patch-For-Review: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (Ottomata) Some documentation: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin
[15:36:57] Analytics, Analytics-Kanban, Patch-For-Review: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (Ottomata) Removing raw data we have stopped importing: ` sudo -u hdfs hdfs dfs -rm -R /wmf/data/raw/eventlogging_client_side sudo -u hdfs hdfs dfs -rm -R /wmf/data/raw/mediawiki_job `
[15:38:50] Analytics, Analytics-Kanban, Patch-For-Review: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (Ottomata) Removing old camus work job state dirs ` sudo -u hdfs hdfs dfs -rm -R /wmf/camus `
[15:50:41] Hi all, good morning
[15:51:41] We are almost at 100% java services restarted for T283067 - Service restarts of analytics services for Java security updates (8/11). The last ones remaining are aqs, druid, and an-druid
[15:52:13] To my knowledge, all of these services are well clustered and the cookbooks will restart them gracefully, so we can do so at any time (unless we're serving unusually high traffic)
[15:53:29] Analytics, Analytics-Kanban, WMDE-TechWish: Deployment access request for some analytics repos - https://phabricator.wikimedia.org/T274880 (awight) >>! In T274880#7221451, @Milimetric wrote: > This should be done, but I saw reports of folks not being able to +2 despite being in the proper gerrit grou...
[15:53:56] Unless anybody objects, I will restart them over today and tomorrow (I will post the cookbook commands before I run them, and wait for somebody to confirm)
[15:54:33] (PS1) Ottomata: Remove refinery-camus module [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708786 (https://phabricator.wikimedia.org/T271232)
[15:56:14] (PS1) Ottomata: Refine - replace default formatters with gobblin convention [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232)
[15:56:16] (CR) jerkins-bot: [V: -1] Remove refinery-camus module [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708786 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:56:30] (PS2) Ottomata: Refine - replace default formatters with gobblin convention [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232)
[15:57:25] (CR) jerkins-bot: [V: -1] Refine - replace default formatters with gobblin convention [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[15:58:23] (PS2) Ottomata: Remove refinery-camus module [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708786 (https://phabricator.wikimedia.org/T271232)
[16:00:41] ottomata: wow --^
[16:00:50] camus clean up?? \o/
[16:00:58] (PS3) Ottomata: Refine - replace default formatters with gobblin convention [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708787 (https://phabricator.wikimedia.org/T271232)
[16:01:03] elukey: surely! :)
[16:01:23] razzi: +1
[16:01:24] ty
[16:01:52] razzi: +1, but IIRC druid public was already done no? (it only misses zookeeper restarts)
[16:04:02] Analytics-Clusters, Analytics-Kanban, SRE: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (Ottomata) Open→Resolved
[16:04:12] Analytics, Analytics-Kanban: Refactor profile::analytics::cluster::users - https://phabricator.wikimedia.org/T287063 (Ottomata) Open→Resolved
[16:04:20] Analytics, Analytics-Kanban, Event-Platform, Services, and 2 others: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (Ottomata) Open→Resolved
[16:28:28] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Reduce manual kinit frequency on stat100x hosts - https://phabricator.wikimedia.org/T268985 (BTullis) This appears to work as expected. The only question I have is whether the users would prefer more feedback about the renewed ticket lifes...
[16:35:32] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Reduce manual kinit frequency on stat100x hosts - https://phabricator.wikimedia.org/T268985 (elukey) I think that the two lines ` Renewing existing Kerberos ticket in the credential cache: krenew: renewing credentials for btullis@WIKIMEDI...
[17:02:30] (CR) Ottomata: "This change is ready for review." (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/702668 (https://phabricator.wikimedia.org/T285692) (owner: Mforns)
[17:32:16] (PS8) Fdans: Adapt wiki selector to allow more than one wiki [analytics/wikistats2] - https://gerrit.wikimedia.org/r/700098 (https://phabricator.wikimedia.org/T285050)
[17:32:51] (PS13) Mforns: Add airflow DAG for anomaly detection (POC) [analytics/refinery] - https://gerrit.wikimedia.org/r/702668 (https://phabricator.wikimedia.org/T285692)
[17:49:14] Analytics, Dumps-Generation: Monthly Wikimedia pageviews dumps cann't be decompressed - https://phabricator.wikimedia.org/T287565 (fdans) a:fdans
[17:49:20] Analytics, Analytics-Wikistats: Translations? - https://phabricator.wikimedia.org/T287661 (fdans) a:fdans
[17:49:27] Analytics, Analytics-Wikistats: wikistats: montly pageview dumps are not bz2 files - https://phabricator.wikimedia.org/T287684 (fdans) a:fdans
[17:51:11] (PS1) Fdans: Release 2.9.1 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/708802
[17:52:04] (CR) Fdans: [V: +2 C: +2] "Self-merging for deployment" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/708802 (owner: Fdans)
[17:54:12] (Merged) jenkins-bot: Release 2.9.1 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/708802 (owner: Fdans)
[18:06:04] AQS metrics: https://grafana.wikimedia.org/d/000000526/aqs?orgId=1
[18:06:33] Traffic is low and metrics look healthy, going to kick off the java restarts
[18:12:18] !log sudo cookbook sre.aqs.roll-restart aqs
[18:12:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:12:44] I'm at the >>> Please test aqs on the canary step, going to see if I can find how to do that on wikitech
[18:12:54] it's probably not necessary, but I want to know how to do this generally
[18:14:18] (PS1) Sharvaniharan: Minor fix [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708808
[18:14:53] (CR) jerkins-bot: [V: -1] Minor fix [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708808 (owner: Sharvaniharan)
[18:17:19] aqs canary seemed ok, proceeding with the rest of the nodes
[18:18:09] (CR) Sharvaniharan: "> Patch Set 5:" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[18:18:51] (Abandoned) Sharvaniharan: Minor fix [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708808 (owner: Sharvaniharan)
[18:20:24] (CR) Mholloway: [C: +1] Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[18:21:05] Ok, the cookbook completed, but I realized / remembered aqs is a nodejs service, and it's cassandra on the cluster that runs java
[18:21:19] razzi@aqs1004:~$ sudo lsof -Xd DEL
[18:21:26] java 25398 cassandra DEL REG 9,0 659856 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/nashorn.jar
[18:21:26] ...
[18:22:11] Luckily there's a cassandra cookbook too! cookbooks/sre/cassandra/roll-restart.py
[18:23:39] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Reduce manual kinit frequency on stat100x hosts - https://phabricator.wikimedia.org/T268985 (BTullis) Thanks @elukey. Yes, I totally get what you mean about the lack of clarity. I just couldn't see (without looking at the source) what the...
[18:24:15] (CR) Mholloway: "Ah, sorry, forgot about the open discussion above." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[18:24:54] "restbase" is the aqs cassandra cluster, right? https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1
[18:30:59] (PS6) Mholloway: Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[18:32:18] (PS7) Mholloway: Migrate MobileWikiAppNotificationInteraction from legacy to MEP [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[18:33:54] (CR) Mholloway: [C: +1] "> Patch Set 5:" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[18:35:55] (CR) Mholloway: [C: +1] "> Patch Set 7: Code-Review+1" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[18:38:41] (CR) Sharvaniharan: "> Patch Set 7: Code-Review+1" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (owner: Sharvaniharan)
[18:41:28] (PS1) Ottomata: Remove references to camus [analytics/refinery] - https://gerrit.wikimedia.org/r/708816 (https://phabricator.wikimedia.org/T271232)
[18:42:27] (PS2) Ottomata: Remove references to camus [analytics/refinery] - https://gerrit.wikimedia.org/r/708816 (https://phabricator.wikimedia.org/T271232)
[18:47:21] (PS8) Sharvaniharan: Migrate MobileWikiAppNotificationInteraction from legacy to MEP Bug: T287652 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/708597 (https://phabricator.wikimedia.org/T287652)
[18:55:57] Analytics, Analytics-Wikistats: Translations? - https://phabricator.wikimedia.org/T287661 (fdans) Open→Resolved Hi @Sabeloga, my apologies: until today there were some issues related to the deployment of Wikistats that prevented the release of new languages. I haven't had the bandwidth to deal wi...
[18:58:52] Analytics, Analytics-Kanban, Patch-For-Review: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (Ottomata) Alright, I've gone through wikitech and updated relevant references to Camus. There are 3 outstanding patches to refinery-source and refinery about removing Camus. Once tho...
[19:08:53] Going to get lunch then getting tea, will be out for a couple hours, but will have my phone so ping if you need me!
[19:31:33] Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (Ottomata) a:klausman→razzi
[19:32:48] Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (Ottomata)
[19:33:05] Analytics-Clusters, Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (Ottomata)
[20:18:19] (CR) Mholloway: "> Patch Set 3:" (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320) (owner: Mholloway)
[20:30:12] (CR) Ottomata: "check out the normalized_host field on the wmf.webrequest table:" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320) (owner: Mholloway)
[21:33:00] (PS4) MewOphaswongse: Add a link: Update schema to support edit mode and link inspector toggles [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115)
[23:09:18] (CR) Nettrom: [C: +1] "Latest patch set also looks good to me!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115) (owner: MewOphaswongse)
[23:17:59] (CR) Nettrom: [C: -1] "Hang on, do we also want to add T287121 to this patch so we only update the version once?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115) (owner: MewOphaswongse)
[23:27:52] Analytics-Radar, Product-Analytics, Growth-Team (Current Sprint): Add geolocation information to Growth schemas - https://phabricator.wikimedia.org/T287121 (nettrom_WMF) @mewoph : Can we add the reference to `client_ip` to [[https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/704402]] so we al...