[08:19:27] in case no one is aware, we're getting about cronspam from root on stat1005, from a job " /usr/local/bin/published-sync -q" It's about one every 15 minutes, can someone have a look today?
[08:21:14] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (elukey) To keep archives happy - I took the liberty to disable cgi/cgid httpd modules and configs on an-web1001, those were related to the execution of old p...
[08:21:45] !log disable mod_cgi/mod_cgid on an-web1001 (and remove cgi-perl related httpd configs/settings)
[08:21:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:15:56] (CR) Milimetric: [V: +2 C: +2] Stop retaining all GuidedTour events [analytics/refinery] - https://gerrit.wikimedia.org/r/715967 (https://phabricator.wikimedia.org/T288416) (owner: Milimetric)
[09:17:24] (CR) Joal: [C: +1] "All good - merge as you need!" [analytics/refinery] - https://gerrit.wikimedia.org/r/715967 (https://phabricator.wikimedia.org/T288416) (owner: Milimetric)
[09:36:34] Heya btullis - How are you doing today? Any news for me on our beloved cassandra thingy?
[09:47:04] !log deployed refinery to sync sanitize allowlist, deleting event_sanitized data per decision in the task
[09:47:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:51:29] joal: Working on it now. Will let you know asap.
[09:51:46] ack btullis - thanks a lot :)
[09:54:04] apergos: I'll look at this today. I think that it must be related to T285355 so ottomata may understand immediately what the issue is when he's online.
[09:54:05] T285355: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355
[09:54:37] thanks much!
[10:00:54] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (BTullis) We're getting cronspam from the published-sync jobs running on the client and launcher nodes, when they're running their rsync tasks and publishing...
[10:04:55] btullis: on an-web1001 there is a mess of perms :(
[10:04:58] drwxrwxr-x 4 systemd-coredump wikidev 4096 Jun 23 2020 stat1008
[10:05:34] the dir is /srv/published-rsynced/, the user should be "stats"
[10:08:14] elukey: Yes, spotted that. I'm just trying to work out how best to fix quickly. ryankemper and ottomata have been working on it, so I don't necessarily want to tread on their toes, but equally the cronspam should stop, so I'm trying to look at the best option. The user is specified here, but without a fixed uid.
[10:08:14] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/statistics/manifests/user.pp#13
[10:09:06] There's a uid clash with an existing system user on an-web1.
[10:09:13] https://www.irccloud.com/pastebin/qqLmxZIg/
[10:09:15] the script uses hard-links heavily, I recall to have cleaned up the dir a lot (some duplication occurs)
[10:09:59] I think that we can chown stats: -R /srv/published-rsynced
[10:10:18] I think that I can just change the uid/gid for `systemd-coredump` and then change the uid/gid for stats, then chown the existing files/directories.
[10:10:20] mmm there is the "2" dir that is weird
[10:10:42] yeah maybe find -user etc.. is better, to limit it to systemd-coredump
[10:11:05] mmm wait why changing uids?
[10:12:07] I was thinking to
[10:12:07] sudo find /srv/published-rsynced -user systemd-coredump -exec chown stats {} \;
[10:12:25] it should fix the problem in theory
[10:12:35] btullis: --^
[10:15:11] OK. I was just trying to work out whether it needs to be the same uid on thorium and an-web1001, given that they're both rsync targets at the moment.
But I think you're right.
[10:16:45] ah okok!
[10:16:51] so from the script on the stat100x hosts I see
[10:16:52] dest='analytics-web.discovery.wmnet::published-destination/stat1008/'
[10:17:04] that points to an-web1001, so we should be fine
[10:17:12] (last famous words :D)
[10:17:21] Yeah, they're not both rsync targets. I haven't been following the ticket closely enough.
[10:17:22] Analytics-Radar, Privacy Engineering, WMDE-Analytics-Engineering, Wikidata, Wikidata Analytics: Privacy Policy Review for Global South Wikidata edits and active editors datasets - https://phabricator.wikimedia.org/T291186 (GoranSMilovanovic)
[10:18:16] !log btullis@an-web1001:~$ sudo find /srv/published-rsynced -user systemd-coredump -exec chown stats {} \;
[10:18:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:20:34] Yeah, so it looks like the rsync runs from thorium to an-web1001 just preserved the uid/gid values of the files, instead of chowning them to the stats user.
[10:50:25] last cronspam was > 30 mins ago so whatever you did, looks like it worked! \o/
[11:05:54] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (BTullis) I chowned the files on an-web1001 with `sudo find /srv/published-rsynced -user systemd-coredump -exec chown stats {} \;` I hesitated to make sure t...
[11:06:20] apergos: Thanks for confirming.
[11:13:31] joal: I have finished importing the new snapshot of `local_group_default_T_pageviews_per_project_v2/data`, created after its full repair.
[12:15:16] Ack btullis ! testing now
[12:16:20] btullis: the test that was failing previously now doesn't !
\o/
[12:17:47] Relaunching a long running batch of tests across dates, but I feel confident the repair did the job
[12:18:21] btullis: This makes it for a repair + re-load of all data tables (except local_group_default_T_pageviews_per_project_v2/data, already done)
[12:18:29] I'm sorry for the overwork btullis
[12:35:21] joal: No need to apologise :-) I'll draw up another plan then and hopefully we can run a full repair of either aqs1004 or aqs1007 over the weekend.
[12:35:42] that'd be great btullis :)
[12:38:01] The question in my mind is whether we re-use the `transfer.py` method that we did last time, or whether we look at implementing a temporary puppet change to enable us to use rsync.
[12:38:45] btullis: I can't really help here as you're deep in SRE-land :) You should confirm with Andrew :)
[12:39:26] Will do.
[13:29:14] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (elukey) >>! In T285355#7361119, @elukey wrote: > Side note: the following httpd config seems very stale and not used, can we get rid of it? (I recall I had...
[13:39:54] ottomata: good morning - I have some nits for you on the events tickets if you have a minute :)
[14:21:56] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (Ottomata) Thank you both! > I hesitated to make sure that we didn't have to keep uids in sync Ah, we should probably do this. We recently gained the ability...
[14:31:48] joal: i took today off to get a bunch of errands done so might not be able to go deep!
[14:31:59] if you like , wanna send me an email?
or we can talk monday
[15:05:44] /buffer 17
[15:05:53] sorry
[15:07:40] 🙂 I thought we'd started playing https://en.wikipedia.org/wiki/Word_Association
[15:10:46] Analytics-Clusters, Cassandra, Data-Engineering, Data-Engineering-Kanban, and 3 others: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (BTullis) We have decided to use rsync for the next transfer from the v2 cluster to the v3 cluster. As such I'm proposing to c...
[15:15:01] !log btullis@aqs1004:~$ sudo nodetool-a repair --full && sudo nodetool-b repair --full (T249755)
[15:15:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:15:06] T249755: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755
[15:20:45] joal: I have started the full repair of aqs1004 ^
[15:22:06] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Deploy an-test-coord1002 to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (BTullis)
[15:22:29] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, vm-requests: Site: Eqiad - 1 VM request for analytics test cluster - coordinator replica role - https://phabricator.wikimedia.org/T289664 (BTullis) Open→Resolved Cookbook completed successfully, without a...
[15:31:52] Analytics: Agree on a repository structure for Airflow-related code - https://phabricator.wikimedia.org/T290664 (mforns) Thanks @ACraze for your thoughts! I think we people who spoke in this task have a common understanding and preference for option 1. People I spoke outside the task, are also not opposed to...
[15:37:49] Analytics: Agree on a repository structure for Airflow-related code - https://phabricator.wikimedia.org/T290664 (mforns) Ideas for naming the repo?
[15:54:22] Analytics-Clusters, Cassandra, Data-Engineering, Data-Engineering-Kanban, and 3 others: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (BTullis) I think that we might need to remove those previously created snapshots, because the usage on aqs1004 and aqs1007 is...
[16:03:00] !log Cleared all snapshots on aqs100[47] to reclaim space with nodetool-[ab] clearsnapshot (T249755)
[16:03:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:03:04] T249755: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755
[16:07:18] Analytics-Clusters, Cassandra, Data-Engineering, Data-Engineering-Kanban, and 3 others: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (BTullis) Disk space reclaimed. {F34646232}
[16:08:53] (CR) Mforns: [C: +1] "Oh, @MNeisler, thanks for the clarification." [analytics/refinery] - https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) (owner: MNeisler)
[16:10:36] (CR) Mforns: [C: +1] "Please, @MNeisler, let me know if this is the final version of this change, and I will merge." [analytics/refinery] - https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) (owner: MNeisler)
[16:35:23] Analytics: Reportupdater should stop running a job after some fixed number of failures - https://phabricator.wikimedia.org/T284037 (mforns) @awight We are prioritizing the Airflow project, that will allow us to schedule jobs like reportupdater ones in a more robust and flexible way, and have a UI to manage/t...
[16:37:18] Analytics: Anomaly detection alarms for the edit event stream - https://phabricator.wikimedia.org/T250845 (mforns) p:High→Medium
[16:39:39] Analytics: Finish document for the Airflow POC - https://phabricator.wikimedia.org/T285695 (mforns) This was done in: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Workflow_management_tools_study
[16:39:54] Analytics: Finish document for the Airflow POC - https://phabricator.wikimedia.org/T285695 (mforns) Open→Resolved
[16:39:56] Analytics, Analytics-Kanban: [SPIKE] analytics-airflow jobs development - https://phabricator.wikimedia.org/T284172 (mforns)
[16:48:22] (CR) Mforns: [C: +1] "LGTM +1" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/714875 (https://phabricator.wikimedia.org/T290074) (owner: Krinkle)
[16:51:04] (CR) Mforns: [C: +1] "LGTM!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/719290 (https://phabricator.wikimedia.org/T290469) (owner: Joal)
[19:21:49] Analytics-Radar, MediaWiki-API, Patch-For-Review, Platform Team Initiatives (Modern Event Platform (TEC2)), User-Addshore: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321 (Mholloway) a:Mholloway→None
[19:22:16] Analytics, Analytics-Kanban, Event-Platform, Product-Analytics: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (Mholloway) Open→Resolved
[20:00:19] Analytics-Radar, Product-Analytics: Do the messages left for unregistered or logged-out IP editors get read by those editors?
- https://phabricator.wikimedia.org/T291297 (Whatamidoing-WMF)
[20:24:30] Analytics, Product-Analytics, Editing-team (Tracking): Add MariaDB replicas to Superset - https://phabricator.wikimedia.org/T291195 (mpopov) @elukey: Megan's most pressing use case is [[ https://www.mediawiki.org/wiki/Extension:DiscussionTools/discussiontools_subscription_table | discussiontools_subc...
[20:42:11] Analytics-Radar, Privacy Engineering, WMDE-Analytics-Engineering, Wikidata, Wikidata Analytics: Privacy Policy Review for Global South Wikidata edits and active editors datasets - https://phabricator.wikimedia.org/T291186 (JFishback_WMF) a:Htriedman
[20:42:34] Analytics-Radar, Privacy Engineering, WMDE-Analytics-Engineering, Wikidata, Wikidata Analytics: Privacy Policy Review for Global South Wikidata edits and active editors datasets - https://phabricator.wikimedia.org/T291186 (JFishback_WMF) p:Triage→Medium