[00:22:26] PROBLEM - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:24:04] (PS14) Andrew Bogott: test query routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720353
[03:24:53] (CR) jerkins-bot: [V: -1] test query routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720353 (owner: Andrew Bogott)
[03:26:54] (CR) Andrew Bogott: "recheck" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720353 (owner: Andrew Bogott)
[05:57:56] (PS15) Andrew Bogott: test query routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720353
[05:58:01] (PS1) Andrew Bogott: Test run routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720843
[07:21:16] Analytics, Analytics-Kanban: Fix `wmf.editors_daily` data deletion - https://phabricator.wikimedia.org/T290093 (JAllemandou) I confirm data got deleted!
[08:57:28] Hi btullis - How are you doing on the cassandra loading front? May I help with anything?
[09:18:29] Hi joal. I've completed loading the smaller tables from 3 out of 4 of the snapshots. Smaller meaning - all except `local_group_default_T_pageviews_per_article_flat/data`.
[09:18:57] woah btullis - Even mediarequest_per_file!
[09:19:07] I'm about to start loading the smaller snapshots from the 4th snapshot, at which time I'll be ready for you to start consistency checking all of them.
[09:19:07] this one is big as well!
[09:19:16] \o/
[09:19:40] Yes, `local_group_default_T_mediarequest_per_file/data` has been loaded from 3 out of 4 snapshots too.
[09:20:31] this is great btullis :)
[09:20:50] There was the one unexpected error/warning from `local_group_default_T_pageviews_per_project/data` on `aqs1011` / `cassandra-a` so I've marked it to return to and check.
[09:21:13] ack - we might be willing to rerun
[09:22:30] Thanks. I'll start the loading from `aqs1011`/ `cassandra-b` now and let you know when I've done everything up to the *start* of `local_group_default_T_mediarequest_per_file/data`.
[09:24:16] awesome btullis - I'll also wait for the rerun of the one with the unexpected error before testing :)
[09:25:00] 👍
[09:47:19] We get the same error from that table on the 4th snapshot too.
[09:47:23] https://www.irccloud.com/pastebin/u9HRkxvf/
[09:48:29] I can't read --^ :S
[09:48:43] Deleted the pastebin, because it had our (temporary) password.
[09:48:53] `Skipping file la-649-big-Data.db: table local_group_default_T_pageviews_per_project.data doesn't exist`
[09:49:14] Ah yes!
[09:49:17] of course
[09:49:38] the table that doesn't exist :) I should have been more careful - This is the one that we don't use anymore
[09:49:45] no problem :)
[09:49:49] Cool.
[09:55:55] I'm still a bit confused as to why we were able to snapshot data from a non-existent table. I guess that if we'd run `nodetool cleanup` before taking the snapshot on the source, it might not have dumped the data.
[10:17:38] I think the confusion comes from us deliberately not creating that table on the new cluster, but it still exists on the old cluster - I could be wrong
[10:18:45] Oh right. I see. I had forgotten that the destination tables were manually created. I thought that `sstableloader` created them as required.
[10:50:55] joal: I think that all of these tables have been loaded now.
[10:50:59] https://www.irccloud.com/pastebin/Sai1ACFN/
[10:51:46] I'm about to kick off the 4th import of `local_group_default_T_mediarequest_per_file/data`
[11:08:25] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis) Thanks for that explanation.
It makes sense, I just didn't want to think that we h...
[11:36:49] Starting my QA checks btullis - Thanks for the heads up :)
[11:45:51] hm - I don't know if the concern is related to cassandra-compactions or something else, but I have differences in data between old and new for the supposedly fully loaded tables :S
[11:46:47] Oh dear. Missing data, or duplicate data, or something else?
[11:46:58] quite some missing data :S
[11:47:54] Shall we head to the batcave?
[11:48:03] we can do that yeah
[11:51:04] Analytics, Data-Engineering, Growth-Team, Metrics-Platform, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (kostajh) Moving this to #growth-team's triaged column as @Mholloway is working on this; if there is...
[11:57:41] btullis: here is an example of mediarequest giving different results: curl http://aqs1010-a.eqiad.wmnet:7232/analytics.wikimedia.org/v1/mediarequests/aggregate/all-referers/audio/all-agents/daily/20151201/2015123
[11:57:46] btullis: here is an example of mediarequest giving different results: curl http://aqs1010-a.eqiad.wmnet:7232/analytics.wikimedia.org/v1/mediarequests/aggregate/all-referers/audio/all-agents/daily/20151201/20151231
[11:57:55] sorry (+1 at the end)
[11:58:18] Thanks. I'm checking now.
[11:58:19] You need to change the host from 1010-a (new) to 1004-a (old)
[11:58:27] Thanks a lot btullis
[12:02:51] I think I have found the (i.e. my) error. Checking now.
[12:02:59] Ack btullis
[12:03:17] that's good news :) if it's a mistake it means it's not a system problem :)
[12:17:53] I believe that I've fixed it all now. Here's that mediarequests query returning the same information from both clusters.
[12:17:57] https://www.irccloud.com/pastebin/jNbSjA5n/
[12:18:34] btullis: starting mforns mighty script :)
[12:22:21] I confirm it looks good btullis :)
[12:22:27] thanks a lot for the quick fix!
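A minimal sketch of the old-versus-new comparison discussed between 11:57 and 12:22, assuming the old host is reachable as `aqs1004-a.eqiad.wmnet` on the same port as the new one (the log only says "1004-a"). The helper names `build_url`, `fetch_json` and `responses_match` are hypothetical; this is not the script mentioned at 12:18:34.

```python
import json
from urllib.request import urlopen

# New host, port and path come from the curl command in the log.
# The old host name and its port are assumptions based on "1004-a (old)".
NEW_HOST = "aqs1010-a.eqiad.wmnet"
OLD_HOST = "aqs1004-a.eqiad.wmnet"
PATH = ("/analytics.wikimedia.org/v1/mediarequests/aggregate/"
        "all-referers/audio/all-agents/daily/20151201/20151231")


def build_url(host, path=PATH, port=7232):
    """Build the AQS URL for one cluster host."""
    return f"http://{host}:{port}{path}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def responses_match(path=PATH, fetch=fetch_json):
    """Return True when old and new clusters return identical JSON."""
    old = fetch(build_url(OLD_HOST, path))
    new = fetch(build_url(NEW_HOST, path))
    return old == new
```

Calling `responses_match()` from a host inside the production network would repeat the single-query check; passing a different `path` widens it to other date ranges. The `fetch` parameter is only there so the comparison can be exercised without network access.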
[12:22:40] btullis: I'm checking over a wider time range
[13:17:34] o/ hello!
[13:17:38] o/ joal event completeness meeting?
[13:23:21] ouch ottomata - joining
[13:35:03] Hello ottomata.
[13:38:20] :]
[13:45:49] btullis: more than a year of daily data queried per month without any miss :)
[13:46:17] Great! Thanks for the update.
[13:47:09] (CR) Andrew Bogott: [C: +2] test query routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720353 (owner: Andrew Bogott)
[13:47:17] (CR) Andrew Bogott: [C: +2] Test run routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720843 (owner: Andrew Bogott)
[13:50:24] great to hear
[13:50:40] (Merged) jenkins-bot: test query routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720353 (owner: Andrew Bogott)
[13:50:46] (Merged) jenkins-bot: Test run routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720843 (owner: Andrew Bogott)
[14:23:45] (PS2) MNeisler: Add the content_translation_event stream to the allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511)
[14:35:04] (CR) MNeisler: Add the content_translation_event stream to the allowlist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) (owner: MNeisler)
[14:45:25] as FYI the mediawiki switchover codfw -> eqiad is happening right now
[14:45:52] Watching carefully and tailing along.
[14:48:44] TIL: http://listen.hatnote.com/
[14:58:09] (CR) Mforns: Add the content_translation_event stream to the allowlist (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/716339 (https://phabricator.wikimedia.org/T281511) (owner: MNeisler)
[15:22:53] (CR) Jgiannelos: Map tile state change event schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/716219 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[15:23:51] joal: just discovered it https://github.com/baineng/feast-hive
[15:24:52] (PS1) Mforns: Add WikibaseTermboxInteraction to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/721010 (https://phabricator.wikimedia.org/T290303)
[15:35:48] (CR) Mforns: [V: +2 C: +2] Add WikibaseTermboxInteraction to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/721010 (https://phabricator.wikimedia.org/T290303) (owner: Mforns)
[15:36:37] (Merged) jenkins-bot: Add WikibaseTermboxInteraction to analytics/legacy [schemas/event/secondary] - https://gerrit.wikimedia.org/r/721010 (https://phabricator.wikimedia.org/T290303) (owner: Mforns)
[15:43:32] Analytics, Analytics-EventLogging, Event-Platform, Wikidata, and 4 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (mforns)
[16:02:14] RECOVERY - Hadoop NodeManager on an-worker1096 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:06:59] Analytics, Analytics-EventLogging, Event-Platform, Wikidata, and 4 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (mforns)
[16:10:34] Analytics, SRE, ops-eqiad: Degraded RAID on
an-worker1096 - https://phabricator.wikimedia.org/T290805 (Cmjohnson) Open→Resolved Replaced the disk and added back to the array cmjohnson@an-worker1096:~$ sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0 Adapter 0: Created VD 6 Config...
[16:17:13] Analytics, Analytics-Kanban, Data-Engineering: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (razzi)
[16:21:28] Analytics-Clusters, Analytics-Kanban: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (BTullis) a:RKemper→Ottomata
[16:21:51] Analytics, Event-Platform, Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Ottomata) Thanks!
[16:32:52] Analytics, Data-Engineering, Growth-Team, Metrics-Platform, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (Ottomata) > Ottomata, in your last comment you say you're not opposed, however the patch has a -1 o...
[16:44:35] Analytics, DC-Ops, Data-Engineering, ops-eqiad: Q1:(Need By: TBD) rack/setup/install an-presto10[06-15] - https://phabricator.wikimedia.org/T290987 (RobH)
[16:44:51] Analytics, DC-Ops, Data-Engineering, ops-eqiad: Q1:(Need By: TBD) rack/setup/install an-presto10[06-15] - https://phabricator.wikimedia.org/T290987 (RobH)
[17:29:53] ben lemme know if you wanna discuss testing stuff without merging in puppet, it is possible!
[17:30:02] btullis: ^
[18:50:10] (PS1) Dave Pifke: Add TLS support [analytics/statsv] - https://gerrit.wikimedia.org/r/721044 (https://phabricator.wikimedia.org/T290131)
[19:09:37] Analytics-Clusters, Analytics-Kanban: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (Ottomata) @elukey IIRC we told them not to rely on it.
It's used for occasionally loading data, and I believe they were told to point to the internal address rather than go throug...
[19:13:02] Analytics: Check home/HDFS leftovers of fdans - https://phabricator.wikimedia.org/T290231 (Ottomata) a:Ottomata
[19:13:17] Analytics, Analytics-Kanban: Check home/HDFS leftovers of fdans - https://phabricator.wikimedia.org/T290231 (Ottomata)
[19:13:46] Analytics, Analytics-Kanban: Check home/HDFS leftovers of fdans - https://phabricator.wikimedia.org/T290231 (Ottomata) Following https://wikitech.wikimedia.org/wiki/Analytics/Ops_week#Have_any_users_left_the_Foundation? ` 15:12:37 [:/Users/otto] $ wmf-check-analytics-home fdans ====== stat1004 ======...
[19:15:38] Analytics: Check home/HDFS leftovers of jmads - https://phabricator.wikimedia.org/T290715 (Ottomata) Following https://wikitech.wikimedia.org/wiki/Analytics/Ops_week#Have_any_users_left_the_Foundation? ` ====== stat1004 ====== total 0 ====== stat1005 ====== total 700156 -rw-r--r-- 1 24076 wikidev 623224...
[19:16:08] Analytics: Check home/HDFS leftovers of gilles - https://phabricator.wikimedia.org/T290232 (Ottomata) Following https://wikitech.wikimedia.org/wiki/Analytics/Ops_week#Have_any_users_left_the_Foundation? ` 15:15:09 [:/Users/otto] $ wmf-check-analytics-home gilles ====== stat1004 ====== total 266872 -rw-rw-r...
[19:17:07] Analytics, Analytics-Kanban: Check home/HDFS leftovers of jkatz - https://phabricator.wikimedia.org/T287235 (Ottomata) a:Ottomata
[19:19:57] Analytics: Check home/HDFS leftovers of jmads - https://phabricator.wikimedia.org/T290715 (Ottomata) @AChang_WMF hi! Do you know if any of the above data needs to be kept? Are you the right person to ask? If not, who should I ask? Thanks!
[19:20:56] Analytics: Check home/HDFS leftovers of gilles - https://phabricator.wikimedia.org/T290232 (Ottomata) @Krinkle, not sure if you are the right person to ask, but do you know if there is any reason to save any of the above data?
[19:21:14] milimetric: yt? want to do some data deletion
[19:21:17] https://phabricator.wikimedia.org/T287235
[19:21:22] want another pair of eyeballs
[19:27:46] or maybe mforns ^ ?
[19:27:57] I'm here ottomata - batcave?
[19:28:01] ok! gr8!
[19:37:35] Analytics, Analytics-Kanban: Check home/HDFS leftovers of jkatz - https://phabricator.wikimedia.org/T287235 (Ottomata) Done! ` root@stat1006:/home/jkatz# rm -rf /home/jkatz/* root@stat1006:/home/jkatz# ls -la total 52 drwxr-xr-x 5 10747 wikidev 4096 Sep 14 19:31 . drwxr-xr-x 241 root wikidev 4096...
[19:42:03] (CR) Ottomata: [C: +1] Update Gobblin kafka fetch timeout to 5s [analytics/refinery] - https://gerrit.wikimedia.org/r/720317 (https://phabricator.wikimedia.org/T290723) (owner: Joal)
[19:44:01] Analytics: Check home/HDFS leftovers of gilles - https://phabricator.wikimedia.org/T290232 (Krinkle) Some of the scripts (`.py`, `.sh`, `.sql`, `.hql`) may be helpful as I'm not sure we documented all the analysis in question for some of our datasets. I don't think any of the data or other files need to be...
[19:44:05] Analytics, Performance-Team: Check home/HDFS leftovers of gilles - https://phabricator.wikimedia.org/T290232 (Krinkle)
[19:51:12] Analytics, Performance-Team: Check home/HDFS leftovers of gilles - https://phabricator.wikimedia.org/T290232 (Ottomata) Ok. @Krinkle can I copy them to your homedirs and chown them to you?
[19:53:40] I'm here now, sorry had to babysit
[19:55:18] Analytics: Check home/HDFS leftovers of jmads - https://phabricator.wikimedia.org/T290715 (AChang_WMF) I'm not really sure, but Margeigh Novotny should know. Thanks!
-- Ana Chang (she/her) Design Strategy Manager Wikimedia Foundation
[20:07:47] Analytics-Clusters, Analytics-Kanban: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (RKemper) @Ottomata Looks like I didn't make separate SAL logs when doing the eqiad / codfw (production) side of the helmfile stuff, but here's the automated logs for peace of mind...
[20:14:11] Analytics-Clusters, Analytics-Kanban: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (Ottomata) Ok new plan then: # Do another rsync from thorium # update any (non LVS) remaining references in puppet to point at analytics-web.discovery.wmnet (or an-web1001 if need...
[20:55:15] o/ ottomata: - Yes, would be great to discuss other methods of testing. Maybe we can fit it tomorrow some time?
[20:56:38] ya ping me!