[00:23:29] PROBLEM - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:26:22] Analytics: Check home/HDFS leftovers of gsingers - https://phabricator.wikimedia.org/T287845 (MoritzMuehlenhoff)
[06:10:50] (CR) Gergő Tisza: [C: +1] Suggested Edits: Update homepagemodule schema to support new mobile navigation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/705493 (https://phabricator.wikimedia.org/T268708) (owner: MewOphaswongse)
[07:15:41] Analytics, Analytics-Wikistats, I18n: Translations not getting imported into Wikistats - https://phabricator.wikimedia.org/T287661 (Aklapper)
[08:41:33] (CR) Gergő Tisza: [C: +1] Add a link: Update schema to support edit mode and link inspector toggles [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115) (owner: MewOphaswongse)
[09:14:08] (CR) Svantje Lilienthal: added template wizard sessions (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: Svantje Lilienthal)
[09:24:11] (CR) Gergő Tisza: [C: +2] Add postedit-task-refresh to analytics/legacy/helppanel [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702472 (https://phabricator.wikimedia.org/T272664) (owner: MewOphaswongse)
[09:24:49] (Merged) jenkins-bot: Add postedit-task-refresh to analytics/legacy/helppanel [schemas/event/secondary] - https://gerrit.wikimedia.org/r/702472 (https://phabricator.wikimedia.org/T272664) (owner: MewOphaswongse)
[09:27:10] (PS4) Svantje Lilienthal: added template wizard sessions [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578)
[09:28:08] (CR) Svantje Lilienthal: "> Patch Set 3:" (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: Svantje Lilienthal)
[09:49:00] (CR) Awight: [C: +1] "It looks ready to merge, but I've left one question about the output column name. If you decide to rename, it's best to do now before the" (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: Svantje Lilienthal)
[09:53:52] (PS5) Svantje Lilienthal: added template wizard sessions [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578)
[09:55:32] (CR) Svantje Lilienthal: "> Patch Set 4: Code-Review+1" (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: Svantje Lilienthal)
[11:15:59] Analytics-Clusters, Analytics-Kanban, DBA, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui)
[11:23:32] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Majavah)
[12:32:44] Quarry, Patch-For-Review: Add a stop button to halt the query - https://phabricator.wikimedia.org/T71037 (mdipietro) a:mdipietro
[12:41:37] (CR) Mholloway: [C: +2] Suggested Edits: Update homepagemodule schema to support new mobile navigation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/705493 (https://phabricator.wikimedia.org/T268708) (owner: MewOphaswongse)
[12:42:16] (Merged) jenkins-bot: Suggested Edits: Update homepagemodule schema to support new mobile navigation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/705493 (https://phabricator.wikimedia.org/T268708) (owner: MewOphaswongse)
[13:28:01] Analytics, EventStreams: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 (Ottomata) I just went back in to figure out why https://stream-beta.wmflabs.org/v2/ui/#/ wasn't showing events, but it is! Can you all see yo...
[13:35:52] Analytics, EventStreams: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 (Michaelcochez) I can see the property suggester events. @Addshore can you now also see the other events you were testing for? Last week @Ott...
[13:41:58] (PS6) Svantje Lilienthal: added template wizard sessions [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578)
[13:43:09] Analytics-Clusters: Deploy an-test-coord1002 as a Ganeti VM to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (BTullis)
[13:52:50] (CR) Awight: "PS 7: tweaking start date" (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: Svantje Lilienthal)
[13:52:57] (PS7) Awight: added template wizard sessions [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: Svantje Lilienthal)
[13:53:02] (CR) Awight: [C: +2] added template wizard sessions [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: Svantje Lilienthal)
[14:14:31] (PS4) Mholloway: Add Refine transform function to add normalized host [analytics/refinery/source] - https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320)
[14:14:36] (CR) Mholloway: Add Refine transform function to add normalized host (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320) (owner: Mholloway)
[14:27:54] (PS1) Mholloway: Enable tests requiring hive UDF support [analytics/refinery/source] - https://gerrit.wikimedia.org/r/709479
[14:38:08] (PS2) Mholloway: Enable tests requiring hive UDF support [analytics/refinery/source] - https://gerrit.wikimedia.org/r/709479
[14:39:19] (CR) Mholloway: Add Refine transform function to add normalized host (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320) (owner: Mholloway)
[14:46:44] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (BTullis) I have raised a bug report for {T287869} ...and attempted a patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+...
[14:47:39] (CR) Ottomata: [C: +2] Add Refine transform function to add normalized host [analytics/refinery/source] - https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320) (owner: Mholloway)
[14:48:03] (CR) Ottomata: [C: +2] "Thank you!!!!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/709479 (owner: Mholloway)
[14:48:15] (CR) jerkins-bot: [V: -1] Enable tests requiring hive UDF support [analytics/refinery/source] - https://gerrit.wikimedia.org/r/709479 (owner: Mholloway)
[14:49:29] Analytics, Analytics-Wikistats: wikistats: monthly pageview dumps are not bz2 files - https://phabricator.wikimedia.org/T287684 (fdans)
[14:49:31] Analytics, Dumps-Generation: Monthly Wikimedia pageviews dumps can't be decompressed - https://phabricator.wikimedia.org/T287565 (fdans)
[14:49:50] (PS3) Ottomata: Enable tests requiring hive UDF support [analytics/refinery/source] - https://gerrit.wikimedia.org/r/709479 (owner: Mholloway)
[14:51:28] Analytics-Clusters: Deploy an-test-coord1002 as a Ganeti VM to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (Ottomata) +1 We might need to also make an-test-launcher1001, as an-test-coord1001 is currently serving the role of both an-coord1001 and an-l...
[15:07:27] Analytics, Analytics-Kanban, Event-Platform, Product-Analytics, Patch-For-Review: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (Ottomata)
[15:08:03] Analytics, Analytics-Kanban, Event-Platform, Product-Analytics, Patch-For-Review: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (Ottomata) a:Ottomata→Mholloway
[15:19:51] (CR) Ottomata: [C: +2] Enable tests requiring hive UDF support [analytics/refinery/source] - https://gerrit.wikimedia.org/r/709479 (owner: Mholloway)
[15:28:21] Analytics-Radar, Product-Analytics, Growth-Team (Current Sprint): Add geolocation information to Growth schemas - https://phabricator.wikimedia.org/T287121 (mewoph) Hi @nettrom_WMF — I can add it to the existing patch
[15:31:51] a-team groskin?
[15:33:52] Analytics: Check home/HDFS leftovers of gsingers - https://phabricator.wikimedia.org/T287845 (odimitrijevic) p:Triage→High
[15:34:32] Analytics, Event-Platform: Enable canary events for streams by default - https://phabricator.wikimedia.org/T287789 (odimitrijevic) p:Triage→High
[15:35:51] Analytics, Analytics-Kanban, EventStreams: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 (odimitrijevic) p:Triage→High
[15:36:52] (PS5) MewOphaswongse: Add a link: Update schema to support edit mode and link inspector toggles; add client_ip [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115)
[15:37:13] Analytics-Clusters: Deploy an-test-coord1002 as a Ganeti VM to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (Ottomata) a:Tullis
[15:37:21] Analytics-Clusters: Deploy an-test-coord1002 as a Ganeti VM to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (Ottomata) p:Triage→Medium
[15:37:24] (PS6) MewOphaswongse: Add a link: Update schema to support edit mode and link inspector toggles; add client_ip [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115)
[15:37:31] Analytics: Delete HDFS raw *_camus directories 60 days after July 12 (after 2021-09-10) - https://phabricator.wikimedia.org/T287685 (odimitrijevic) p:Triage→Medium
[15:39:20] Analytics, Analytics-Wikistats, I18n: Translations not getting imported into Wikistats - https://phabricator.wikimedia.org/T287661 (odimitrijevic) p:Triage→Medium
[15:40:19] Analytics-Clusters, Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (Ottomata) a:razzi→None
[15:43:08] Analytics-Clusters: Deploy an-test-coord1002 as a Ganeti VM to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (Ottomata) a:Tullis→BTullis
[15:44:11] Analytics: Review use of realloc in varnishkafka - https://phabricator.wikimedia.org/T287561 (odimitrijevic) What is involved in seeing this patch to production?
[15:46:17] Analytics-Clusters: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (Ottomata) a:BTullis→RKemper
[15:47:58] Analytics-Clusters, Analytics-Kanban: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (Ottomata)
[15:48:00] Analytics: Create aggregate alarms for Hadoop daemons running on worker nodes - https://phabricator.wikimedia.org/T287027 (odimitrijevic) p:Triage→High a:BTullis
[15:48:46] Analytics-Clusters, Analytics-Kanban: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (BTullis) I wonder if this needs to be changed. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/statistics/manifests/rsync/publishe...
[15:48:51] Analytics-Clusters, Analytics-Kanban: Upgrade Matomo to latest upstream - https://phabricator.wikimedia.org/T275144 (Ottomata) a:BTullis→RKemper
[15:49:09] Analytics-Clusters, Analytics-Kanban: Upgrade Matomo to latest upstream - https://phabricator.wikimedia.org/T275144 (Ottomata)
[16:00:38] Analytics-Radar, Wikipedia-Android-App-Backlog (Android Release FY2021-22): android image_recommendation_interaction error - https://phabricator.wikimedia.org/T284620 (odimitrijevic) @Sharvaniharan this data will be deleted automatically in 90 days. Let us know if you wish to keep it.
[16:08:11] (CR) Ottomata: [C: +1] Add a link: Update schema to support edit mode and link inspector toggles; add client_ip [schemas/event/secondary] - https://gerrit.wikimedia.org/r/704402 (https://phabricator.wikimedia.org/T278115) (owner: MewOphaswongse)
[16:09:01] Analytics-Clusters, Analytics-Kanban: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (Ottomata) Def ^
[16:16:16] Analytics, Better Use Of Data, Event-Platform, Metrics-Platform, Product-Data-Infrastructure: Client-side error logging should use Elastic Common Schema (ECS) fields when possible - https://phabricator.wikimedia.org/T267602 (DAbad) a:jlinehan
[16:18:01] Analytics-Radar, Better Use Of Data, Event-Platform, Metrics-Platform, Product-Data-Infrastructure: Explore sending batches of events from EPC libraries - https://phabricator.wikimedia.org/T239996 (Mholloway) TODO: Verify that this is indeed happening in all consolidated libraries.
[16:31:55] Analytics-Radar, Better Use Of Data, Event-Platform, Metrics-Platform, Product-Data-Infrastructure: Explore sending batches of events from EPC app libraries (Java, Swift) - https://phabricator.wikimedia.org/T239996 (Mholloway)
[16:35:10] Analytics-Radar, Better Use Of Data, Event-Platform, Metrics-Platform, Product-Data-Infrastructure: Explore sending batches of events from EPC app libraries (Java, Swift) - https://phabricator.wikimedia.org/T239996 (Mholloway)
[16:35:44] Analytics-Radar, Better Use Of Data, Event-Platform, Metrics-Platform, Product-Data-Infrastructure: Send batches of events from EPC app libraries (Java, Swift) - https://phabricator.wikimedia.org/T239996 (Mholloway)
[16:41:22] Analytics: Finalize Gobblin Migration - https://phabricator.wikimedia.org/T287889 (odimitrijevic)
[16:41:49] Analytics: Finalize Gobblin Migration - https://phabricator.wikimedia.org/T287889 (odimitrijevic) This is a parent task for outstanding Gobblin work.
[16:43:22] Analytics, Analytics-Kanban: Add ability to compare wikis - https://phabricator.wikimedia.org/T283251 (odimitrijevic)
[17:01:55] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (BTullis)
[17:02:40] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (BTullis) Dry run succeeded on the kafka cookbooks, which were the only outstanding change to be tested.
[17:02:51] Analytics, Analytics-Kanban: Analytics Ops Technical Debt - https://phabricator.wikimedia.org/T240437 (BTullis)
[17:02:53] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (BTullis) Open→Resolved
[17:09:45] Analytics-Clusters, Analytics-Kanban, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis)
[17:41:45] Hi btullis elukey ottomata if any of y'all are around, I'm about to start the druid java restarts:
[17:41:46] sudo cookbook sre.druid.roll-restart-workers public
[17:41:46] sudo cookbook sre.druid.roll-restart-workers analytics
[17:41:46] View relevant grafana metrics for the cluster: https://grafana.wikimedia.org/d/000000538/druid?orgId=1&refresh=1m&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_public&var-druid_datasource=All
[17:42:04] I'll wait for confirmation, but I think it's pretty safe
[17:45:20] razzi: i'm here
[17:45:30] proceed!
[17:56:46] Cool ottomata here goes!
[17:57:26] !log sudo cookbook sre.druid.roll-restart-workers public
[17:57:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:40:08] Quarry: dev quarry running with write access - https://phabricator.wikimedia.org/T287902 (mdipietro)
[18:49:22] All Druid jvm restarts completed!
[18:49:22] END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001
[18:49:24] cool
[18:49:36] !log sudo cookbook sre.druid.roll-restart-workers analytics
[18:49:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:18:36] Looks like restarting druid java processes cleared the cache, which caused druid performance to go down for a bit, but the cache hit rate is climbing up; looks like we're fine
[19:20:03] p90 query response time peaked at around 6 seconds, so not terrible
[19:25:38] We've got an alert at the moment: `Status of the systemd unit monitor_refine_eventlogging_analytics on an-launcher1002` - Should I be doing anything about it?
[19:28:18] Oh I see. This is related to the emails that ottomata sent earlier today, right?
[19:33:44] I also noticed that btullis, from logs on the host I see Targets with failures: `event`.`uploadwizarduploadflowevent` /wmf/data/event/uploadwizarduploadflowevent/year=2021/month=7/day=31/hour=21
[19:34:28] Yeah I think ottomata fixed the actual error, and we need to reset the failed status of systemd
[19:35:04] I'm about to sign off for a couple hours for an appointment, I'll be online later in the day!
[19:36:08] All druid restarts are done \o/
[19:36:27] OK, I'll be online for a bit more. So according to the runbook, I would run: `sudo systemctl reset-failed monitor_refine_eventlogging_analytics.service` Is that correct ottomata?
[19:36:39] Great work on the restarts. Sorry I missed your message earlier.
[19:37:17] btullis: yes that would work, but also the timer will fix itself eventually
[19:37:24] actually, i'm not totally sure why it doesn't fix itself on each run...
[19:37:30] maybe it only runs once a day
[19:37:31] checking
[19:37:39] btullis: so ya there are two alerts usually
[19:37:41] one is a safety check
[19:37:44] one is about a specific job failure
[19:37:55] the other is making sure that things that should have been refined over the past N hours have been
[19:38:05] so, if the first fails and we don't re-run within some time period
[19:38:09] we'll get the second alert
[19:38:14] usually that only happens over weekends
[19:39:03] https://usercontent.irccloud-cdn.com/file/kLEZMBga/image.png
[19:39:12] oh yeah, they only run once a day
[19:39:34] so the next run (in 4.5 hrs) would succeed
[19:39:41] btullis: unless you particularly like icinga, I recommend viewing alerts.wikimedia.org/ which has the same alerts but much more... fashionably
[19:39:42] it can be run manually, or you can reset-failed
[19:41:04] Is that the two alerts for this job? So the top one is still failed because it hasn't re-run, but the second is still OK because... it's been re-run manually?
[19:42:39] razzi: TIL about alerts.wikimedia.org :O
[19:42:45] neat
[19:43:40] Thanks razzi. I'm struggling a bit with alerts.wikimedia.org - I find my pupils flailing all over the place, trying to work out what's relevant to me. Maybe I'll get the hang of it. :-)
[19:44:01] haha I noticed the same too :P
[19:44:46] btullis: so ottomata will know the actual answer but based on what he said above, it sounds like the second alert is an alert about the actual state of the world, i.e. if it's firing it's saying "there's stuff that needs to be refined"
[19:45:02] whereas the first alert is for the actual job that does that work...so if the first alert fires, eventually the second one will too, but delayed
[19:45:51] btullis: exactly
[19:45:55] I guess your question-behind-the-question is whether we should manually kick off the job right now or just wait until the next run...to which I have no idea :D
[19:46:18] the monitor_refine_* ones
[19:46:20] just check
[19:46:24] and they only run once a day
[19:46:32] the refine_* ones are the status of the last run timer
[19:46:37] for that refine job
[19:46:40] and that runs once an hour
[19:46:58] so say a refine_* job fails in hour 10
[19:47:03] it will alert and be marked as failed
[19:47:08] then next hour runs and it succeeds
[19:47:15] it will be marked as OK
[19:47:30] but we might miss the fact that something actually went wrong, since now the job's timer is reporting OK
[19:47:38] so, we schedule a second job, monitor_refine_*
[19:47:50] to look back over the past N hours (it might be 28?)
[19:47:59] when it runs, if anything has failed in the last 28 hours
[19:48:05] it is set as failed
[19:48:13] so, this morning I reran a failed refine_* job
[19:48:21] and since then it's been succeeding too
[19:48:32] it's just that the monitor_refine_* job hasn't run yet
[19:48:42] ALL of this hopefully will be easier to understand with airflow
[19:48:55] since we'll have a visual indication of job status per hour
[19:52:18] (CR) Milimetric: [C: -1] "This all looks good to me, just wondering, did you check these references to Camus for any deeper cleaning?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708786 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[19:54:13] Right. I /think/ I've got it. But if "the refine_* ones are the status of the last run timer"...
[19:54:57] Why doesn't this service show any failures over the last 7 days? https://icinga.wikimedia.org/cgi-bin/icinga/trends.cgi?t1=1627847539&t2=1627933939&host=an-launcher1002&service=Check+unit+status+of+refine_eventlogging_analytics&assumeinitialstates=yes&assumestateretention=yes&assumestatesduringnotrunning=yes&includesoftstates=no&initialassumedhoststate=0&initialassumedservicestate=0&backtrack=4&timeperiod=last7days&zoom=4
[19:54:58] ya?
[19:55:39] I would have expected it to show it as critical for the hour after the failed job, until the next time it ran.
[19:56:14] Hmm good q, I would have expected one failure along with the Refine failures for job refine_eventlogging_analytics email from yesterday
[19:56:18] right
[19:56:18] hm
[19:56:30] maybe it's not actually exiting non zero? i think it should be though
[19:56:38] looking
[19:56:57] Phew. I thought I might just need a cup of tea and a lie down.
[19:57:42] OH. i remember
[19:57:48] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/Refine.scala#L348-L351
[19:57:54] this is run in YARN cluster mode
[19:58:04] so we can't actually get a valid exit val
[19:58:11] hence why the job itself sends the email
[19:58:17] rather than allowing the systemd timer to do
[19:58:17] it
[19:58:40] so, what i said before was wrong!
[19:58:51] I guess the refine_* job will always succeed
[19:59:04] according to icinga anyway
[19:59:20] Ah right, so that's why we got two emails, one from `refine@an-worker1125.eqiad.wmnet`
[19:59:33] ?
[19:59:37] yup
[19:59:51] the one from the worker is the spark master process sending an email
[20:00:02] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/Refine.scala#L522-L545
[20:03:06] (CR) Milimetric: [C: -1] "ah! never mind that first one, obviously that's what you do in the next change." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/708786 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[20:03:44] Analytics, Analytics-Kanban, Event-Platform: Enable canary events for streams by default - https://phabricator.wikimedia.org/T287789 (Ottomata)
[20:11:07] ottomata: wrt https://phabricator.wikimedia.org/T285355, for the trafficserver mapping rules I should change `thorium.eqiad.wmnet:8443` to `an-web1001.eqiad.wmnet:8443`, right? https://gerrit.wikimedia.org/g/operations/puppet/+/c7a349388042a2db04120ee767e4a0687a85b5ec/hieradata/common/profile/trafficserver/backend.yaml#9
[20:11:36] yes exactly
[20:12:03] you can wait on that one until you've got an-web1001 up and running and you can test the stuff there
[20:12:58] that (mostly) answered the followup I was about to ask :P
[20:13:20] :)
[20:13:38] Quarry, Patch-For-Review, cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (nskaggs) Open→Resolved As the old clusters are now completely offline, this can be considered closed.
[20:13:55] ottomata: So as far as order of operations, we'll want `role(analytics_cluster::webserver)` on both `thorium` and `an-web1001` in parallel, and then once the rsync nad everything has worked propelry then cut over by decom'ing `thorium` in `site.pp` as well as changing those backend rules?
[20:14:09] rsync and everything has worked properly*
[20:14:44] yeah!
[20:14:49] u got it
[20:14:50] :)
[20:14:56] thanks
[20:14:58] thank you!
[20:32:05] Quarry: Validate and autocomplete database names in the database input field - https://phabricator.wikimedia.org/T287471 (nskaggs) Consider an implementation that also allows for database selection and resolves T76466 also.
[20:36:27] Quarry: Allow filtering of the final report - https://phabricator.wikimedia.org/T61764 (nskaggs) Open→Invalid I believe this bug was originally filed as a request for tsreports which is no longer in operation (If this isn't the case, feel free to re-open). Hence I'm now marking as invalid for Quarry....
[20:39:03] Quarry: Make query URLs have a sluggified version of the title in them - https://phabricator.wikimedia.org/T75885 (nskaggs)
[20:39:06] Quarry: Replace spaces with underscores for wiki usernames in URL - https://phabricator.wikimedia.org/T72166 (nskaggs)
[20:47:02] Quarry: Make query URLs have a sluggified version of the title in them - https://phabricator.wikimedia.org/T75885 (bd808) Quarry's titles are mutable, and URLs that break because of a title change will annoy people and break list of useful queries that have been stored outside of Quarry. It should be possibl...
[20:51:52] Analytics, Analytics-Kanban: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914 (BTullis) Hi @ssingh - I've deployed [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/709484 | the change ]] that I believe will make hive less noisy in production. Specifical...
[20:54:52] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Incorrect log4j configuration on hive servers causing excessive logging - https://phabricator.wikimedia.org/T279304 (BTullis) This change is now merged and deployed. As expected, the `hive-server2` and `hive-metastore` processes have not been re...
[20:58:01] I've deployed the log4j config change to an-coord1001, but I'll need to restart the `hive-server2` and `hive-metastore` services in order to pick up the new settings.
[20:58:07] What's the process for requesting/announcing this kind of maintenance window?
[20:58:40] btullis: i think there isn't an official process, you can probably do that real quick.
[20:58:41] although
[20:58:42] https://yarn.wikimedia.org/cluster/scheduler
[20:58:51] looks like quite a few jobs running right now
[20:59:14] you can usually ignore the wmfdata or spark shell
[20:59:15] ones
[20:59:40] yeah might be worth scheduling something
[20:59:50] i think you can do that just by sending an email to analytics-announce
[20:59:58] 24hrs should be enough
[21:01:30] Thanks. If I say 36 hours from now, that would be 09:00 UTC on Wednesday 4th, which would be better for me than 24 hours from now.
[21:03:01] yup that'll be fine, whenever you prefer
[21:06:43] Would it be correct to say? ... "YARN processes that are running at the time will likely return an error"
[21:08:59] Analytics: Metrics tooltip in detail page is not localized - https://phabricator.wikimedia.org/T287908 (fdans)
[21:13:04] Analytics, I18n: Fixed time range names are forced to capitalized regardless of locale in sidebar - https://phabricator.wikimedia.org/T287910 (fdans)
[21:13:36] Analytics, I18n: Fixed time range names are forced to capitalized regardless of locale in sidebar - https://phabricator.wikimedia.org/T287910 (fdans) p:Triage→Medium
[21:17:26] Analytics, Analytics-Wikistats, I18n: Translations not getting imported into Wikistats - https://phabricator.wikimedia.org/T287661 (fdans) Open→Resolved @Sabeloga this is really valuable feedback! I've opened two tasks with the bugs you've mentioned. They are two super simple issues, but it's...
[21:27:36] nice email ben :)
[21:27:42] Doh! Sent it and forgot to add the Phab task link.
[21:27:47] yeah that sounds right, most likely just ones interacting with hive at the time you restart
[21:30:19] Thanks ottomata.
[21:30:39] I'm signing off for tonight. Be aware that https://gerrit.wikimedia.org/r/c/operations/puppet/+/709484 might well change Hive client logging levels for newly executed jobs. As mentioned in T274914
[21:30:40] T274914: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914
[21:31:00] In case anyone asks.
[21:44:32] ok cool
[21:44:33] laters!
[21:49:23] Analytics-Kanban, Patch-For-Review: Add a spark job loading Cassandra 3 - https://phabricator.wikimedia.org/T280649 (odimitrijevic)
[21:49:29] Analytics: Finalize Gobblin Migration - https://phabricator.wikimedia.org/T287889 (odimitrijevic)
[21:50:47] Analytics, Analytics-Kanban: When gobblin fails, we should know about it - https://phabricator.wikimedia.org/T286559 (odimitrijevic)
[21:50:49] Analytics: Finalize Gobblin Migration - https://phabricator.wikimedia.org/T287889 (odimitrijevic)
[21:56:47] Analytics, Analytics-Kanban, Patch-For-Review: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (odimitrijevic)
[21:56:55] Analytics: Purge gobblin files - https://phabricator.wikimedia.org/T287084 (odimitrijevic)
[21:56:59] Analytics-Kanban, Patch-For-Review: Add a spark job loading Cassandra 3 - https://phabricator.wikimedia.org/T280649 (odimitrijevic)
[21:57:59] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (odimitrijevic)
[22:00:31] Analytics: Finalize Gobblin Migration - https://phabricator.wikimedia.org/T287889 (odimitrijevic) Open→Declined Closing in favor of the existing parent task for the Camus -> Gobblin work: https://phabricator.wikimedia.org/T271232
[22:05:33] Analytics, Data-Engineering: Push Gobblin import metrics to Prometheus and add alerts on some critical imports - https://phabricator.wikimedia.org/T286503 (odimitrijevic)
[22:06:28] Analytics, Analytics-Kanban, Data-Engineering: When gobblin fails, we should know about it - https://phabricator.wikimedia.org/T286559 (odimitrijevic)
[22:07:18] Analytics-Kanban, Data-Engineering, Patch-For-Review: Add a spark job loading Cassandra 3 - https://phabricator.wikimedia.org/T280649 (odimitrijevic)
[22:12:09] (CR) ODimitrijevic: "> Patch Set 4:" [analytics/refinery] - https://gerrit.wikimedia.org/r/706605 (https://phabricator.wikimedia.org/T280649) (owner: Joal)
[22:24:18] Hi Everyone, the Analytics team is in the process of changing its name to Data Engineering to more closely reflect the work that we are doing. You will start seeing the name Data Engineering come up in different systems including here on IRC. We are still working out the finer details of what the changes entail and will communicate them accordingly. For the time being all the channels of communication & collaboration with us
[22:24:18] remain the same.
[23:01:05] olja: is this different than the team supposedly already called "Data Engineering"? https://www.mediawiki.org/wiki/Wikimedia_Product_Infrastructure_Data_Engineering
[23:05:25] that’s right - the data engineering team is part of the technology department and is different from the product data engineering team
[23:16:34] ok, seems a bit confusing to have 2 teams named the same thing...
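
The look-back check described between 19:46 and 19:48 (an hourly refine_* job whose latest status can hide an earlier failure, plus a daily monitor_refine_* job that scans the past N hours) can be sketched roughly as below. This is only a minimal Python illustration under stated assumptions: the function names, the status dictionary and the dates are hypothetical, the 28-hour window is just the figure guessed at in the discussion ("it might be 28?"), and none of it is taken from the actual Refine code.

```python
from datetime import datetime, timedelta


def failed_hours(refine_status, now, lookback_hours=28):
    """Return the hourly partitions inside the look-back window that are not refined.

    refine_status is a hypothetical mapping from an hourly partition (datetime)
    to True if that hour was refined successfully, False otherwise.
    """
    window_start = now - timedelta(hours=lookback_hours)
    return sorted(
        hour
        for hour, refined in refine_status.items()
        if window_start <= hour <= now and not refined
    )


def monitor_exit_code(refine_status, now, lookback_hours=28):
    """Return 1 (unit would be marked failed) if any hour in the window failed, else 0."""
    return 1 if failed_hours(refine_status, now, lookback_hours) else 0


# Hypothetical day: hour 10 failed, hour 11 succeeded, the monitor runs at hour 12.
now = datetime(2021, 1, 2, 12)
status = {
    datetime(2021, 1, 2, 10): False,
    datetime(2021, 1, 2, 11): True,
}
```

With this hypothetical status, `monitor_exit_code(status, now)` is non-zero even though the most recent hourly run succeeded, which is the gap the second job is meant to cover. Once the failed hour is rerun and recorded as `True`, the next monitor run returns 0.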