[01:25:02] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:27:17] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[10:09:17] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[10:10:17] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[10:37:52] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[11:06:06] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[11:06:34] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[11:13:51] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[11:14:11] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[11:41:51] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[11:45:53] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[11:46:23] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (BTullis) Dry-run for sre.zookeeper.roll-restart-zookeeper succeeded.
[11:46:39] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (BTullis)
[12:14:18] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[12:35:39] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[12:36:05] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[12:37:51] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[12:39:12] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[12:43:20] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[12:44:22] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[12:47:49] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[13:36:25] hellooo teamm :]
[13:36:29] Hi mforns :)
[13:36:57] Hello mforns.
o/
[13:37:26] I was thinking yesterday that I'm too lucky that my ops week is after joal's, we should change that, others should enjoy the absence of alerts as well...
[13:40:33] I'm currently looking into the workflow for testing a puppet change. Am I right in thinking that I need to set up a VM under Horizon, so that I can apply my changes to it?
[13:41:01] Analytics-Kanban, Patch-For-Review: Add a spark loader to support Cassandra 3 - https://phabricator.wikimedia.org/T280649 (JAllemandou)
[13:41:16] Analytics-Kanban, Patch-For-Review: Add a spark job loading Cassandra 3 - https://phabricator.wikimedia.org/T280649 (JAllemandou)
[13:41:30] (PS3) Joal: [WIP] Load cassandra3 from spark [analytics/refinery/source] - https://gerrit.wikimedia.org/r/686629 (https://phabricator.wikimedia.org/T280649) (owner: Milimetric)
[13:42:27] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:51:01] (Abandoned) Milimetric: [WIP] Refactor state for cleanliness and consistency [analytics/wikistats2] - https://gerrit.wikimedia.org/r/631791 (https://phabricator.wikimedia.org/T262725) (owner: Milimetric)
[13:54:00] (CR) Milimetric: [C: +2] Alter routing logic to allow value lists [analytics/wikistats2] - https://gerrit.wikimedia.org/r/694634 (https://phabricator.wikimedia.org/T283596) (owner: Fdans)
[13:54:21] btullis: the horizon road is very painful, it takes a lot to configure a VM in the cloud realm with puppet etc.. plus it would need a dedicated puppetmaster (self-hosted) to be able to apply/test your change
[13:54:56] https://wikitech.wikimedia.org/wiki/Puppet/Pontoon is aimed to fill the gap in puppet testing, but I have never really used it (good point of contact: Filippo)
[13:55:32] what I usually do is to test the change via puppet catalog compiler (utils/pcc in the puppet repo for the script, or https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/ for a manual run)
[13:55:38] (CR) jerkins-bot: [V: -1] Alter routing logic to allow value lists [analytics/wikistats2] - https://gerrit.wikimedia.org/r/694634 (https://phabricator.wikimedia.org/T283596) (owner: Fdans)
[13:55:52] so I can get the diff
[13:56:07] then if multiple hosts are affected, you can disable puppet on them and run puppet one at a time
[13:56:32] it is still not the best/perfect way to remove any doubt of problems, but it is a compromise
[13:56:42] OK, thanks elukey. I saw pontoon mentioned in #wikimedia-sre this morning. Looks useful and I might investigate it.
[13:56:54] definitely Filippo is doing great work
[13:57:05] it is aimed to remove all the boilerplate stuff
[13:57:25] but the puppet catalog compiler is usually enough for simple/medium things
[13:57:44] (if you have a specific change in mind I can review/help deploying it)
[13:58:16] I also suggest reading https://wikitech.wikimedia.org/wiki/Cumin#Host_selection if you haven't
[13:58:26] cumin is REALLY awesome, it can pull data from puppetdb
[13:58:33] (CR) Milimetric: [C: +2] Change state to allow more than one project [analytics/wikistats2] - https://gerrit.wikimedia.org/r/697797 (https://phabricator.wikimedia.org/T283624) (owner: Fdans)
[13:58:34] It's work that I'm looking at for https://phabricator.wikimedia.org/T268985 so it would affect all Kerberos clients and I'll need to be super-careful.
[13:59:13] if you want to have an idea about the impacted hosts
[13:59:31] sudo cumin 'c:profile::kerberos::client'
[13:59:35] (will emit only a list)
[13:59:38] (from cumin1001)
[13:59:48] (CR) jerkins-bot: [V: -1] Alter routing logic to allow value lists [analytics/wikistats2] - https://gerrit.wikimedia.org/r/694634 (https://phabricator.wikimedia.org/T283596) (owner: Fdans)
[13:59:50] (CR) jerkins-bot: [V: -1] Change state to allow more than one project [analytics/wikistats2] - https://gerrit.wikimedia.org/r/697797 (https://phabricator.wikimedia.org/T283624) (owner: Fdans)
[14:02:29] Cool, thanks. I'll have a think. I'm more used to puppet environments <-> branches so this will take a bit of getting used to. I can still see some value in a VPS project for Kerberos, but it does sound like a lot of work.
[14:04:58] btullis: I'd suggest not to proceed in that direction (especially for kerberos) to avoid a lot of frustration and time spent in debugging why things are not working (the horizon infra has always been a little difficult to work with puppet). Pontoon might be a solution, but spinning up a separate KDC etc.. for kerberos in cloud is challenging
[14:05:30] this is also why we have a "Test" cluster in production, since cloud vs prod are very different from the puppet point of view
[14:05:31] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:07:43] elukey: Gotcha. 👍 So I could maybe target a kerberos (krenew) change at an-test-client1001.eqiad.wmnet and exclude the production clients. Would that work?
[14:08:49] btullis: yeah I think it would be a good compromise!
[14:09:10] so you'll be able to test various changes making sure that all works etc..
[14:09:24] Nice, thanks.
[14:09:25] (spark, hadoop tools, etc..)
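Editor's note: the staged approach discussed above (apply the change to a test-cluster client first, then run puppet on the remaining affected hosts one at a time) can be sketched as a small ordering helper. This is only a rough illustration under stated assumptions: the function name `rollout_order`, the `an-test-` prefix convention, and the `example-prod*` host names are hypothetical, and the host list is assumed to have been copied by hand from the cumin output. Nothing here calls cumin or puppet.

```python
from typing import Iterable, List


def rollout_order(hosts: Iterable[str], test_prefix: str = "an-test-") -> List[str]:
    """Order hosts so that test-cluster hosts come before production hosts.

    Hypothetical helper: `test_prefix` is an assumed naming convention,
    not something defined by cumin or puppet.
    """
    # De-duplicate and drop blank entries, keeping a stable alphabetical order.
    unique = sorted({h.strip() for h in hosts if h.strip()})
    test_hosts = [h for h in unique if h.startswith(test_prefix)]
    prod_hosts = [h for h in unique if not h.startswith(test_prefix)]
    return test_hosts + prod_hosts


if __name__ == "__main__":
    # Host list as it might be pasted from a host-selection query.
    # Only the an-test-client1001 name comes from the conversation;
    # the example-prod names are placeholders.
    affected = [
        "example-prod1002.eqiad.wmnet",
        "an-test-client1001.eqiad.wmnet",
        "example-prod1001.eqiad.wmnet",
    ]
    for step, host in enumerate(rollout_order(affected), start=1):
        print(f"step {step}: run puppet on {host}, check the result, then continue")
```

The helper only decides the order; disabling puppet, running it per host, and checking the result remain manual steps.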
[14:14:50] for anything kerberos-related I am free anytime for a brainbounce
[14:18:23] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:46:56] Hey all!
[14:47:17] Who can I ask to get these two patches merged:
[14:47:17] https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/703838
[14:47:17] https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/703753
[15:21:39] Hi all! I’m still eating breakfast but I’m game for game time, or just chill and chat, I’ll be in the batcave in 15!
[15:28:46] Andrew-WMDE: ask mforns and/or milimetric
[15:29:02] hi Andrw
[15:29:10] oops, hi Andrew-WMDE, I'll take a look
[15:29:37] milimetric, Andrew-WMDE I think you should have the power to merge no?
[15:30:00] I gave a +1 :]
[15:33:36] mforns: Thanks for the +1!
[15:33:48] but I don't have the power to merge
[15:33:53] O.o
[15:34:00] I thought you had, you should
[15:34:28] milimetric: didn't we allow for other devs to merge reportupdater-queries?
[15:34:53] (CR) Mforns: [V: +2 C: +2] Add aggregations for template data usage in TemplateWizard [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/703838 (https://phabricator.wikimedia.org/T272589) (owner: Andrew-WMDE)
[15:34:57] I did, Andrew-WMDE, are you not in that group... lemme see
[15:35:04] (CR) Mforns: [V: +2 C: +2] Add aggregations for template data usage in VE's template dialog [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/703753 (https://phabricator.wikimedia.org/T272589) (owner: Andrew-WMDE)
[15:36:30] I merged both changes Andrew-WMDE, I hope next time you are free to do that as well :]
[15:37:06] milimetric: how do you add people to the group? is it in gerrit?
[15:37:09] Andrew-WMDE: the group I gave access to was wmde-qwerty, are you in there?
[15:37:13] mforns: ^
[15:37:19] aha, via puppet?
[15:39:06] milimetric: Yep, I'm in wmde-qwerty
[15:42:10] but for some reason I'm not able to +2 in analytics/reportupdater-queries
[15:51:17] Analytics: [EventGate] Failures when getting stream config from MediaWiki API - https://phabricator.wikimedia.org/T286793 (mforns)
[16:10:55] Analytics, DBA, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[16:15:29] Analytics, DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[16:19:43] mforns, milimetric: I’m off for today, thanks again for merging the patches!
[16:32:37] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:07:02] Analytics, Better Use Of Data, Metrics-Platform, Product-Data-Infrastructure: Define acceptable usage of the `meta` object in event schemas - https://phabricator.wikimedia.org/T273293 (DAbad)
[17:14:41] Analytics, Better Use Of Data, Event-Platform, Metrics-Platform, Product-Data-Infrastructure: Document in-schema who sets which fields - https://phabricator.wikimedia.org/T253392 (DAbad)
[17:18:14] Analytics, Better Use Of Data, Event-Platform, Product-Data-Infrastructure, Metrics-Platform (Metrics-Platform-MVP-Release-1): Client-side error logging should use Elastic Common Schema (ECS) fields when possible - https://phabricator.wikimedia.org/T267602 (DAbad) @jlinehan @Ottomata see we h...
[17:18:33] Analytics, Better Use Of Data, Event-Platform, Metrics-Platform, Product-Data-Infrastructure: Client-side error logging should use Elastic Common Schema (ECS) fields when possible - https://phabricator.wikimedia.org/T267602 (DAbad)
[17:19:46] Analytics, Better Use Of Data, Event-Platform, Metrics-Platform, Platform Team Workboards (Clinic Duty Team): Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (DAbad)
[17:53:16] Analytics, Better Use Of Data, Event-Platform, Metrics-Platform: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (Mholloway) >>! In T277193#6905467, @Ottomata wrote: > Solutions? > > A. Restructure wgEventStreams to be keyed by str...
[18:01:58] (CR) Milimetric: "Looks good, heading in the right direction. Style problem on the Wiki selector with the icons (the way it is will hover over your time se" (1 comment) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/700098 (https://phabricator.wikimedia.org/T285050) (owner: Fdans)
[19:05:16] Analytics, Wikimedia-production-error: '.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string - https://phabricator.wikimedia.org/T286814 (cjming)
[19:05:51] Analytics, Wikimedia-production-error: '.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string - https://phabricator.wikimedia.org/T286814 (cjming)
[19:07:46] Analytics, Wikimedia-production-error: '.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string - https://phabricator.wikimedia.org/T286814 (cjming)
[19:10:55] Analytics, Analytics-EventLogging, Wikimedia-production-error: '.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string - https://phabricator.wikimedia.org/T286814 (cjming)
[19:13:02] Analytics, Analytics-EventLogging, Wikimedia-production-error: '.event.abort_timing' should be integer - https://phabricator.wikimedia.org/T286815 (cjming)
[20:44:56] (PS1) Mholloway: Add Refine transform function to add normalized host [analytics/refinery/source] - https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320)
[20:53:03] (CR) Mholloway: Add Refine transform function to add normalized host (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320) (owner: Mholloway)
[20:55:10] (PS2) Mholloway: Add Refine transform function to add normalized host [analytics/refinery/source] - https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320)
[20:56:00] Analytics, Event-Platform, Product-Analytics, Patch-For-Review: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (Mholloway) I gave this a try for 10% time today.