[02:21:57] 10Analytics-Clusters, 10Analytics-Radar, 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10lmata) [02:27:18] 10Analytics, 10SRE Observability: Indexing errors / malformed logs for aqs on cassandra timeout - https://phabricator.wikimedia.org/T262920 (10lmata) [02:39:36] 10Analytics-Radar, 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10Performance-Team (Radar): Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10lmata) [07:38:15] (03PS8) 10Martaannaj: Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152 [07:39:33] (03CR) 10Martaannaj: Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152 (owner: 10Martaannaj) [07:39:45] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10elukey) [07:53:30] (03CR) 10DCausse: [C: 03+1] Rematerialize fragment schemas with generated examples. (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/702700 (https://phabricator.wikimedia.org/T270134) (owner: 10Ottomata) [08:03:53] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [08:04:04] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [08:04:49] 10Analytics, 10SRE, 10Tracking-Neverending: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10fgiunchedi) [08:08:49] 10Analytics, 10SRE: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10RhinosF1) [08:13:13] 10Analytics, 10SRE: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Ladsgroup) Is it coming from puppet? It should be migrated to systemd timer if that's the case: {T273673} [08:36:55] Morning all. [08:41:33] good morning :) [08:59:57] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10aborrero) >>! In T286065#7194569, @Bstorm wrote: > @aborrero does cloudgw require manual failover? it doesn't require manual failover, but we could... 
[09:02:15] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Kormat) [09:07:07] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:07:53] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:12:04] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:13:25] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:15:19] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:15:56] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [09:29:30] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10MoritzMuehlenhoff) [09:38:52] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [10:26:36] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) [10:29:13] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) @BStorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to //cl... [10:33:10] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) [10:33:43] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) @Bstorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to cl... [11:34:08] Hi teq, [11:34:20] team - sorry - wrong layout :) [11:56:25] bonjour joal [11:56:36] Hi elukey - How are you? [11:57:14] elukey: feeling like a european-champ? ;) [11:59:12] joal: \o/ \o/ \o/ [12:06:12] I'm having a bit of an issue with the SSO this morning. Can't get into Icinga for example, although I've definitely logged in successfully previously. [12:06:29] weird btullis :( [12:06:41] btullis: what is the issue? [12:06:52] Which version of the username should I be using at https://idp.wikimedia.org/login ? btullis, Btullis or BTUllis (WMF) ? [12:07:37] It is your Wikimedia developer account, so in theory Btullis [12:08:03] the lowercase one is the shell name, the other should be your login for wiki-related things (yes I know confusing :( ) [12:08:53] elukey: Quick qeustion for you - Would you be aware of an issue the Saturday on upload-cache? 
We experienced a data-loss for 2021-07-10T11:00 [12:09:12] icinga is *very* picky about casing of names - I have to log in as Hnowlan to icinga but I use hnowlan absolutely everywhere else [12:10:08] hnowlan: In theory with CAS it should have been solved :( [12:11:18] I tried to log off, login as "Elukey" and enter icinga, it seems working [12:11:24] Thanks. I can't get access to any of these at the moment: https://wikitech.wikimedia.org/wiki/Single_Sign_On#What_sites_are_SSO_enabled? although I could access *some* of them prior to last week. [12:12:08] Maybe I'll just do the reset password link for now. That takes me to wikitech. [12:12:35] btullis: do you get errors while logging in, or after a successful SSO login? [12:12:43] (just to understand how to help) [12:13:51] During login: from SSO login screen. "Authentication attempt has failed, likely due to invalid credentials. Please verify and try again. " [12:14:16] lovely, I just checked LDAP for something weird but it seems good [12:15:42] If I go to https://directory.corp.wikimedia.org/ and use my lowercase LDAP username 'btullis' I can log in successfully. [12:16:24] ah that one is another LDAP IIRC, it is used by OIT for the gmail accounts etc.. [12:16:42] the wikimedia developer account should be the one that you also use to login into wikitech [12:16:45] (IIRC) [12:17:13] that creates the prod LDAP account (cn: Btullins, shell: btullis) [12:17:21] err *Btullis [12:17:51] the directory.corp.etc.. is basically needed only for the Gsuite [12:17:59] The authentication for the SSO login is actually case-insensitive, the only part which is not, is Icinga's internal handling of permissions (e.g. to be able to downtime a service) [12:18:17] since those are not read from LDAP, but instead of a CGI conffile we manage via Puppet [12:19:04] finer technical details at https://phabricator.wikimedia.org/T256656#6266825 [12:19:38] moritzm: but Ben gets an error at the CAS-SSO level, it may be due to password mismatch (corp LDAP account vs developer account) [12:19:42] it is very confusing [12:21:03] ah yes, that OIT username is entirely different [12:22:16] OK. Thanks all. I have reset my Wikitech password and I now have passed the SSO login stage and can access Icinga again. It's a different password from the directory.corp LDAP password, so I should be able to understand where it's used now. [12:23:27] btullis: perfect! [12:31:04] !log Rerun failed webrequest hour after having checked that loss was entirely false-positive [12:31:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:35:30] joal: sorry checking for saturday [12:35:52] elukey: np - I think it must have been a network glitch [12:38:27] joal: I didn't find much, was it confined to a specific dc or global? [12:40:22] nope, error in upload, warning for text, all DCs - Must have come from some weird unsync I assume? (we have many rows with sequence_id before H ending up in H+1) [12:42:07] yes I don't see things on fire around that time [12:42:49] no big deal elukey - I'll investigate more if this happens more (we have moved to Gobblin last week, shouldn't be related but eh) [13:15:53] hello! [13:17:19] joal shall we start on events? [13:17:25] turn on the gobblin job? [13:17:41] we surely can do that ottomata - good morning :) [13:17:52] mornin! [13:18:25] ottomata: I assume the first thing will be to make a job in event_gobblin folder, making sure it contains all expected folders and data for a few hours, and then switch? [13:18:29] yup! 
[13:18:37] https://gerrit.wikimedia.org/r/c/analytics/refinery/+/703866 [13:19:12] hmmm [13:19:15] wrong final.dir [13:19:18] need _gobblin [13:19:40] reading [13:20:02] (03PS2) 10Ottomata: Add event_default gobblin job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703866 (https://phabricator.wikimedia.org/T271232) [13:20:27] ottomata: naming question - Should it be events_default instead of event_default? [13:21:33] i don't think so, the dir and database are called event [13:21:43] and the other one isn't called webrequests [13:21:44] :) [13:23:18] ottomata: event is a 'bundle' of plenty of different streams of events - webrequest would not be such - and 'event' is actually a DB, while webrequest is a table - But I get your point [13:24:05] hm, good point too, and webrequest itself is really an event [13:25:12] but, webrequest table has many webrequests, and we don't pluralize it [13:25:13] hm [13:25:46] i think i'd keep it the same for consistency, the refine job is refine_event [13:26:13] i maybe have a slight personal preference for avoiding plurals for things like this...but maybe i'm wrong about that and am not personally consistent :) [13:29:56] ottomata: works for me - no big deal [13:30:19] ok [13:30:32] going to merge and deploy https://gerrit.wikimedia.org/r/c/analytics/refinery/+/703866 [13:30:32] and then [13:30:35] https://gerrit.wikimedia.org/r/c/operations/puppet/+/703867 [13:32:04] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add event_default gobblin job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703866 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:37:51] (03CR) 10Joal: "One comment about number of mappers." (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703866 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:37:57] OH JOAL [13:37:58] sorry [13:38:03] arf - ok too slow :) [13:38:20] thought your message above was approval, sorry! [13:38:34] sorry I was not monitoring IRC while writing my comment :) [13:38:39] nothing important though [13:38:55] hi team! [13:38:57] we can move forward as is, I'll monitor runs and suggest changes based on monitoring [13:39:01] hi mforns :) [13:39:05] joal do you want to reduce it, just to keep the max number of mappers low and reduce the amount of capacity the job takes up in the cluster? [13:39:30] i had thought that max mappers just meant max, so it would only use them if it needed, and it would be unlikely to be running that many at a time? [13:39:35] hi mforns 1 [13:39:36] !
[13:39:45] ottomata: yes, and to take advantage of single-jvm with multiple tasks [13:39:50] :D [13:40:08] ok lets reduce, it's not hard to do [13:40:25] ottomata: by default gobblin will use as many mappers as the number of tasks affords [13:40:45] so if we wish to take advantage of multi-tasks within mappers, we need to reduce [13:41:05] ottomata: we can also do it later, after having looked at mapper duration - if duration is very small, we can reduce :) [13:42:55] lets do it now, very easy [13:45:39] (03PS1) 10Ottomata: Set number of max mappers for gobblin event_default to 128 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704117 [13:45:43] joal ^ [13:45:58] (03CR) 10Ottomata: Add event_default gobblin job (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703866 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:49:11] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Set number of max mappers for gobblin event_default to 128 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704117 (owner: 10Ottomata) [13:55:20] (03PS1) 10Ottomata: gobbin event_default - Fix typo [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704118 (https://phabricator.wikimedia.org/T271232) [13:55:36] (03CR) 10Ottomata: [V: 03+2 C: 03+2] gobbin event_default - Fix typo [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704118 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [13:59:31] ottomata: sorry I'm not concentrated - Will follow you closely from now on :S [13:59:48] np! [13:59:51] initiating the first run [14:00:00] Ack ottomata - will check folders [14:00:23] ottomata: you do it manually, or by timer? [14:01:19] timer [14:01:29] Emitting WorkUnitsCreated Count: 33 [14:01:33] with things like [14:01:37] MultiWorkUnit 32: estimated load=0.004771, partitions=[[eqiad.eventgate-main.error.validation-0], [eqiad.w3c.reportingapi.network_error-0], [codfw.mediawiki.revision-tags-change-0]] [14:01:59] interesting! [14:01:59] Min load of multiWorkUnit = 0.003010; Max load of multiWorkUnit = 0.004771; Diff = 36.907025% [14:02:09] what is the load estimation? [14:02:13] # of partitions? [14:02:19] or some account of volume? [14:02:21] number of events and partitions [14:02:30] here, no events, so very low load [14:02:35] oh right [14:02:42] ok looks like first run finished [14:02:47] ok if I trigger 2nd? [14:03:09] sure, please do [14:03:16] k [14:04:04] ah ha [14:04:04] Min load of multiWorkUnit = 0.003010; Max load of multiWorkUnit = 51185.033253; Diff = 99.999994% [14:04:22] nice, it really does seem to do that very smartly! [14:04:33] looking briefly at the assignment of topic partitions to work units [14:04:37] small ones are grouped together [14:04:41] large ones get dedicated units [14:04:53] ottomata: I have looked at task-sizing estimators etc and it is modular so that we can change strategies etc [14:05:06] the smallest workunit with more than one topic partition is [14:05:07] MultiWorkUnit 17: estimated load=0.004771, partitions=[[eqiad.mediawiki.page-move-0], [codfw.test.instrumentation-0], [codfw.ios.edit_history_compare-0]] [14:05:20] which is good [14:05:24] very cool [14:05:55] ottomata: with this approach, the probability of starvation due to big topics is low [14:06:04] yeah [14:06:05] indeed!
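A note for readers following the work-unit output quoted above: one rough way to eyeball Gobblin's bin-packing is to pull the `MultiWorkUnit ... estimated load=` lines out of the launcher's log output. This is only a sketch; the log path below is an assumption, not the actual location used by the event_default timer.

```bash
# Minimal sketch, assuming the gobblin launcher writes its output to a plain log file
# (path below is hypothetical; adjust to wherever the timer actually logs).
GOBBLIN_LOG=/var/log/gobblin/event_default.log

# How many work units were created in this run:
grep 'Emitting WorkUnitsCreated Count' "$GOBBLIN_LOG"

# The five most heavily loaded work units, i.e. the topic partitions most likely
# to get a dedicated mapper rather than being packed together with small ones:
grep -E 'MultiWorkUnit [0-9]+: estimated load=' "$GOBBLIN_LOG" \
  | sort -t'=' -k2 -g \
  | tail -n 5
```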
[14:06:06] very cool [14:06:20] wellllll depends on how big and how long the final import takes [14:06:44] ottomata: I'm happy we find some positive aspects to gobblin :) If it were only trouble it'd be a shame :) [14:06:50] if the biggest toppar takes longer than the period between launching jobs (in this case an hour) [14:06:53] then it could [14:07:08] but i think with event sizes so far, this should be ok [14:07:13] yes true ottomata [14:07:13] cool so the 2nd run finished and wrote data [14:07:16] lets look at dirs! [14:07:19] yup [14:07:53] looks great [14:08:15] ottomata: I don't know how we should look at those to monitor correctness though - We have no easy way to compare two different folders, right? [14:08:17] ok, so after meetings today maybe we can finalize [14:08:28] works for me ottomata [14:08:29] eh? [14:08:42] we could count # of records once we get a full hour [14:08:47] ottomata: For instance, checking that no folder is left-over [14:08:49] and compare to event [14:08:50] right? [14:08:52] oh [14:08:56] that we are getting them all? [14:09:02] for sure we can do that - I was more thinking in terms of streams [14:09:06] right [14:09:07] hm [14:09:27] hmm, we should be able to check once we get a full hour [14:09:36] since both dirs should have worked on the same streams [14:09:40] minus mediawiki.job [14:10:09] that's a good point though lets make sure to do that first [14:10:26] get a full hour, get the list of topics imported by camus and gobblin in those dirs [14:10:28] and compare [14:10:36] it'll be a visual comparison, just making sure it looks right [14:13:52] ack ottomata [14:14:08] let's wait for some time before finalizing [14:14:21] ottomata: the job runs every hour? [14:19:02] ya [14:19:11] joal lets do it after meetings today [14:19:20] maybe about 1pm my time? [14:19:45] 2h 40 mins ish from now? [14:20:04] ok ottomata :) [14:30:05] (03CR) 10Ottomata: "I don't have a great alternative suggestion, but the name of these schemas and streams seems like they could be a little more descriptive." (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152 (owner: 10Martaannaj) [14:36:55] joal: recalling airflow work from the week before last, I managed to make spark-sql work with a custom operator, but noticed that both Airflow's SparkSql operator and the command line spark-sql command do not support executing queries with --client-mode cluster, thus IIUC the reduce operations would be executed on the airflow node. I understand this is a con of spark-sql vs hive, no? [14:37:52] mforns: client-mode vs cluster-mode determines where the spark driver is executed, on the client or on the cluster (not the reduce) [14:38:18] isn't the reduce executed in the driver? [14:38:35] mforns: this is a downside nonetheless, as for some jobs drivers are big [14:38:45] understood [14:39:14] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, 10Patch-For-Review: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata) @Pchelolo, @colewhite Q: I'm getting close to getting this working, but I seem to be miss... [14:39:15] mforns: reduce is a parallel step, it's executed on workers usually [14:39:29] I see, thanks [14:40:33] mforns: i think usually as long as the driver isn't pulling data down and manipulating it locally, it'd be ok. [14:40:45] for just sql queries, i don't think that would (could?) happen [14:40:54] I also imagine not! [15:04:16] a-team standup!
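To make the client-vs-cluster exchange above concrete: the `spark-sql` CLI (and Airflow's SparkSqlOperator, which shells out to it) only runs with the driver on the submitting host, while wrapping the query in a script allows `--deploy-mode cluster` (the actual spark-submit flag), which puts the driver in a YARN container instead. A minimal sketch, with made-up file names:

```bash
# Driver runs on the host that launches this (i.e. the Airflow worker / launcher node):
spark-sql --master yarn -f /path/to/hypothetical_query.hql

# Workaround sketch (not necessarily what the team ends up doing): wrap the SQL in a
# small PySpark script and submit it in cluster mode, so the driver runs on YARN.
spark-submit --master yarn --deploy-mode cluster /path/to/hypothetical_run_sql.py
```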
[15:23:39] ottomata: o/ I am going to deploy a change for the ML cluster, is it ok if I skip the SRE sync? Happy to answer anything on IRC later on of course [15:28:50] (I can join with 10 mins of delay otherwise) [15:32:03] elukey: sure! [15:37:17] ottomata: some trouble to work on sorry, will be available on IRC if needed :( [15:38:02] 10Analytics-Radar, 10SRE, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Ottomata) Hiya, checking in! We'd love to move on {T275767}, any new ETA? Thanks! [15:38:25] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, 10Patch-For-Review: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10colewhite) @Ottomata this metric is part of service-template node, but is not yet merged: https://g... [15:41:37] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (10Ottomata) a:03BTullis [15:43:58] 10Analytics-Clusters, 10User-razzi: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10Ottomata) a:05razzi→03BTullis [15:45:31] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10MoritzMuehlenhoff) [15:46:56] 10Analytics-Clusters, 10User-MoritzMuehlenhoff: Reduce manual kinit frequency on stat100x hosts - https://phabricator.wikimedia.org/T268985 (10Ottomata) a:03BTullis [15:51:57] 10Analytics-Clusters: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10Ottomata) a:05razzi→03None [15:52:27] 10Analytics-Clusters: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10BTullis) a:03BTullis [15:53:05] 10Analytics-Clusters: Upgrade Matomo to latest upstream - https://phabricator.wikimedia.org/T275144 (10Ottomata) a:05razzi→03BTullis [15:53:52] 10Analytics-Clusters, 10Patch-For-Review: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (10Ottomata) a:05razzi→03None [15:55:35] 10Analytics, 10Analytics-Kanban: Fix default ownership and permissions for Hive managed databases in /user/hive/warehouse - https://phabricator.wikimedia.org/T280175 (10Ottomata) a:03Ottomata [15:59:22] 10Analytics, 10Analytics-Kanban: Data drifts between superset_production on an-coord1001 and db1108 - https://phabricator.wikimedia.org/T279440 (10Ottomata) a:05razzi→03None [16:03:10] 10Analytics, 10Analytics-Kanban: Fix gobblin not writing _IMPORTED flags when runs don't overlap hours - https://phabricator.wikimedia.org/T286343 (10odimitrijevic) p:05Triage→03High Not a prerequisite for the gobblin migration [16:13:51] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Pageviews-Anomaly: Analyse possible bot traffic for ptwiki article Ambev - https://phabricator.wikimedia.org/T282502 (10odimitrijevic) p:05Triage→03High a:03Milimetric [16:28:05] ottomata: heya - ready when you wish for gobblin finalization [16:28:23] ottomata: I'm gonna start looking at some data to see if it seems correct [16:31:41] ok gr8! 
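The correctness spot-check discussed below (same streams and same row counts in the Camus and Gobblin raw directories) can be sketched roughly as follows. The topic name is one of the folders checked later in this log; the per-topic directory layouts and the assumption that `hdfs dfs -text` decodes the files into one record per line are guesses, not confirmed details of the actual setup.

```bash
# 1) Compare the list of imported streams (one directory per topic) in both roots.
hdfs dfs -ls /wmf/data/raw/event         | awk -F/ 'NF>1 {print $NF}' | sort > /tmp/topics_camus.txt
hdfs dfs -ls /wmf/data/raw/event_gobblin | awk -F/ 'NF>1 {print $NF}' | sort > /tmp/topics_gobblin.txt
diff /tmp/topics_camus.txt /tmp/topics_gobblin.txt   # expect only mediawiki.job.* differences

# 2) Compare record counts for one topic and one fully imported hour
#    (camus-style "hourly/YYYY/MM/DD/HH" vs gobblin hive-style partitions are assumptions).
TOPIC=codfw_mediawiki_revision-create
hdfs dfs -text "/wmf/data/raw/event/${TOPIC}/hourly/2021/07/12/15/*" | wc -l
hdfs dfs -text "/wmf/data/raw/event_gobblin/${TOPIC}/year=2021/month=07/day=12/hour=15/*" | wc -l
```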
[16:33:03] 10Analytics: Refactor analytics-meta MariaDB layout to multi instance with failover - https://phabricator.wikimedia.org/T284150 (10Ottomata) [16:33:05] 10Analytics, 10Analytics-Kanban: Data drifts between superset_production on an-coord1001 and db1108 - https://phabricator.wikimedia.org/T279440 (10Ottomata) [16:39:57] ok ottomata - I tested two folders (codfw_mediawiki_revision-create and codfw_wdqs-external_sparql-query), and they have the same number of rows from camus and gobblin [16:40:51] great! [16:41:25] ottomata: let's find a way to check for number-of-streams correctness, and then it's all good :) [16:50:23] joal https://gist.github.com/ottomata/a3a624818df9a9fcfc705a4359bd5b11 [16:50:56] \o/ [16:51:17] ok ottomata - let's wait for hour 16 to be finished maybe? [16:52:02] ok, then we can delete hour 14 and 15, right? [16:52:12] hm [16:52:23] ottomata: let's keep hour 15 [16:52:26] refine doesn't start --until 2 [16:52:29] 10Analytics, 10Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (10mpopov) [16:52:32] ottomata: currently checking where refine is [16:52:48] hm do we kinda need to wait for hour 17? [16:52:57] we could force refine for hour 15 and 16 asap [16:54:33] ottomata: I'd feel more confident if we had reached, in gobblin, data already refined from camus (meaning refine having done hour 15) [16:55:11] ottomata: I think this will be the case in a bit more than 1H [16:55:30] if ok for you, I'll go and have dinner with the kids now, and then we proceed? [16:55:35] as you wish ottomata [16:55:55] joal sounds good [16:56:05] right yeah i think so too [16:56:13] we can make refine run early once 15 is done for cmaus [16:56:16] camus and do it [16:56:24] go ahead and lets do that after you're done with kids ya! [16:56:38] ok let's do it this way then - Let's reconvene in 1h+ - thanks ottomata :) [16:56:49] 10Analytics, 10Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (10mpopov) [17:05:14] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, 10Patch-For-Review: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata) Oh great, thanks Cole. What do we need to do to get that merged and released? [17:23:26] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, 10Patch-For-Review: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata) NM, petr answered on PR. Working on it. [17:44:46] joal, remind me of what will change with gobblin in relation to https://gerrit.wikimedia.org/r/c/operations/puppet/+/702129/2/modules/profile/manifests/analytics/refinery/job/camus.pp#161 [17:45:01] we don't get email alerts anymore, right?
[17:50:02] (03CR) 10Ppchelko: [C: 03+2] Bump jsonschema-tools to 0.10.3 and use skipSchemaTestCases [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703248 (https://phabricator.wikimedia.org/T285006) (owner: 10Ottomata) [17:50:47] (03Merged) 10jenkins-bot: Bump jsonschema-tools to 0.10.3 and use skipSchemaTestCases [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703248 (https://phabricator.wikimedia.org/T285006) (owner: 10Ottomata) [17:52:32] (03CR) 10Ppchelko: [C: 03+2] Rematerialize all schemas wiith enforced numeric bounds [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703249 (https://phabricator.wikimedia.org/T258659) (owner: 10Ottomata) [17:53:05] (03Merged) 10jenkins-bot: Rematerialize all schemas wiith enforced numeric bounds [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703249 (https://phabricator.wikimedia.org/T258659) (owner: 10Ottomata) [17:54:54] (03CR) 10Ppchelko: [C: 03+2] Set shouldGenerateExample: true and rematerialize schemas to get examples everywhere [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703250 (https://phabricator.wikimedia.org/T270134) (owner: 10Ottomata) [17:55:26] (03Merged) 10jenkins-bot: Set shouldGenerateExample: true and rematerialize schemas to get examples everywhere [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703250 (https://phabricator.wikimedia.org/T270134) (owner: 10Ottomata) [18:03:16] ottomata: No checker anymore, so _IMPORTED flags in both folders as needed [18:04:18] ottomata: That also reminds me that there is a difference in behavior between the checker and gobblin: camus-checker was sending an alert email when a topic didn't move - the new system doesn't raise that alert [18:04:38] right [18:04:44] but, how will we know if there is a breakage? [18:05:22] ottomata: we won't - or at least not in the way it was before [18:05:45] i wonder if that will be a problem [18:05:47] ottomata: I don't know how useful this feature has been lately [18:05:53] yeah i'm not sure either [18:06:03] actually hm [18:06:06] for webrequest, we have SLAs etc - for events I don't know [18:06:19] hmm lets see, how would this work with airflow? [18:06:33] i think we'd have a visual cue if data isn't present [18:06:45] for streams with canary events, this will work great [18:06:51] for streams without...same problem we have now i guess [18:07:20] for streams with canaries we'll need a metric of events-imported (or data size) to monitor actual change [18:07:46] right [18:08:06] but thats extra, I mostly want to be alerted about a broken pipeline [18:08:22] like, maybe something is broken on the kafka mirror maker side of things [18:08:25] ottomata: gobblin should also provide similar metrics (imported items etc) [18:08:29] mirror maker stuck [18:08:34] i guess we should just have lag alerts for that [18:08:47] joal: oh? [18:08:50] how can we export those? [18:09:11] that would be nice, then we could just define alerts like we do now for certain topics and/or all streams with canary events [18:09:14] and check the gobblin metrics [18:09:24] yup [18:10:03] sounds like a job for https://wikitech.wikimedia.org/wiki/Prometheus#Ephemeral_jobs_(Pushgateway) [18:10:09] I will make a task [18:10:53] For the moment we have the metrics in files, but there are exporters (graphite, files, kafka, jmx) [18:11:09] We should definitely be able to use PushGateway [18:11:37] it'd be really cool if pushing to Prometheus could work with kafka!
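For reference, the Pushgateway approach linked above (and filed as T286503 just below) boils down to pushing the job's counters over HTTP after each run; Prometheus then scrapes the gateway and alerts can be defined on the resulting series. A minimal sketch; the gateway URL, metric name, and values are placeholders, not the real setup.

```bash
# Placeholder gateway address; the real one would come from the Prometheus/SRE setup.
PUSHGATEWAY=http://prometheus-pushgateway.example.invalid:9091

# Push per-topic imported-record counts for this run (values here are made up),
# grouped by job and instance so repeated runs overwrite the same series.
cat <<'EOF' | curl --silent --data-binary @- "${PUSHGATEWAY}/metrics/job/gobblin_event_default/instance/an-launcher1002"
# TYPE gobblin_records_imported gauge
gobblin_records_imported{topic="codfw.mediawiki.revision-tags-change"} 12345
gobblin_records_imported{topic="eqiad.eventgate-main.error.validation"} 0
EOF
```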
:) [18:11:58] hm, like you push your metrics to Kafka, and prometheus reads from there? [18:12:01] yeah [18:12:02] 10Analytics: Push Gobblin import metrics to Prometheus and add alerts on some critical imports - https://phabricator.wikimedia.org/T286503 (10Ottomata) [18:12:09] anyway, not needed just would be cool [18:12:12] That'd be indeed a great way [18:12:17] maybe hard since prometheus doesn't work with timestamps [18:12:19] it's all in the now [18:12:25] Arf, ok [18:13:32] joal ok, ready to finalize events? [18:13:34] lets see... [18:13:37] sure [18:13:43] we should delete everything < hour 15, right? [18:13:49] in event_gobblin ? [18:13:51] correct ottomata [18:14:03] for the moment refine is done to hour 14 [18:14:07] i'll go ahead and stop jobs [18:14:09] oh interesting. [18:14:19] should we refine hour 15 in camus now? [18:14:21] maybe the job is currently running? [18:14:35] not running atm [18:14:38] lemme stop jobs [18:14:55] i'll stop refine too [18:16:22] joal are we sure all hour 14s refines are 100% done [18:16:23] ? [18:16:35] nope ottomata [18:16:49] also, I'd really prefer if hour 15 would be done [18:17:01] I assume this hour's job has not been launched maybe? [18:17:18] !log stopped puppet and refines and imports for event data on an-launcher1002 in prep for gobblin finalization for event_default job [18:17:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:17:26] joal: lets manually launch it [18:17:52] yeah joal it is supposed to run :20 after the hour [18:17:54] so it was about to go [18:17:59] i'll launch it manually now [18:17:59] right [18:18:02] thanks [18:18:54] ok it's running [18:18:58] application_1623774792907_141596 [18:25:58] joal just checked, hour 14 looks good as far as I can tell [18:26:05] great [18:26:57] joal refine job done [18:27:00] checked for hour=15 too [18:27:01] looks good [18:27:34] and as expected, hour 16 still needs done [18:27:35] ok [18:27:44] so now [18:28:10] joal will you do the data moves? [18:28:20] absent camus job, update gobblin folder in job, data moves, deploy and restart [18:28:25] Yessir - can do [18:28:37] k great, i'm double checking the puppet patch to do ^ [18:28:47] joal we need to delete hour=14 and hour=15, right? [18:28:50] in gobblin data? [18:28:58] or, not move it into place? [18:28:58] ottomata: no, why? [18:29:07] well hour 14 is incomplete [18:29:11] so that one for sure, right? [18:29:18] but hour 15 will be re-refined [18:29:23] which...is ok [18:29:24] ah sorry, yes, hour=14 we need to drop [18:29:35] ottomata: maybe not - depending on timestamps :) [18:29:41] actually [18:29:58] joal i think it will......at least, it did when I did it in the test cluster on friday [18:30:00] but actually [18:30:04] we have the output of the hour 15 refine job [18:30:05] so [18:30:16] it might be interesting to let it re-refine hour 15 with gobblin data [18:30:17] and we can compare [18:30:33] lemme collect the log output for hour 15 datasets that were just refined [18:30:36] ottomata: you wish we move hour=15 refined data somewhere else? [18:30:55] no joal, will just look at the log output of the refine job [18:30:56] and compare [18:31:01] since it prints # of refined records [18:31:25] ack - works for me ottomata - not complicated to move it if you prefer [18:32:11] joal https://gist.github.com/ottomata/a1f5ba7bd538166d4686d247cad19a0d [18:32:15] naw lets keep 15 in [18:32:17] it will be interesting [18:32:20] OH wait [18:32:21] no [18:32:22] wait [18:32:23] yes!
[18:32:24] haha [18:32:25] sorry yes [18:32:25] . [18:32:26] lets keep it in [18:32:42] in, as in moved to a different place? [18:32:50] so that we keep it? [18:33:10] joal as in, just remove hour 14 [18:33:13] then do the directory swap [18:33:19] event -> event_camus, event_gobblin -> event [18:33:40] we can then launch refine and see what happens, i think it will re-refine all the hour 15s again but with the gobblin data [18:33:45] and we can compare and see [18:34:32] ok works for me ottomata [18:35:26] oh, i'll deploy the gobblin job change for the dir swap... [18:36:21] !log Delete /year=2021/month=07/day=12/hour=14 of gobblin imported events [18:36:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:36:25] (03PS1) 10Ottomata: Finalize gobblin migration for event_default job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704152 (https://phabricator.wikimedia.org/T271232) [18:37:00] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Finalize gobblin migration for event_default job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704152 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [18:37:03] !log Move /wmf/data/raw/event to /wmf/data/raw/event_camus and /wmf/data/raw/event_gobblin to /wmf/data/raw/event [18:37:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:37:19] Ok ottomata - move done [18:37:28] k [18:37:36] deploying ^ [18:38:06] joal going to launch manual refine again [18:38:07] oh [18:38:08] no [18:38:14] gotta do puppet first [18:38:14] with new config [18:38:15] ok hang on [18:38:22] that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/703869 [18:38:29] i'll merge that after refinery deploy is done [18:38:56] ack [18:44:51] ok now starting refine [18:44:52] lets see [18:45:20] application_1623774792907_141718 [18:45:40] failed [18:45:42] checking why [18:46:04] misconfig, hang on [18:50:34] ottomata: side note - can you please remind me if there is an "analytics-product" group for HDFS, with correct setup for people etc? [18:51:18] i believe there is a system user, yes [18:52:17] joal related [18:52:17] https://phabricator.wikimedia.org/T285503 [18:53:17] ok refine running now [18:53:17] application_1623774792907_141724 [18:53:59] Thanks for the link ottomata - can you confirm that product-analytics users are part of the group? [18:54:11] they should be yes [18:54:24] meh - actually - let's solve this at a higher level with the task you mentioned [18:54:27] i mean, maybe not all if they don't have ssh access? [18:54:35] hm - right [18:55:16] joal [18:55:16] https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L788-L793 [18:55:42] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Dwisehaupt) [18:57:23] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Dwisehaupt) Still need to confirm the window with Advancement, but it is looking ok right now. There will be some work on the FR-Tech side to ensure d...
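For readers reconstructing the !logged steps above, the finalization amounts to something like the following HDFS commands. This is illustrative only: the actual operations may have been run as a different user or with different tooling, and the glob for the partial hour assumes the per-topic partition layout implied by the delete !log entry.

```bash
# Drop the incomplete hour 14 from the gobblin output (one partition dir per topic;
# layout below is an assumption).
hdfs dfs -rm -r '/wmf/data/raw/event_gobblin/*/year=2021/month=07/day=12/hour=14'

# Swap the directories: keep the camus output aside, and make the gobblin output the
# canonical /wmf/data/raw/event location that Refine reads from.
hdfs dfs -mv /wmf/data/raw/event /wmf/data/raw/event_camus
hdfs dfs -mv /wmf/data/raw/event_gobblin /wmf/data/raw/event
```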
[20:09:08] great stuff joal, event looks good! [20:09:23] eventlogging_legacy still to do, but this is not as easy as in the test cluster because we are not fully migrated [20:09:32] will prep some patches for tomorrow [20:17:49] (03PS1) 10Ottomata: Add gobblin job eventlogging_legacy [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704157 (https://phabricator.wikimedia.org/T271232) [20:18:27] (03CR) 10Ottomata: Add gobblin job eventlogging_legacy (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704157 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [20:21:18] (03PS2) 10Ottomata: Add gobblin job eventlogging_legacy [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704157 (https://phabricator.wikimedia.org/T271232) [21:28:12] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Bstorm) >>! In T286065#7205088, @cmooney wrote: > @BStorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently inc...