[02:21:57] <wikibugs>	 10Analytics-Clusters, 10Analytics-Radar, 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10lmata)
[02:27:18] <wikibugs>	 10Analytics, 10SRE Observability: Indexing errors / malformed logs for aqs on cassandra timeout - https://phabricator.wikimedia.org/T262920 (10lmata)
[02:39:36] <wikibugs>	 10Analytics-Radar, 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10Performance-Team (Radar): Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10lmata)
[07:38:15] <wikibugs>	 (03PS8) 10Martaannaj: Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152
[07:39:33] <wikibugs>	 (03CR) 10Martaannaj: Create wd_propertysuggester/client_ab_testing and wd_propertysuggester/server_ab_testing (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152 (owner: 10Martaannaj)
[07:39:45] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10elukey)
[07:53:30] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] Rematerialize fragment schemas with generated examples. (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/702700 (https://phabricator.wikimedia.org/T270134) (owner: 10Ottomata)
[08:03:53] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[08:04:04] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[08:04:49] <wikibugs>	 10Analytics, 10SRE, 10Tracking-Neverending: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10fgiunchedi)
[08:08:49] <wikibugs>	 10Analytics, 10SRE: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10RhinosF1)
[08:13:13] <wikibugs>	 10Analytics, 10SRE: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Ladsgroup) Is it coming from puppet? It should be migrated to systemd timer if that's the case: {T273673}
[08:36:55] <btullis>	 Morning all.
[08:41:33] <elukey>	 good morning :)
[08:59:57] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10aborrero) >>! In T286065#7194569, @Bstorm wrote:  > @aborrero does cloudgw require manual failover?  it doesn't require manual failover, but we could...
[09:02:15] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10Kormat)
[09:07:07] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[09:07:53] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[09:12:04] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[09:13:25] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[09:15:19] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[09:15:56] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[09:29:30] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10MoritzMuehlenhoff)
[09:38:52] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[10:26:36] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[10:29:13] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney) @BStorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to //cl...
[10:33:10] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[10:33:43] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney) @Bstorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to cl...
[11:34:08] <joal>	 Hi teq,
[11:34:20] <joal>	 team - sorry - wrong layout :)
[11:56:25] <elukey>	 bonjour joal 
[11:56:36] <joal>	 Hi elukey - How are you?
[11:57:14] <joal>	 elukey: feeling like a european-champ? ;)
[11:59:12] <elukey>	 joal: \o/ \o/ \o/
[12:06:12] <btullis>	 I'm having a bit of an issue with the SSO this morning. Can't get into Icinga for example, although I've definitely logged in successfully previously.
[12:06:29] <joal>	 weird btullis :(
[12:06:41] <elukey>	 btullis: what is the issue?
[12:06:52] <btullis>	 Which version of the username should I be using at https://idp.wikimedia.org/login ? btullis, Btullis or BTUllis (WMF) ?
[12:07:37] <elukey>	 It is your Wikimedia developer account, so in theory Btullis
[12:08:03] <elukey>	 the lowercase one is the shell name, the other should be your login for wiki-related things (yes I know confusing :( )
[12:08:53] <joal>	 elukey: Quick qeustion for you - Would you be aware of an issue the Saturday on upload-cache? We experienced a data-loss for 2021-07-10T11:00
[12:09:12] <hnowlan>	 icinga is *very* picky about casing of names - I have to log in as Hnowlan to icinga but I use hnowlan absolutely everywhere else 
[12:10:08] <elukey>	 hnowlan: In theory with CAS it should have been solved :(
[12:11:18] <elukey>	 I tried to log off, login as "Elukey" and enter icinga, it seems working
[12:11:24] <btullis>	 Thanks. I can't get access to any of these at the moment: https://wikitech.wikimedia.org/wiki/Single_Sign_On#What_sites_are_SSO_enabled? although I could access *some* of them prior to last week.
[12:12:08] <btullis>	 Maybe I'll just do the reset password link for now. That takes me to wikitech.
[12:12:35] <elukey>	 btullis: do you get errors while logging in, or after a successful SSO login?
[12:12:43] <elukey>	 (just to understand how to help)
[12:13:51] <btullis>	 During login: from SSO login screen. "Authentication attempt has failed, likely due to invalid credentials. Please verify and try again. "
[12:14:16] <elukey>	 lovely, I just checked LDAP for something weird but it seems good
[12:15:42] <btullis>	 If I go to https://directory.corp.wikimedia.org/ and use my lowercase LDAP username 'btullis' I can log in successfully.
[12:16:24] <elukey>	 ah that one is another LDAP IIRC, it is used by OIT for the gmail accounts etc..
[12:16:42] <elukey>	 the wikimedia developer account should be the one that you also use to login into wikitech
[12:16:45] <elukey>	 (IIRC)
[12:17:13] <elukey>	 that creates the prod LDAP account (cn: Btullins, shell: btullis)
[12:17:21] <elukey>	 err *Btullis
[12:17:51] <elukey>	 the directory.corp.etc.. is basically needed only for the Gsuite 
[12:17:59] <moritzm>	 The authentication for the SSO login is actually case-insensitive, the only part which is not, is Icinga's internal handling of permissions (e.g. to be able to downtime a service)
[12:18:17] <moritzm>	 since those are not read from LDAP, but instead of  a CGI conffile we manage via Puppet
[12:19:04] <moritzm>	 finer technical details at https://phabricator.wikimedia.org/T256656#6266825
[12:19:38] <elukey>	 moritzm: but Ben gets an error at the CAS-SSO level, it may be due to password mismatch (corp LDAP account vs developer account)
[12:19:42] <elukey>	 it is very confusing
[12:21:03] <moritzm>	 ah yes, that OIT username is entirely different
[12:22:16] <btullis>	 OK. Thanks all. I have reset my Wikitech password and I now have passed the SSO login stage and can access Icinga again. It's a different password from the directory.corp LDAP password, so I should be able to understand where it's used now.
[12:23:27] <elukey>	 btullis: perfect!
[12:31:04] <joal>	 !log Rerun failed webrequest hour after having checked that loss was entirely false-positive
[12:31:08] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:35:30] <elukey>	 joal: sorry checking for saturday
[12:35:52] <joal>	 elukey: np - I think it must have been a network glitch
[12:38:27] <elukey>	 joal: I didn't find much, was it confined to a specific dc or global?
[12:40:22] <joal>	 nope, error in upload, warning for text, all DCs - Must have come from some weird unsync I assume? (we have many rows with sequence_id before H ending up in H+1)
[12:42:07] <elukey>	 yes I don't see things on fire around that time
[12:42:49] <joal>	 no big deal elukey - I'll investigate more if this happens more (we have moved to Gobblin last week, shouldn't be related but eh)
[13:15:53] <ottomata>	 hello!
[13:17:19] <ottomata>	 joal shall we start on events?
[13:17:25] <ottomata>	 turn on the gobblin job?
[13:17:41] <joal>	 we surely can do that ottomata - good morning :)
[13:17:52] <ottomata>	 mornin!
[13:18:25] <joal>	 ottomata: I assume the first thing will be to make a job in event_gobblin folder, making sure it contains all expected folders and data for a few hours, and then switch?
[13:18:29] <ottomata>	 yup!
[13:18:37] <ottomata>	 https://gerrit.wikimedia.org/r/c/analytics/refinery/+/703866
[13:19:12] <ottomata>	 hmmm
[13:19:15] <ottomata>	 wrong final.dir
[13:19:18] <ottomata>	 need _gobblin
[13:19:40] <joal>	 reading
[13:20:02] <wikibugs>	 (03PS2) 10Ottomata: Add event_default gobblin job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703866 (https://phabricator.wikimedia.org/T271232)
[13:20:27] <joal>	 ottomata: naming question - Should it be events_default instead of event_default?
[13:21:33] <ottomata>	 i don't think so, the dir and database are called event
[13:21:43] <ottomata>	 and the other one isn't called webrequests
[13:21:44] <ottomata>	 :)
[13:23:18] <joal>	 ottomata: event is a 'bundle' to plenty different streams of events - webrequest would not be such - and 'event' is acutally a DB, while webrequest is a table - But I get your point
[13:24:05] <ottomata>	 hm, good point too, and webrequest itself is really an event
[13:25:12] <ottomata>	 but, webrequest table has many webrequests, and we don't pluralize it
[13:25:13] <ottomata>	 hm
[13:25:46] <ottomata>	 i think i'd keep it the same for consistency, the refine job is refine_event
[13:26:13] <ottomata>	 i maybe have a slight personal preference for avoiding plurals for things like this...but maybe i'm wrong about that and am not personally consistent :)
[13:29:56] <joal>	 ottomata: works for me - no big deal
[13:30:19] <ottomata>	 ok
[13:30:32] <ottomata>	 going to merge and deploy https://gerrit.wikimedia.org/r/c/analytics/refinery/+/703866
[13:30:32] <ottomata>	 and then
[13:30:35] <ottomata>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/703867
[13:32:04] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add event_default gobblin job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703866 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata)
[13:37:51] <wikibugs>	 (03CR) 10Joal: "One comment about number of mappers." (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703866 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata)
[13:37:57] <ottomata>	 OH JOAL
[13:37:58] <ottomata>	 sorry
[13:38:03] <joal>	 arf - ok too slow :)
[13:38:20] <ottomata>	 thought your message above was approval, sorry!
[13:38:34] <joal>	 sorry I was not monitoring IRC while writing my comment :)
[13:38:39] <joal>	 nothing important though
[13:38:55] <mforns>	 hi team!
[13:38:57] <joal>	 we can move forward as is, I'll monitor runs and suggest changes based on monitoring
[13:39:01] <joal>	 hi mforns :)
[13:39:05] <ottomata>	 joal do you want to reduce to just keep the max number of mappers low to reduce the amount of capacity the job takes up in the cluster?
[13:39:30] <ottomata>	 i had thought that max mappers just meant max, so it would only use them if it needed, and it would unlikley be running that many at a time?
[13:39:35] <ottomata>	 hi mforns 1
[13:39:36] <ottomata>	 !
[13:39:45] <joal>	 ottomata: yes, and to take advantage of single-jvm with multiple tasks
[13:39:50] <mforns>	 :D
[13:40:08] <ottomata>	 ok lets reduce its not hard to do
[13:40:25] <joal>	 ottomata: by default gobblin will use as many mappers as affordable from the number of tasks
[13:40:45] <joal>	 so if we wish to take advantage of multi-tasks within mappers, we need to reduce
[13:41:05] <joal>	 ottomata: we can also do it later, after having looked at mapper duration - if duration is very small, we can reduce :)
[13:42:55] <ottomata>	 lets do it now, very easy
[13:45:39] <wikibugs>	 (03PS1) 10Ottomata: Set number of max mappers for gobblin event_default to 128 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704117
[13:45:43] <ottomata>	 joal ^
[13:45:58] <wikibugs>	 (03CR) 10Ottomata: Add event_default gobblin job (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/703866 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata)
[13:49:11] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Set number of max mappers for gobblin event_default to 128 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704117 (owner: 10Ottomata)
[13:55:20] <wikibugs>	 (03PS1) 10Ottomata: gobbin event_default - Fix typo [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704118 (https://phabricator.wikimedia.org/T271232)
[13:55:36] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] gobbin event_default - Fix typo [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704118 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata)
[13:59:31] <joal>	 ottomata: sorry I'm not concentrated - Will follow you closely from now on :S
[13:59:48] <ottomata>	 np!
[13:59:51] <ottomata>	 initiating the first run
[14:00:00] <joal>	 Ack ottomata - will check folders
[14:00:23] <joal>	 ottomata: you do it manually, or by timer?
[14:01:19] <ottomata>	 timer
[14:01:29] <ottomata>	 Emitting WorkUnitsCreated Count: 33
[14:01:33] <ottomata>	 with things like 
[14:01:37] <ottomata>	 MultiWorkUnit 32: estimated load=0.004771, partitions=[[eqiad.eventgate-main.error.validation-0], [eqiad.w3c.reportingapi.network_error-0], [codfw.mediawiki.revision-tags-change-0]]
[14:01:59] <ottomata>	 interesting!
[14:01:59] <ottomata>	  Min load of multiWorkUnit = 0.003010; Max load of multiWorkUnit = 0.004771; Diff = 36.907025%
[14:02:09] <ottomata>	 what is the load estimation?
[14:02:13] <ottomata>	 # of partitions?
[14:02:19] <ottomata>	 or some account of volume?
[14:02:21] <joal>	 number of events and partitions
[14:02:30] <joal>	 here, no events, so very low load
[14:02:35] <ottomata>	 oh right
[14:02:42] <ottomata>	 ok looks like first run finished
[14:02:47] <ottomata>	 ok if I trigger 2nd?
[14:03:09] <joal>	 sure, please do
[14:03:16] <ottomata>	 k
[14:04:04] <ottomata>	 ah ha
[14:04:04] <ottomata>	 Min load of multiWorkUnit = 0.003010; Max load of multiWorkUnit = 51185.033253; Diff = 99.999994%
[14:04:22] <ottomata>	 nice it eally does seem to do that very smartly!
[14:04:33] <ottomata>	 looking briefly at the assignent of topic partitions to work units
[14:04:37] <ottomata>	 small ones are grouped together
[14:04:41] <ottomata>	 large ones get dedicated units
[14:04:53] <joal>	 ottomata: I have looked at task-sizing estimators etc and it is modular so that we can change strategies etc
[14:05:06] <ottomata>	 the smallest workunit with more than one topic partition is
[14:05:07] <ottomata>	 MultiWorkUnit 17: estimated load=0.004771, partitions=[[eqiad.mediawiki.page-move-0], [codfw.test.instrumentation-0], [codfw.ios.edit_history_compare-0]]
[14:05:20] <ottomata>	 which is good
[14:05:24] <ottomata>	 very cool
[14:05:55] <joal>	 ottomata: with this approach, the probability of starvation due to big topics is low
[14:06:04] <ottomata>	 yeah
[14:06:05] <ottomata>	 indeed!
[14:06:06] <ottomata>	 very cool
[14:06:20] <ottomata>	 wellllll depends on how big and how long the final import takes
[14:06:44] <joal>	 ottomata: I'm happy we find some positive aspects to gobblin :) Would it be only trouble it'd be a shame :)
[14:06:50] <ottomata>	 if the biggest toppar  takes longer than the period between launching jobs (in this case an hour)_
[14:06:53] <ottomata>	 then it could
[14:07:08] <ottomata>	 but i think with event sizes so far, this shoudl be ok
[14:07:13] <joal>	 yes true ottomata
[14:07:13] <ottomata>	 cool so the 2nd run finished and wrote data
[14:07:16] <ottomata>	 lets look at dirs!
[14:07:19] <joal>	 yup
[14:07:53] <ottomata>	 looks great
[14:08:15] <joal>	 ottomata: I don't know how we should look at those to monitor correctness though - We have no easy way to compare to different folders, right?
[14:08:17] <ottomata>	 ok, so after meetings today maybe we can finalize
[14:08:28] <joal>	 works for me ottomata 
[14:08:29] <ottomata>	 eh?
[14:08:42] <ottomata>	 we could count # of records once we get a full hour
[14:08:47] <joal>	 ottomata: For instance, checking that no folder is left-over
[14:08:49] <ottomata>	 and compare toe vent
[14:08:50] <ottomata>	 right?
[14:08:52] <ottomata>	 oh
[14:08:56] <ottomata>	 that we are getting them all?
[14:09:02] <joal>	 for sure we can that - I was more thinking in term of strams
[14:09:06] <ottomata>	 right
[14:09:07] <ottomata>	 hm
[14:09:27] <ottomata>	 hmm, we should be able to check once we get a full hour
[14:09:36] <ottomata>	 since both dirs should have worked on the same streams
[14:09:40] <ottomata>	 minus mediawiki.job
[14:10:09] <ottomata>	 that's a good point though lets make sure to do that firsrt
[14:10:26] <ottomata>	 get a full hour, get the list of topics imported by camus and gobblin in those dirs
[14:10:28] <ottomata>	 and compare
[14:10:36] <ottomata>	 it'll be a visual comparision, just making sure it looks right
[14:13:52] <joal>	 ack ottomata 
[14:14:08] <joal>	 let's wait for some time before finalizing
[14:14:21] <joal>	 ottomata: the job runs every hour?
[14:19:02] <ottomata>	 ya
[14:19:11] <ottomata>	 joal lets do after meetings today
[14:19:20] <ottomata>	 maybe about 1pm my time?
[14:19:45] <ottomata>	 2h 40 mins ish from now?
[14:20:04] <joal>	 hok ottomata :)
[14:30:05] <wikibugs>	 (03CR) 10Ottomata: "I don't have a great alternative suggestion, but the name of these schemas and streams seems like they could be a little more descriptive." (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/689152 (owner: 10Martaannaj)
[14:36:55] <mforns>	 joal: recalling airflow work from week before last, I managed to make spark-sql work with a custom operator, but noticed that both Airflow's SparkSql operator and the command line spark-sql command do not support executing queries with --client-mode cluster, thus IIUC the reduce operations would be executed in the airflow node. I understand this is a con of spark-sql vs hive, no?
[14:37:52] <joal>	 mforns: client-mode makes the spark driver being executed on cluster vs client (not the reduce)
[14:38:18] <mforns>	 isn't the reduce executed in the driver?
[14:38:35] <joal>	 mforns: this is a downside nonetheless, as for some jobs drivers are big
[14:38:45] <mforns>	 understand
[14:39:14] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, 10Patch-For-Review: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata) @Pchelolo, @colewhite Q: I'm getting close to getting this working, but I seem to be miss...
[14:39:15] <joal>	 mforns: reduce is a parallel step, it's executed on workers usually
[14:39:29] <mforns>	 I see, thanks
[14:40:33] <ottomata>	 mforns:  i think usually as long as the driver isn't pulling data down and manipulating it locally, it'd be ok.  
[14:40:45] <ottomata>	 for just sql queries, i don't think that would (could?) happen
[14:40:54] <mforns>	 I also imagine not!
[15:04:16] <ottomata>	 a-team standup!
[15:23:39] <elukey>	 ottomata: o/ I am going to deploy a change for the ML cluster, is it ok if I skip the SRE sync? Happy to answer anything on IRC later on of course
[15:28:50] <elukey>	 (I can join with 10 mins of delay otherwise)
[15:32:03] <ottomata>	 elukey:  sure! 
[15:37:17] <elukey>	 ottomata: some trouble to work on sorry, will be available on IRC if needed :(
[15:38:02] <wikibugs>	 10Analytics-Radar, 10SRE, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Ottomata) Hiya, checking in!  We'd love to move on {T275767}, any new ETA?  Thanks!
[15:38:25] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, 10Patch-For-Review: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10colewhite) @Ottomata this metric is part of service-template node, but is not yet merged: https://g...
[15:41:37] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (10Ottomata) a:03BTullis
[15:43:58] <wikibugs>	 10Analytics-Clusters, 10User-razzi: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10Ottomata) a:05razzi→03BTullis
[15:45:31] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10MoritzMuehlenhoff)
[15:46:56] <wikibugs>	 10Analytics-Clusters, 10User-MoritzMuehlenhoff: Reduce manual kinit frequency on stat100x hosts - https://phabricator.wikimedia.org/T268985 (10Ottomata) a:03BTullis
[15:51:57] <wikibugs>	 10Analytics-Clusters: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10Ottomata) a:05razzi→03None
[15:52:27] <wikibugs>	 10Analytics-Clusters: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10BTullis) a:03BTullis
[15:53:05] <wikibugs>	 10Analytics-Clusters: Upgrade Matomo to latest upstream - https://phabricator.wikimedia.org/T275144 (10Ottomata) a:05razzi→03BTullis
[15:53:52] <wikibugs>	 10Analytics-Clusters, 10Patch-For-Review: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (10Ottomata) a:05razzi→03None
[15:55:35] <wikibugs>	 10Analytics, 10Analytics-Kanban: Fix default ownership and permissions for Hive managed databases in /user/hive/warehouse - https://phabricator.wikimedia.org/T280175 (10Ottomata) a:03Ottomata
[15:59:22] <wikibugs>	 10Analytics, 10Analytics-Kanban: Data drifts between superset_production on an-coord1001 and db1108 - https://phabricator.wikimedia.org/T279440 (10Ottomata) a:05razzi→03None
[16:03:10] <wikibugs>	 10Analytics, 10Analytics-Kanban: Fix gobblin not writing _IMPORTED flags when runs don't overlap hours - https://phabricator.wikimedia.org/T286343 (10odimitrijevic) p:05Triage→03High Not a prerequisite for the gobblin migration
[16:13:51] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Pageviews-Anomaly: Analyse possible bot traffic for ptwiki article Ambev - https://phabricator.wikimedia.org/T282502 (10odimitrijevic) p:05Triage→03High a:03Milimetric
[16:28:05] <joal>	 ottomata: heya - ready when you wish for gobblin finalization
[16:28:23] <joal>	 ottomata: I'm gonna start looking at some data to see if it seems correct
[16:31:41] <ottomata>	 ok gr8!
[16:33:03] <wikibugs>	 10Analytics: Refactor analytics-meta MariaDB layout to multi instance with failover - https://phabricator.wikimedia.org/T284150 (10Ottomata)
[16:33:05] <wikibugs>	 10Analytics, 10Analytics-Kanban: Data drifts between superset_production on an-coord1001 and db1108 - https://phabricator.wikimedia.org/T279440 (10Ottomata)
[16:39:57] <joal>	 ok ottomata - I tested two folders (codfw_mediawiki_revision-create and codfw_wdqs-external_sparql-query), and they have the same number of rows from camus and gobblin
[16:40:51] <ottomata>	 great!
[16:41:25] <joal>	 ottomata: let's find a way to check for number-of-streams correctness, and then it's all good :)
[16:50:23] <ottomata>	 joal https://gist.github.com/ottomata/a3a624818df9a9fcfc705a4359bd5b11
[16:50:56] <joal>	 \o/
[16:51:17] <joal>	 ok ottomata - let's wait for hour 16 to be finished maybe?
[16:52:02] <ottomata>	 ok, then we can delete hour 14 and 15, right?
[16:52:12] <ottomata>	 hm
[16:52:23] <joal>	 ottomata: let's keep hour 15
[16:52:26] <ottomata>	 refine doesn't start --until 2
[16:52:29] <wikibugs>	 10Analytics, 10Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (10mpopov)
[16:52:32] <joal>	 ottomata: currently checking where refine is
[16:52:48] <ottomata>	 hm do we kinda need to wait for hour 17?
[16:52:57] <ottomata>	 we could force refine for hour 15 and 16 asap
[16:54:33] <joal>	 ottomata: I'd feel more confident if we had reach refined data from camus in gobblin (meaning refine having done hour 15)
[16:55:11] <joal>	 ottomata: I think this will be the case in as bit more than 1H
[16:55:30] <joal>	 if ok for you, I go and have diner with the kids now, and then we proceed?
[16:55:35] <joal>	 as you wish ottomata 
[16:55:55] <ottomata>	 joal sounds good
[16:56:05] <ottomata>	 right yeah i htink so too
[16:56:13] <ottomata>	 we can make refine run early once 15 is done for cmaus
[16:56:16] <ottomata>	 camus and do it
[16:56:24] <ottomata>	 go ahead and lets do that after you done with kids ya!
[16:56:38] <joal>	 ok let's do it this way then - Let's recombine 1h+ - thanks ottomata :)
[16:56:49] <wikibugs>	 10Analytics, 10Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (10mpopov)
[17:05:14] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, 10Patch-For-Review: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata) Oh great, thanks Cole.  What do we need to do to get that merged and released?
[17:23:26] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, 10Patch-For-Review: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata) NM, petr answered on PR. Working on it.
[17:44:46] <ottomata>	 joal, remind me of what will change with gobblin in relation to https://gerrit.wikimedia.org/r/c/operations/puppet/+/702129/2/modules/profile/manifests/analytics/refinery/job/camus.pp#161
[17:45:01] <ottomata>	 we don't get email alerts anymore, right?
[17:50:02] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+2] Bump jsonschema-tools to 0.10.3 and use skipSchemaTestCases [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703248 (https://phabricator.wikimedia.org/T285006) (owner: 10Ottomata)
[17:50:47] <wikibugs>	 (03Merged) 10jenkins-bot: Bump jsonschema-tools to 0.10.3 and use skipSchemaTestCases [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703248 (https://phabricator.wikimedia.org/T285006) (owner: 10Ottomata)
[17:52:32] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+2] Rematerialize all schemas wiith enforced numeric bounds [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703249 (https://phabricator.wikimedia.org/T258659) (owner: 10Ottomata)
[17:53:05] <wikibugs>	 (03Merged) 10jenkins-bot: Rematerialize all schemas wiith enforced numeric bounds [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703249 (https://phabricator.wikimedia.org/T258659) (owner: 10Ottomata)
[17:54:54] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+2] Set shouldGenerateExample: true and rematerialize schemas to get examples everywhere [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703250 (https://phabricator.wikimedia.org/T270134) (owner: 10Ottomata)
[17:55:26] <wikibugs>	 (03Merged) 10jenkins-bot: Set shouldGenerateExample: true and rematerialize schemas to get examples everywhere [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/703250 (https://phabricator.wikimedia.org/T270134) (owner: 10Ottomata)
[18:03:16] <joal>	 ottomata: No checker anymore, so _IMPORTED flags in both folders as needed
[18:04:18] <joal>	 ottomata: The also reminds me that there is a difference of behavior between checker and gobblin: camus-checker was sending alert email when a topic doesn't move - this behavior doesn't raise alert with the new system
[18:04:38] <ottomata>	 right
[18:04:44] <ottomata>	 but, how will we know if there is a breakage?
[18:05:22] <joal>	 ottomata: we won't - or at least not in the way it was before
[18:05:45] <ottomata>	 i wonder if that will be aproblem
[18:05:47] <joal>	 ottomata: I don't realize how useful this feature has been lately
[18:05:53] <ottomata>	 yeah i'm not sure either
[18:06:03] <ottomata>	 actually hm
[18:06:06] <joal>	 for webrequest, we have SLAs etc - for events I don't know
[18:06:19] <ottomata>	 hmm  lets see how would this work with airflow?
[18:06:33] <ottomata>	 i think we'd have a visual queue if data isn't present
[18:06:45] <ottomata>	 for streams with canary events, this will work great
[18:06:51] <ottomata>	 for streams wtihout...same problem we have now i guess
[18:07:20] <joal>	 for stream with canaries we'll need a metric of events-imported (or data size) to be monitor actual change
[18:07:46] <ottomata>	 right
[18:08:06] <ottomata>	 but thats extra,  I mostly wantt o be aleted about broken pipeline
[18:08:22] <ottomata>	 like, maybe something is broken on the kafka mirror maker side of things
[18:08:25] <joal>	 ottomata: gobblin should also provide similar metrics (imported items etc)
[18:08:29] <ottomata>	 mirror maker stuck
[18:08:34] <ottomata>	 i guess we shoud just have lag alerts for that
[18:08:47] <ottomata>	 joal:  oh?
[18:08:50] <ottomata>	 how can we export those?
[18:09:11] <ottomata>	 that woudl be nice, then we could just define alerts like we do now for certain topics and/or all streams with canary events
[18:09:14] <ottomata>	 and check the gobblin metrics
[18:09:24] <joal>	 yup
[18:10:03] <ottomata>	 sounds like a job for https://wikitech.wikimedia.org/wiki/Prometheus#Ephemeral_jobs_(Pushgateway)
[18:10:09] <ottomata>	 I will make a task
[18:10:53] <joal>	 For the moment we have the metrics in files, but there exporters (graphite, files, kafka, jmx)
[18:11:09] <joal>	 We should definitely be able to use PushGateway
[18:11:37] <ottomata>	 it'd be really cool if pushing to Prometheus could work with kafka! :)
[18:11:58] <joal>	 hm, like you push your metrics to Kafka, and prometheus reads from there?
[18:12:01] <ottomata>	 yeah
[18:12:02] <wikibugs>	 10Analytics: Push Gobblin import metrics to Prometheus and add alerts on some critical imports - https://phabricator.wikimedia.org/T286503 (10Ottomata)
[18:12:09] <ottomata>	 anyway, not needed just would be cool
[18:12:12] <joal>	 That'd be indeed a great way
[18:12:17] <ottomata>	 maybe hard since prometheus doesn't work with timestamps
[18:12:19] <ottomata>	 its all in the now
[18:12:25] <joal>	 Arf, ok
[18:13:32] <ottomata>	 joal ok, ready to do finalize events?
[18:13:34] <ottomata>	 lets see...
[18:13:37] <joal>	 sure
[18:13:43] <ottomata>	 we should delete everything < hour 15, rght?
[18:13:49] <ottomata>	 in event_gobblin ?
[18:13:51] <joal>	 correct ottomata 
[18:14:03] <joal>	 for the moment refine is done to hour 14
[18:14:07] <ottomata>	 i'll go ahead and stop jobs
[18:14:09] <ottomata>	 oh interesting.
[18:14:19] <ottomata>	 should we refine hour 15 in camus now?
[18:14:21] <joal>	 maybe the job is currently in the runnining?
[18:14:35] <ottomata>	 not running atm
[18:14:38] <ottomata>	 lemme stop jobs
[18:14:55] <ottomata>	 i'll stop refine too
[18:16:22] <ottomata>	 joal are we sure all hour 14s refines are 100% done
[18:16:23] <ottomata>	 ?
[18:16:35] <joal>	 nope ottomata 
[18:16:49] <joal>	 also, I'd really prefer if hour 15 would be done
[18:17:01] <joal>	 I assume this hour job has not been launched maybe?
[18:17:18] <ottomata>	 !log stopped puppet and refines and imports for event data on an-launcher1002 in prep for gobblin finalization for event_default job
[18:17:20] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:17:26] <ottomata>	 joal:  lets manually launch it
[18:17:52] <ottomata>	 yeah joal it is supppsed to run :20 after the hour
[18:17:54] <ottomata>	 so it was about to go
[18:17:59] <ottomata>	 i'll launch it manually now
[18:17:59] <joal>	 right
[18:18:02] <joal>	 thanks
[18:18:54] <ottomata>	 ok its running
[18:18:58] <ottomata>	 application_1623774792907_141596
[18:25:58] <ottomata>	 joal just checked, hour 14 looks good as far as I can tell
[18:26:05] <joal>	 great
[18:26:57] <ottomata>	 joal reifne job done
[18:27:00] <ottomata>	 checked for hour=15 too
[18:27:01] <ottomata>	 looks good
[18:27:34] <ottomata>	 and as expected, hour 16 still needs done
[18:27:35] <ottomata>	 ok
[18:27:44] <ottomata>	 so now
[18:28:10] <ottomata>	 joal will you do the data moves?
[18:28:20] <joal>	 absent camus job, update gobblin folder in job, data moves, deploy and restart
[18:28:25] <joal>	 Yessir - can do
[18:28:37] <ottomata>	 k great, i'm double checking the puppet patch to do ^
[18:28:47] <ottomata>	 joal we need to delete hour=14 nad hour=15, right?
[18:28:50] <ottomata>	 in gobblin data?
[18:28:58] <ottomata>	 or, not move it into place?
[18:28:58] <joal>	 ottomata: no, why?
[18:29:07] <ottomata>	 well hour 14 is incomplete
[18:29:11] <ottomata>	 so that one for sure, right?
[18:29:18] <ottomata>	 but hour 15 will be re-refined
[18:29:23] <ottomata>	 which...is ok
[18:29:24] <joal>	 ah sorry, yes, hour=14 we need to drop
[18:29:35] <joal>	 ottomata: maybe not - depending on timestamps :)
[18:29:41] <ottomata>	 acutally
[18:29:58] <ottomata>	 joal i think it will......at least, it did when I did it in test cluster on friday
[18:30:00] <ottomata>	 but actually
[18:30:04] <ottomata>	 we have the output of the hour 15 refine job
[18:30:05] <ottomata>	 so
[18:30:16] <ottomata>	 it might be interesting to let it re-refine hour 15 with gobblin data
[18:30:17] <ottomata>	 and we can compare
[18:30:33] <ottomata>	 lemme collect the log output for hour 15 datasets that were just refined
[18:30:36] <joal>	 ottomata: you wish we move hour=15 refined data somewhere else?
[18:30:55] <ottomata>	 no joal will just look at log output of the refinie job
[18:30:56] <ottomata>	 and compare
[18:31:01] <ottomata>	 since it prints # of refnied records
[18:31:25] <joal>	 ack - works for me ottomata  - not complicated to move it if you prefer
[18:32:11] <ottomata>	 joal https://gist.github.com/ottomata/a1f5ba7bd538166d4686d247cad19a0d
[18:32:15] <ottomata>	 naw lets keep 15 in
[18:32:17] <ottomata>	 it will be interesting
[18:32:20] <ottomata>	 OH wait
[18:32:21] <ottomata>	 no
[18:32:22] <ottomata>	 wait
[18:32:23] <ottomata>	 yes!
[18:32:24] <ottomata>	 haha
[18:32:25] <ottomata>	 sorry yes
[18:32:25] <ottomata>	 .
[18:32:26] <ottomata>	 lets keep it in
[18:32:42] <joal>	 in, as is moved in a different place?
[18:32:50] <joal>	 so that we keep it?
[18:33:10] <ottomata>	 joal as in, just remove hour 14
[18:33:13] <ottomata>	 then do the directory swa
[18:33:19] <ottomata>	 event -> event_camus, event_gobblin -> event
[18:33:40] <ottomata>	 we can then launch refine and see what happens, i think it will re-refine all the hour 15s again but with the gobblin data
[18:33:45] <ottomata>	 and we can compare and see
[18:34:32] <joal>	 ok works for me ottomata 
[18:35:26] <ottomata>	 oh, i'll deploy gobblni job change for dir swap...
[18:36:21] <joal>	 !log Delete /year=2021/month=07/day=12/hour=14 of gobblin imported events
[18:36:23] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:36:25] <wikibugs>	 (03PS1) 10Ottomata: Finalize gobblin migration for event_default job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704152 (https://phabricator.wikimedia.org/T271232)
[18:37:00] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Finalize gobblin migration for event_default job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704152 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata)
[18:37:03] <joal>	 !log Move /wmf/data/raw/event to /wmf/data/raw/event_camus and /wmf/data/raw/event_gobblin to /wmf/data/raw/event
[18:37:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:37:19] <joal>	 Ok ottomata - move done
[18:37:28] <ottomata>	 k
[18:37:36] <ottomata>	 deploying ^
[18:38:06] <ottomata>	 joal going to launch manual refine again
[18:38:07] <ottomata>	 oh
[18:38:08] <ottomata>	 no
[18:38:14] <ottomata>	 gotta do puppet first
[18:38:14] <ottomata>	 with new config
[18:38:15] <ottomata>	 ok hang on
[18:38:22] <ottomata>	 that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/703869
[18:38:29] <ottomata>	 i'll merge that after refinery deploy done
[18:38:56] <joal>	 ack
[18:44:51] <ottomata>	 ok now starting refine
[18:44:52] <ottomata>	 lets see
[18:45:20] <ottomata>	 application_1623774792907_141718
[18:45:40] <ottomata>	 failed
[18:45:42] <ottomata>	 checking why
[18:46:04] <ottomata>	 mis config hang on
[18:50:34] <joal>	 ottomata: side note - can you please remind me if there is a "analytics-product" group for HDFS, with correct setup for people etc?
[18:51:18] <ottomata>	 i believe there is a system user, yes
[18:52:17] <ottomata>	 joal related
[18:52:17] <ottomata>	 https://phabricator.wikimedia.org/T285503
[18:53:17] <ottomata>	 ok refine running now
[18:53:17] <ottomata>	 application_1623774792907_141724
[18:53:59] <joal>	 Thanks for the link ottomata - can you confirm that product-analytics users are part of the group?
[18:54:11] <ottomata>	 they should be yes
[18:54:24] <joal>	 meh - actually -let's solve this at higher level with the task you mentionned
[18:54:27] <ottomata>	 i mean, maybe not all if they don't have ssh access?
[18:54:35] <joal>	 hm -right
[18:55:16] <ottomata>	 joal
[18:55:16] <ottomata>	 https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L788-L793
[18:55:42] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Dwisehaupt)
[18:57:23] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Dwisehaupt) Still need to confirm the window with Advancement, but it is looking ok right now. There will be some work on the FR-Tech side to ensure d...
[20:09:08] <ottomata>	 great stuff joal, event looks good!~
[20:09:23] <ottomata>	 eventlogging_legacy still to do, but this is not as easy as in the test cluster because we are not fully migratetd
[20:09:32] <ottomata>	 will prep some patches for tomorrow
[20:17:49] <wikibugs>	 (03PS1) 10Ottomata: Add gobblin job eventlogging_legacy [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704157 (https://phabricator.wikimedia.org/T271232)
[20:18:27] <wikibugs>	 (03CR) 10Ottomata: Add gobblin job eventlogging_legacy (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704157 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata)
[20:21:18] <wikibugs>	 (03PS2) 10Ottomata: Add gobblin job eventlogging_legacy [analytics/refinery] - 10https://gerrit.wikimedia.org/r/704157 (https://phabricator.wikimedia.org/T271232)
[21:28:12] <wikibugs>	 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10Bstorm) >>! In T286065#7205088, @cmooney wrote: > @BStorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently inc...