[02:34:04] 10Analytics, 10Analytics-Wikistats: Translations? - https://phabricator.wikimedia.org/T287661 (10Sabeloga) Nice, thanks, much appreciated! :D
[06:24:05] razzi: re: > "restbase" is the aqs cassandra cluster, right?
[06:25:19] razzi: Nope :) In the cassandra cluster's dashboard you need to select the analytics datasource (we have a separate prometheus instance) and then you'll see "aqs" as cluster (that should also be present among the options of the cassandra cookbook)
[06:25:47] razzi: the restbase cluster is the one managed by SRE
[06:26:31] for sre.aqs.roll-restart aqs
[06:26:49] we basically use the canary to safely test the new druid mw history snapshot
[06:27:14] (so the cookbook depools one aqs node, restarts nodejs and asks the operator to test locally)
[06:27:38] if you have doubts/etc.. ping me anytime!
[07:58:14] Hi, is the watchlist dump available? There is information about this table (https://www.mediawiki.org/wiki/Manual:Watchlist_table) but I cannot find the dump
[08:12:59] wences91: watchlist contents are considered private, so contents of it are not available in any public dumps
[08:14:04] ok, thanks! majavah
[09:53:29] (03CR) 10Svantje Lilienthal: "This change is ready for review." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: 10Svantje Lilienthal)
[10:07:07] 10Analytics, 10Analytics-Kanban, 10Platform Engineering, 10Research, 10Patch-For-Review: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (10BTullis) @Ottomata - FYI I spotted this on an-test-coord1001 this morning. ` Warning: /Stage[main]/Profil...
[10:12:28] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10BTullis) Here are some fragments the first puppet run on an-test-coord1001.eqiad.wmnet after the patch was merged. I'm concerned...
[10:17:30] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10BTullis) The service was restarted automatically, but a Kerberos related error was generated in the logs. ` Jul 30 09:06:34 an-te...
[10:28:04] btullis: o/
[10:28:20] Hi elukey.
[10:28:33] I think that http.server.authentication.krb5.principal-hostname may not be needed, Presto IIRC tends to be really upset about unused configs
[10:28:43] does it work if you remove it manually and restart?
[10:28:51] (curious now :P)
[10:29:23] I will try now. I think that parameter was added in a version of presto later than ours.
[10:29:26] > http.server.authentication.krb5.principal-hostname was added in Presto 302
[10:29:33] ahhh there you go
[10:29:47] just to add confusion, there are two Prestos out there
[10:29:59] 1) Prestodb (the one from facebook that we use)
[10:30:05] 2) PrestoSql, now called "Trino"
[10:30:22] and they have completely different configs and docs
[10:31:45] It appears that there is still an issue with the setting removed.
[10:31:48] > Jul 30 10:30:53 an-test-coord1001 presto-server[45006]: 2021-07-30T10:30:53.277Z ERROR Announcer-2 com.facebook.airlift.discovery.client.Announcer Service announcement failed after 37.13ms. Next request will happen within 1000.00ms
[10:32:49] > uncer-0 com.facebook.airlift.discovery.client.Announcer Cannot connect to discovery server for announce: Announcement failed for https://analytics-test-presto.eqiad.wmnet:8281
[10:33:22] same error on an-test-presto1001
[10:33:48] I bet that there is a TLS error
[10:33:54] Ah right. I was looking at Trino. Will look again at the facebook one.
[10:34:34] the /etc/presto/log.properties has INFO for logging, maybe DEBUG could give us more info, but it will spam a lot :)
[10:34:52] Yes, if it's TLS it might be related to the permissions of the certificate files. Do you think I need to revert while I investigate, or is it safe for me to work on the test cluster like this?
[10:35:00] it is fine
[10:35:29] going afk for lunch, have fun :)
[10:35:38] Will do. Thanks.
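A minimal sketch of the DEBUG suggestion above: PrestoDB's /etc/presto/log.properties uses a name=level format, but the exact logger name for the airlift discovery client is a guess based on the class shown in the error, so treat this as illustrative rather than the change that was actually made.

    # on an-test-coord1001, raise logging for the discovery/announcer code path
    echo 'com.facebook.airlift.discovery=DEBUG' | sudo tee -a /etc/presto/log.properties
    sudo systemctl restart presto-server
    # then watch for the underlying TLS error behind the announcement failures
    sudo journalctl -u presto-server -f | grep -iE 'ssl|tls|announce'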
[10:52:49] (03CR) 10Awight: added template wizard sessions (038 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: 10Svantje Lilienthal)
[10:57:29] Looks like the `sslcert::x509_to_pkcs12` didn't fire properly here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/presto/server.pp#110
[10:57:56] ...because /etc/presto/ssl/server.p12 didn't get recreated as it was supposed to and still contains only the puppet certificate.
[11:55:16] The certificate wasn't generated because it failed the 'unless' test here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/sslcert/manifests/x509_to_pkcs12.pp#31 - i.e. it won't overwrite an existing valid keystore.
[12:05:31] 10Analytics-Clusters, 10Analytics-Kanban, 10User-MoritzMuehlenhoff: Reduce manual kinit frequency on stat100x hosts - https://phabricator.wikimedia.org/T268985 (10MoritzMuehlenhoff) Some random thoughts here, some of those are wild guesses/wishful thinking since I haven't looked at krenew in detail yet :-)...
[12:47:24] btullis: ahh nice! Does it work now??
[12:49:50] Yes, I think so. The presto coordinator starts when I remove `http.server.authentication.krb5.principal-hostname` and when I manually execute the `openssl pkcs12` command that I had intended puppet to run.
[12:51:14] However, I forgot to set `profile::presto::server::generate_certificate: true` on an-test-presto1001, so that is still trying to use the puppet certificate.
[12:51:19] btullis: if there is an issue for x509_to_pkcs12 can you raise a bug, however I'm not sure I'll be able to get to it today (which is my last day before vacation)
[12:55:29] Cool, thanks jbond: I can make an iterative patch to get presto working in the test cluster anyway, with a manual delete/move of the .p12 file. Would you prefer that I just raise a bug for x509_to_pkcs12 or have a go at a patch too?
[12:56:17] I don't think that there would be any hurry to merge it, so you could look at it when you're back anyway.
[12:58:05] sure if you want to have a go at the patch please do :)
[13:00:42] Cool, thanks. I'll do the presto fixes first and check that everything else is OK with this method. When I create a bug report, do I tag it with Infrastructure Foundations?
[13:01:49] if you tag it puppet it should automatically add the Infrastructure Foundations one (you can of course also add it manually) also please add me
[13:02:59] 👍
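For reference, the manual step described above (moving the stale keystore aside and regenerating /etc/presto/ssl/server.p12 so the puppet `unless` check no longer blocks it) would look roughly like this. The certificate and key paths are placeholders, not the actual paths puppet uses, and the export password handling is simplified.

    # move the stale keystore aside so it can be regenerated
    sudo mv /etc/presto/ssl/server.p12 /etc/presto/ssl/server.p12.old
    # bundle the new certificate and key into a PKCS#12 keystore
    # (/path/to/... are placeholders for the real cert/key locations)
    sudo openssl pkcs12 -export \
        -in /path/to/analytics-test-presto.crt \
        -inkey /path/to/analytics-test-presto.key \
        -name presto -out /etc/presto/ssl/server.p12 \
        -passout pass:changeit
    sudo systemctl restart presto-server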
[13:03:14] If I was using kafka-main1001.eqiad.wmnet to look at jobs when we were in eqiad, any idea what host i should be using now we are in codfw?
[13:07:54] addshore: I would guess at `kafka-main2001.codfw.wmnet` (from here: https://github.com/wikimedia/puppet/blob/HEAD/hieradata/common.yaml#L644)
[13:08:31] (03PS3) 10Svantje Lilienthal: added template wizard sessions [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578)
[13:08:57] btullis: thanks, that indeed looks right
[13:09:18] a pleasure
[13:12:15] (03CR) 10Svantje Lilienthal: "Thanks! I hope I got everything." (038 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: 10Svantje Lilienthal)
[13:46:09] 10Analytics, 10Analytics-Kanban, 10Platform Engineering, 10Research, 10Patch-For-Review: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (10Ottomata) Oh thanks, cool, that's from when we moved this instance over to an-test-client for kerberos rea...
[13:58:03] 10Analytics, 10Analytics-Wikistats: wikistats: montly pageview dumps are not bz2 files - https://phabricator.wikimedia.org/T287684 (10Radim.kubacki) BTW: Parquet compression would be significantly more effective if the line was splitted into its parts, i.e. with fields for wiki code, article, pageId, type, cou...
[13:58:31] OK, presto is working again on the test cluster, using the new CNAME alias and matching Kerberos principal.
[13:59:33] \o/
[13:59:44] does it work also with a simple query from an-test-client?
[14:00:29] https://www.irccloud.com/pastebin/KVaS5rIm/
[14:00:47] Think so.
[14:01:08] yep just tested, all working!
[14:06:17] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10BTullis) A few patches later, presto is working again in the test cluster. We discovered that there is a peculiarity with the `s...
[14:06:56] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10BTullis) I have verified that a simple query works from an-test-client1001. ` btullis@an-test-client1001:~$ presto --catalog ana...
[14:10:12] btullis: btw i was thinking...if you wanted to actually test failover in the test cluster, we could create a new ganeti vm to be an-test-coord1002
[14:11:14] addshore: you can use either main codfw or eqiad
[14:11:19] all the topic data exists in both
[14:11:32] unless you are looking at consumer offset lag metrics or something
[14:11:46] you could also even use jumbo! the topic data is there too
[14:12:11] https://wikitech.wikimedia.org/wiki/Kafka#Kafka_Clusters
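A quick way to sanity-check the host question above is to point kafkacat at the codfw main broker and confirm the job topics are visible there. This assumes kafkacat is installed and the broker answers metadata requests on its default plaintext port, as the beta-cluster examples later in this log do; the topic name is only illustrative.

    # list topics on the codfw main cluster and look for job topics
    kafkacat -L -b kafka-main2001.codfw.wmnet | grep 'mediawiki.job' | head
    # tail the last few messages of one topic (topic name is a made-up example)
    kafkacat -C -b kafka-main2001.codfw.wmnet -t codfw.mediawiki.job.refreshLinks -o -5 -e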
[14:14:20] 10Analytics, 10Analytics-Wikistats: wikistats: montly pageview dumps are not bz2 files - https://phabricator.wikimedia.org/T287684 (10fdans) p:05Triage→03High My apologies for this. The intended format is bz2, not parquet. Clearly a miss of mine when configuring the job, looking into options to regenerate/...
[14:15:55] ottomata: interesting. Yes I hadn't thought of that option. It would give us a means of testing some of the other cluster services as well.
[14:17:00] I've been a big user of corosync/pacemaker in the past for HA services, but we don't use that at all here, do we?
[14:18:54] one thing that I am wondering about presto is how a failover affects the workers
[14:19:29] they do periodically advertise their presence to the query manager
[14:19:56] so in case of a failover, the new query manager is probably unaware of workers
[14:20:12] and needs to get some time to get up to speed
[14:20:28] that is completely fine in my opinion, maybe we could figure out what this timeframe is
[14:20:34] elukey: iirc (and i might not), all prestos could be query managers?
[14:21:19] ottomata: no idea, but the workers need to advertise themselves anyway even if all are query managers
[14:21:38] I don't recall any fixed list of presto workers
[14:22:22] (this is why I was wondering about the failover)
[14:24:12] hm i think you are right
[14:24:15] "When a Presto worker process starts up, it advertises itself to the discovery server in the coordinator, which makes it available to the Presto coordinator for task execution."
[14:24:19] https://prestodb.io/docs/current/overview/concepts.html
[14:25:15] although the 'discovery server' can be run separately from the coordinator if needed
[14:25:16] https://prestodb.io/docs/current/installation/deployment.html
[14:27:28] (03CR) 10Awight: "Looks right—please smoke test on a stat* server at your convenience." (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/709007 (https://phabricator.wikimedia.org/T287578) (owner: 10Svantje Lilienthal)
[14:28:15] yeah interesting elukey indeed what happens.... i guess the workers will just start registering themselves with the cname addy
[14:28:41] hmmm maybe what we need is a forwarding or multiplexing discovery addy!
[14:28:54] so that both discovery servers on both coord nodes get all worker registrations
[14:30:28] The only open source option I've found for HA with current versions of prestodb uses a proxy to forward to an active coordinator, with a standby: https://coding-stream-of-consciousness.com/2018/12/29/presto-coordinator-high-availability-ha/
[14:30:49] But they still say "Any active queries at the time a coordinator fails will fail though – we can't do anything about that unless Presto starts supporting HA internally."
[14:31:57] https://stackoverflow.com/questions/63701904/presto-coordinator-does-not-have-support-for-high-availabiltiy
[14:32:02] yeah active queries failing is fine
[14:32:07] we can't avoid that and it won't be a big deal
[14:35:45] interesting btullis, that setup sounds much cleaner but more complicated than our dns cname thing
[14:35:54] our cname may be ok for our purposes
[14:36:01] 10Analytics, 10EventStreams: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 (10Addshore)
[14:36:03] but it might need some testing of what luca is wondering
[14:36:04] corosync/pacemaker is mentioned as a workable automatic failover mechanism here (in addition to the haproxy option): https://github.com/prestodb/presto/issues/3918#issuecomment-441196092
[14:36:18] what will the workers do when the discovery server changes?
they should just send traffic to the new coord
[14:36:28] and then it will see the workers
[14:36:31] but it might take several minutes
[14:36:39] which...with a dns change is true anyway
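Two checks could answer the "what will the workers do" question empirically. This is only a sketch, assuming the stock PrestoDB config.properties layout on the workers and omitting the TLS/Kerberos client options a real presto CLI invocation from an-test-client would need.

    # confirm the workers point at the CNAME rather than a specific coordinator host
    ssh an-test-presto1001.eqiad.wmnet 'grep discovery.uri /etc/presto/config.properties'
    # expected, per the announcement URL in the error quoted earlier:
    #   discovery.uri=https://analytics-test-presto.eqiad.wmnet:8281
    # after a failover, watch how long it takes the workers to reappear
    presto --execute 'SELECT node_id, http_uri, coordinator, state FROM system.runtime.nodes;'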
[14:36:51] btullis: cool i'm not familiar with those, but they sound cool
[14:36:55] nginx would probably work too
[14:37:11] i think we have a few haproxy uses, def have nginx
[14:37:14] in wmf prod
[14:37:57] might be worth trying in test cluster with a new an-test-coord1002
[14:38:28] ...but if we had a virtual IP that is associated with the CNAME, then during a failover corosync would migrate the VIP and the active presto server as a group. Wouldn't require a DNS change.
[14:38:54] btullis: https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM
[14:39:19] btullis: ya that would be faster than dns for sure. what would be the method for failing over?
[14:39:24] manually?
[14:40:28] fwiw i don't think we need fast failover
[14:40:47] we just need something that allows dashboards and clients to work without changing a host address config
[14:40:59] its ok if running queries fail
[14:42:59] With pacemaker one can do manual or automatic failover of resources. For manual, we can do for example: `sudo crm_migrate -r presto_group an-test-coord1002` to migrate a group.
[14:43:45] Or you could just take a cluster node offline, which migrates all resources away. `sudo crm_node standby an-test-coord1001`
[14:43:54] That sort of thing.
[14:47:36] OK, I'll make a ticket to create an-test-coord1002 and we can assign it and talk about next steps during grooming on Monday, if you think that's a good idea.
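Purely as a sketch of the VIP idea floated above (nothing like this is deployed), a corosync/pacemaker resource group tying a virtual IP to the presto-server unit could look like the following in crmsh. The IP address is a made-up placeholder.

    # a virtual IP plus the presto-server systemd unit, managed as one group
    sudo crm configure primitive presto_vip ocf:heartbeat:IPaddr2 \
        params ip=10.64.0.250 cidr_netmask=32 op monitor interval=10s
    sudo crm configure primitive presto_coord systemd:presto-server op monitor interval=30s
    sudo crm configure group presto_group presto_vip presto_coord
    # manual failover: move the whole group to the other coordinator
    sudo crm resource move presto_group an-test-coord1002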
[14:50:54] ottomata: any idea what might be up with https://stream-beta.wmflabs.org/v2/ui/#/?
[14:54:21] also struggling to see the events im triggering in beta kafka
[14:54:25] I see output in mw logs of `wikidatawiki 1.37.0-alpha EventBus DEBUG: Using destination_event_service eventgate-analytics-external for stream wd_propertysuggester.server_side_property_request.`
[14:54:28] but nothing in kafka
[15:09:52] ohai michaelcochez
[15:13:08] FYI this all relates to https://phabricator.wikimedia.org/T285098#7248774
[15:24:47] (03CR) 10Mholloway: "> Patch Set 3:" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320) (owner: 10Mholloway)
[15:31:45] 10Quarry: quarry-web-01 leaks files in /tmp - https://phabricator.wikimedia.org/T238375 (10Andrew) 05Open→03Resolved As of today the oldest files in /tmp are from the 26th, so I think tmpreaper is doing its job.
[16:14:54] 10Analytics, 10EventStreams: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 (10Michaelcochez)
[18:14:12] 10Analytics, 10Analytics-Wikistats: Translations? - https://phabricator.wikimedia.org/T287661 (10Sabeloga) 05Resolved→03Open Hi again, there seems to have been some errors when the translations were carried over. Some text that is translated on Translatewiki doesn't appear translated on site (like [[ https...
[18:29:00] (03CR) 10Sharvaniharan: "@Ottomata @Michael Holloway would it be possible to merge this if you both are done with the review? I am trying to get it in, in this rel" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/708597 (https://phabricator.wikimedia.org/T287652) (owner: 10Sharvaniharan)
[18:35:37] (03CR) 10Ottomata: "Uhhhhh i dunno what I was thinking...when I read this code the first time I thought you were extracting a string field value from the retu" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320) (owner: 10Mholloway)
[18:35:40] (03CR) 10Ottomata: [C: 03+1] Add Refine transform function to add normalized host [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/705021 (https://phabricator.wikimedia.org/T251320) (owner: 10Mholloway)
[18:41:56] 10Analytics, 10EventStreams: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 (10Ottomata) Ah, this was because I recommended to @Michaelcochez to use `+wikidatawiki` to add the stream config entries. This is fine for MW c...
[18:44:03] (03CR) 10Ottomata: [C: 03+1] "+1 from me I'll let Michael merge." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/708597 (https://phabricator.wikimedia.org/T287652) (owner: 10Sharvaniharan)
[18:44:23] (03CR) 10Ottomata: [C: 03+2] "Oh, Michael already +1ed, merging." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/708597 (https://phabricator.wikimedia.org/T287652) (owner: 10Sharvaniharan)
[18:45:07] (03Merged) 10jenkins-bot: Migrate MobileWikiAppNotificationInteraction from legacy to MEP Bug: T287652 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/708597 (https://phabricator.wikimedia.org/T287652) (owner: 10Sharvaniharan)
[18:52:56] addshore: michaelcochez yt?
[18:53:07] 10Analytics, 10EventStreams, 10Patch-For-Review: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 (10Ottomata) Nope. I think mediawiki-config currently does not allow us to override default settings for beta. Hm.
[18:56:30] 10Analytics, 10EventStreams, 10Patch-For-Review: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 (10Ottomata) I think we need to either: Declare the streams in both wikidatawiki and metawiki in InitialiseSettings-labs.p...
[18:56:48] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Metrics-Platform: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (10Ottomata)
[18:58:05] 10Analytics, 10EventStreams, 10Patch-For-Review: stream-beta.wmflabs.org seems broken (can't see my mediawiki-create events) or anything else - https://phabricator.wikimedia.org/T287760 (10Ottomata) @Michaelcochez @Addshore I'd go ahead and just move these configs to InitialiseSettings.php myself (nothing wi...
[19:17:15] ottomata: here now, checking the tasks.
[19:20:18] My understanding is that if the configuration is done in InitialiseSettings.php, then it is also applied in beta, as the InitialiseSettings-labs.php is only the additional parts, correct?
[19:20:35] In beta we are/should be ready to produce events.
[19:21:17] After we know the events are fine there, we plan to move to test.
[19:24:51] yes
[19:25:14] michaelcochez: is the producer code only in beta, or is it already in prod?
[19:25:24] Only in beta for now.
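One way to confirm that beta picked up the stream declarations, if I'm remembering the EventStreamConfig extension's API module and parameter names correctly (treat both as assumptions), is to ask the beta wiki for its stream configs directly:

    # query the beta wikidata wiki for the newly declared streams
    curl -s 'https://wikidata.beta.wmflabs.org/w/api.php?action=streamconfigs&format=json' \
        | grep -o 'wd_propertysuggester[^"]*'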
[19:25:41] ok, then lets do it in InitialiseSettings.php so you can test
[19:25:42] doing now
[19:26:18] As we need the events for the A/B testing, we want to make sure they work before moving to test/prod.
[19:33:48] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/709098
[19:35:08] lgtm. except for consistency of indentation :-)
[19:35:28] (spaces vs. tabs war ahead)
[19:36:15] side question: what is 'canary_events_enabled' => true, ? I just mimicked the examples without knowing what that is.
[19:39:55] Jenkins has chosen sides already it seems. He is keeping tabs on the formatting.
[19:40:57] 20887 still has spaces.
[19:41:23] ^fixed now.
[19:47:13] One more question: does the kafka stream get created when defined in this config, or only once there are actual events generated?
[19:52:09] One more thing. The root volume of deployment-kafka-jumbo-2.deployment-prep.eqiad1.wikimedia.cloud is nearly 100% full. Not sure that matters.
[19:57:13] waiting for post merge to get it deployed in beta to check
[19:57:43] michaelcochez: re canary
[19:57:48] i should make some docs to link you but
[19:57:48] https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#examples
[19:57:55] they are for monitoring
[19:58:06] we should switch that to being default true, but there are some streams we'd have to disable it for
[19:58:14] i'll make a task for that
[19:58:45] https://phabricator.wikimedia.org/T251609
[19:59:36] 10Analytics, 10Event-Platform: Enable canary events for streams by default - https://phabricator.wikimedia.org/T287789 (10Ottomata)
[19:59:37] https://phabricator.wikimedia.org/T287789
[19:59:47] 10Analytics-Radar, 10Growth-Scaling, 10Product-Analytics, 10Growth-Team (Current Sprint): Growth: shorten welcome survey retention to 90 days - https://phabricator.wikimedia.org/T275171 (10Etonkovidova) 05Open→03Resolved
[19:59:49] 10Analytics-Radar, 10Growth-Scaling, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Product-Analytics (Kanban): Growth: update welcome survey aggregation schedule - https://phabricator.wikimedia.org/T275172 (10Etonkovidova)
[19:59:51] Clear. So in this stream we will also find these canary events once in a while. (good to know when we do the A/B analysis)
[19:59:53] kafka topics don't get created until there are events produced to them
[20:00:01] they will be in the stream
[20:00:03] but not in the hive table
[20:00:07] they are filtered out there
[20:00:23] Clear
[20:01:06] There was an older task for this it seems: https://phabricator.wikimedia.org/T266798
[20:01:27] 6 linked already.
[20:01:54] yea thats to enable for all and is complicated
[20:02:02] because there are many consumers already that won't be expecting the canary events
[20:02:10] the one i made is just to make it enabled by default
[20:02:17] but still disable it for the complicated streams
[20:07:39] michaelcochez: yeah / was full on that node
[20:07:45] dunno why, but i cleared some old logs out
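For completeness, finding and freeing space on the beta kafka host could be as simple as the following; this is a generic illustration, not necessarily what was actually run.

    ssh deployment-kafka-jumbo-2.deployment-prep.eqiad1.wikimedia.cloud
    df -h /                                                      # confirm the root volume is nearly full
    sudo du -xsh /var/log/* 2>/dev/null | sort -h | tail -n 5    # find the biggest log directories
    sudo journalctl --vacuum-time=7d                             # reclaim space from old journal logs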
[20:21:33] "Post-merge build succeeded." So, I assume this might be there now?
[20:25:42] I'm on a boat now, but excited to see this moving :)
[20:26:56] michaelcochez: yeah its there... but i think that disk full did cause some kafka issues
[20:26:58] resolving...?
[20:27:08] was about to say "here you go it works!!!! but now something else..."
The volume for kafka itself seemed separate from /
[20:30:20] yeah
[20:30:32] your new topic was created, but did not look healthy
[20:30:36] it was not assigned a kafka broker leader
[20:30:39] i do not know why
[20:30:48] i deleted it and am trying to recreate it, but things seem a little stuck
[20:30:56] also this https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/709103
[20:31:06] the removal of that caused the stream-beta thing to not work right
[20:31:15] it tried to subscribe to topics that didn't exit
[20:31:17] exist
[20:36:50] grr beta things always seem to get stale and stop working after long times
[20:37:06] i don't have a lot of time to totally figure out whats wrong with kafka atm
[20:37:15] i could wipe it in beta and start clean, no one would mind there...
[20:37:15] ghm
[20:37:23] ya going to do that
[20:45:51] kafkacat -L -b deployment-kafka-jumbo-2.deployment-prep.eqiad1.wikimedia.cloud | grep propertysuggester
[20:45:51] topic "eqiad.wd_propertysuggester.client_side_property_request" with 1 partitions:
[20:45:51] topic "eqiad.wd_propertysuggester.server_side_property_request" with 1 partitions:
[20:45:51] 8-)
[20:49:21] ya better after wiping
[20:49:26] but still your events are not coming through!
[20:49:28] gRrRR
[20:49:34] i can produce an event manually via curl through eventgate
[20:49:42] and i can force browser to send
[20:50:50] I see events.
[20:51:29] oh wait
[20:51:30] they are!
[20:51:33] i was looking at server side doh
[20:51:37] oh great!!!!
[20:51:44] Server side should be there as well...
[20:51:59] do you see them in eventstreams ui?
[20:52:04] at stream-beta?
[20:52:12] kafkacat -C -b deployment-kafka-jumbo-2.deployment-prep.eqiad1.wikimedia.cloud -t eqiad.wd_prertysuggester.server_side_property_request
[20:52:15] aye
[20:52:53] ok i don't know why i can't see them in stream-beta
[20:52:59] but michaelcochez i have to run!
[20:53:09] ok if we wait til monday to fix that bit?
[20:53:12] i think you can test now ya
[20:53:27] btw, if you have issues producing, you can POST events to
[20:53:27] http://deployment-eventgate-3.deployment-prep.eqiad.wmflabs:8492/v1/events
[20:53:27] We can look at them with kafkacat. Works for now.
[20:53:31] Thanks a million
[20:53:35] without the ?hasty=true bit
[20:53:38] and it will return the error
[20:53:40] e.g.
[20:53:50] cat e.json
[20:53:50] {"$schema":"/analytics/mediawiki/wd_propertysuggester/client_side_property_request/1.0.0","dt":"2021-07-30T20:42:54.748Z","entity_id":"Q393194","event_id":"162765789408917eb23f8","meta":{"stream":"wd_propertysuggester.client_side_property_request","domain":"wikidata.beta.wmflabs.org"},"num_characters":2,"session_id":"YQQT2X7pLd-EyzjN7IWIAAAAAFI","user_id":"17eb23f8"}
[20:53:56] curl -H 'Content-Type: application/json' -X POST -v -d@e.json 'http://deployment-eventgate-3.deployment-prep.eqiad.wmflabs:8492/v1/events'
[20:53:59] ok laterz!
[22:21:48] Doing the roll restart on the druid test cluster (1 node)
[22:21:50] sudo cookbook sre.druid.roll-restart-workers test
[22:22:17] !log razzi@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers test
[22:22:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:22:35] Will do the actual nodes on Monday :)
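Regarding the topic above that "was not assigned a kafka broker leader": kafkacat's metadata listing can be scoped to a single topic to check that. The broker ids shown are illustrative, not the beta cluster's actual ids.

    kafkacat -L -b deployment-kafka-jumbo-2.deployment-prep.eqiad1.wikimedia.cloud \
        -t eqiad.wd_propertysuggester.server_side_property_request
    # healthy:  partition 0, leader 1001, replicas: 1001, isrs: 1001
    # broken:   partition 0, leader -1 (no leader elected), which matches the symptom above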