[00:21:32] RECOVERY - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:59:29] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review, 10User-razzi: Increase Superset Timeout - https://phabricator.wikimedia.org/T294771 (10razzi) SGTM @BTullis hope you can figure it out Here's my understanding of what could be timing out ===== client ===== - javascript makes ajax request...
[02:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[06:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[07:29:48] hello folks
[07:29:55] I am going to move kafka test to the new fixed uid/gid
[07:29:56] Good morning elukey
[07:30:00] bonjour :)
[07:30:20] fixed uids all over the place \o/
[07:32:26] in theory it is a little pain now, but we'll reimage brokers way more easily with it
[07:33:09] (I am working on upgrading kafka-main to buster)
[07:33:13] Yes, I remember the arguments and actions taken when you did it for Hadoop
[07:33:26] !log move kafka-test to fixed uid/gid
[07:33:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:46:49] 10Data-Engineering, 10observability, 10serviceops: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982 (10elukey)
[07:47:33] 10Data-Engineering, 10observability, 10serviceops: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982 (10elukey)
[07:48:02] 10Data-Engineering, 10observability, 10serviceops: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982 (10elukey)
[07:48:34] opened a task :)
[07:49:07] 10Data-Engineering, 10observability, 10serviceops: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982 (10elukey)
[07:59:05] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) Adding some Data-Engineering considerations, with the assumption that the new `sflow` stream would be comparable to...
[08:08:34] Hi btullis, elukey, joal: I have just read your messages yesterday about the job causing problems to the cluster, thanks for noticing and sorry for the inconvenience. joal already contacted me to reduce pressure on shufflers
[08:09:00] Hi elaragon - shall we talk here instead of by email?
[08:09:20] as you wish, both are fine to me
[08:10:40] elukey: I think the problem comes from having a lot of data to shuffle (Tbs of text) and also skewed data on the smaller data side (group by wiki, page_id is extremely skewed toward certain pages containing really a lot of revisions)
[08:10:55] woops sorry elukey - missed my ping - I meant elaragon --^
[08:11:42] nono very interesting :)
[08:12:10] elaragon: I have a solution to offer: instead of doing the work of selecting that latest revision on the cluster, use the dumps containing that only (wmf.mediawiki_wikitext_current)
[08:12:22] joal: the group by that you mentioned means that some nodes get more data via shufflers, ending up in more pressure on the heap etc.. right?
[08:15:51] I think you're right elukey
[08:15:56] <3
[08:16:17] makes sense, thanks :)
[08:17:37] elukey: I don't have a precise explanation as to why the shuffler breaks - too much data written, too much data read, both - but it's something in there
[08:27:55] joal: IIRC the shuffle data is stored as files (temporarily) on disk right? If there is a ton of data/files that the external shuffle service needs to track, then it is easy to fill up the heap (this is my understanding)
[08:29:00] with Spark 3 we'll have metrics for the external shuffle service, I think those will reveal a lot of interesting things :)
[08:29:11] Yes!!
[09:27:59] > I am going to move kafka test to the new fixed uid/gid
[09:27:59] elukey: Great work! Are you planning to apply the change yourself to all Kafka clusters in T296982 ? Feel free to let me know if you'd like any help with anything.
[09:28:00] T296982: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982
[09:28:44] btullis: thanks! If you want to take care of Jumbo I'd be super happy, I'll surely do Kafka main and in case Logging (if observability is busy)
[09:28:54] there is no real rush, it is just to ease future upgrades
[09:29:02] with the hiera flags we can do it anytime
[09:29:03] :)
[09:29:34] OK, cool. I'll happily do Jumbo.
[09:36:01] super thanks :)
[10:16:26] 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Puppet: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (10jcrespo) 05Resolved→03Open ` from: SYSTEMDTIMER to: root@cumin2001.codfw.wmnet Output of systemd t...
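A rough PySpark sketch of joal's 08:12 suggestion above: skip the skewed shuffle by reading the current-revision dump instead of deriving the latest revision from the full history. Only wmf.mediawiki_wikitext_current is named in the discussion; the history table, column names and snapshot partition below are assumptions for illustration.

```python
# Sketch only: apart from wmf.mediawiki_wikitext_current (named above), the
# table and column names here are assumptions, not the actual job.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("latest-revision-example").getOrCreate()

# Expensive pattern: derive the latest revision per page from the full history.
# The window forces a shuffle keyed on (wiki_db, page_id), which is heavily
# skewed because a few pages carry a very large number of revisions.
history = spark.table("wmf.mediawiki_wikitext_history")           # assumed table
w = Window.partitionBy("wiki_db", "page_id").orderBy(F.col("revision_timestamp").desc())
latest_via_shuffle = (
    history
    .where(F.col("snapshot") == "2021-11")                         # assumed partition layout
    .withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .drop("rn")
)

# Cheaper pattern: the current-revision dump already contains only the latest
# revision of every page, so no skewed shuffle is needed to obtain it.
latest_from_dump = (
    spark.table("wmf.mediawiki_wikitext_current")
    .where(F.col("snapshot") == "2021-11")                          # assumed partition layout
)
```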
[10:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[10:39:39] 10Data-Engineering, 10Data-Engineering-Kanban, 10observability, 10serviceops: Move kafka-jumbo to a fixed uid - https://phabricator.wikimedia.org/T296990 (10BTullis)
[10:39:49] 10Data-Engineering, 10Data-Engineering-Kanban, 10observability, 10serviceops: Move kafka-jumbo to a fixed uid/gid - https://phabricator.wikimedia.org/T296990 (10BTullis)
[10:41:44] 10Data-Engineering, 10Data-Engineering-Kanban, 10observability, 10serviceops: Move kafka-jumbo to a fixed uid/gid - https://phabricator.wikimedia.org/T296990 (10BTullis) p:05Triage→03Medium
[11:46:34] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (10BTullis)
[11:47:45] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (10BTullis) The transfer of these four snapshots completed successfully: ` btullis@aqs1010:~$ sudo du -sh /srv/cassandra-...
[11:50:48] joal: Do you think it best to wait until compactions have finished on aqs_next, before reloading the four `mediarequest_perfile` snapshots?
[11:50:51] https://usercontent.irccloud-cdn.com/file/RgEKjkvi/image.png
[13:22:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2021_11 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_public - https://alerts.wikimedia.org
[13:22:18] (DruidSegmentsUnavailable) firing: More than 5 segments have been unavailable for mediawiki_history_reduced_2021_11 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_public - https://alerts.wikimedia.org
[13:42:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2021_11 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_public - https://alerts.wikimedia.org
[13:42:18] (DruidSegmentsUnavailable) resolved: More than 5 segments have been unavailable for mediawiki_history_reduced_2021_11 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_public - https://alerts.wikimedia.org
[13:42:54] (03CR) 10Gehel: [C: 03+1] "LGTM" [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/730282 (owner: 10ODimitrijevic)
[13:43:00] --^ This alert is interesting. Looks like it must have been caused by this job: mediawiki-history-reduced-coord. Would that be right?
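A side note on the DruidSegmentsUnavailable alerts above: one way to confirm that "unavailable" segments are simply still being loaded (as btullis concludes just below) is to ask the Druid coordinator for its load status. A minimal sketch, assuming network access to the coordinator; the host and port are placeholders, while /druid/coordinator/v1/loadstatus is the standard coordinator endpoint returning percent-loaded per datasource.

```python
# Rough sketch: report how much of each Druid datasource is currently loaded.
import json
import urllib.request

COORDINATOR = "http://druid-coordinator.example:8081"  # placeholder host/port

with urllib.request.urlopen(f"{COORDINATOR}/druid/coordinator/v1/loadstatus") as resp:
    load_status = json.load(resp)  # {datasource: percent_loaded, ...}

for datasource, percent_loaded in sorted(load_status.items()):
    flag = "" if percent_loaded >= 100.0 else "  <-- still loading"
    print(f"{datasource}: {percent_loaded:.1f}%{flag}")
```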
[13:45:15] The dashboard link is wrong, for one thing. It adds `var-cluster` twice. But again it looks like the alert is just being too strict. It's firing on segments which are still in the process of being loaded as fast as possible.
[13:46:18] (03PS3) 10Gehel: exclude conflicting dependencies [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/730282 (owner: 10ODimitrijevic)
[13:46:30] (03CR) 10Gehel: [C: 03+1] "LGTM" [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/730282 (owner: 10ODimitrijevic)
[14:23:39] I think you're right btullis
[14:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[14:39:20] btullis: that makes sense, the history job finished last night and the history-reduced job is one of the bigger load operations I think (I'm not sure how it compares to NetFlow or other big datasets I have less experience with)
[14:48:29] milimetric: mw-history-reduced and edit-hourly are the only jobs reloading full datasets - others are incremental
[14:50:11] ah, makes sense, it's the "this isn't how Druid is supposed to work" use cases
[14:51:21] correct sir
[15:36:15] o/ I wonder what's the best way to push data from a spark job to kafka-main (eqiad or codfw), my data should be small so I'm pondering between:
[15:36:52] 1/ collect data to the spark driver and simply use eventgate
[15:37:45] 2/ use spark -> kafka connector (e.g. https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html)
[15:43:17] context is T279541, tl/dr discrepancies are detected using a spark job and "reconciliation" events are "re-injected" into the pipeline
[15:43:18] T279541: Add a reconciliation strategy to the wdqs streaming updater - https://phabricator.wikimedia.org/T279541
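A hedged sketch of what option 2 above could look like for a small, bounded result set: recent Spark versions allow a plain batch DataFrame to be written through the same Kafka connector, with each executor sending its own partitions (much like the "third way" joal describes below at 17:06). The broker, topic and schema below are placeholders, not the actual wdqs reconciliation setup, and the spark-sql-kafka package needs to be on the classpath.

```python
# Sketch only: broker list, topic and row schema are placeholders for illustration.
# Requires the spark-sql-kafka-0-10 package to be available to the job.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reconciliation-events-example").getOrCreate()

# Pretend this is the (small) set of discrepancies found by the Spark job.
discrepancies = spark.createDataFrame(
    [("Q42", 123456789), ("Q64", 987654321)],
    ["entity_id", "expected_revision"],
)

# Serialize each row as a JSON string in a column named `value` (what the Kafka
# sink expects) and write it as a plain batch, no streaming query involved.
(discrepancies
    .select(F.to_json(F.struct("entity_id", "expected_revision")).alias("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-main.example:9092")   # placeholder
    .option("topic", "example.reconciliation-events")               # placeholder
    .save())
```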
[17:22:41] My understanding is that gobblin is eating up sqoop's job [17:23:51] Also I see aqs1004 is debian 9! Let's upgrade [17:23:58] We also want to restart the `aqs` service on aqs101* - That isn't covered by the cookbook. Something like `sudo cumin aqs101* 'systemctl restart aqs` [17:24:16] We are nearly finished upgrading :-) [17:25:27] razzi: on an-launcher1002 there are several refinery-sqoop-* units [17:25:33] All of the aqs100* servers are going away, as soon as we finish the migration of all of the data to aqs101*. Those servers are currently called aqs_next but hopefully by the end of next week we will have finished migrating. [17:26:11] gotcha [17:26:16] Are they pooled? [17:26:46] Nope, not yet. [17:26:54] cool, no problem to restart all at once then :) [17:26:58] https://config-master.wikimedia.org/pybal/eqiad/aqs [17:27:02] razzi: --^ [17:27:11] I'm not sure that I'm testing the canary properly, but here's the curl command I did: [17:27:41] razzi@aqs1004:~$ curl https://wikimedia.org/apr-types/all-page-types/monthly/2021090100/2021120500 [17:28:00] The idea being to include the month of november (202111xxxx) [17:28:19] correct razzi [17:28:21] But I got the following results: [17:28:23] `"results":[{"timestamp":"2021-09-01T00:00:00.000Z","edits":39035754},{"timestamp":"2021-10-01T00:00:00.000Z","edits":40499735}]` [17:28:30] No november :( [17:29:20] So right now I'm prompted by `Please test aqs on the canary.` - I don't think it'd be a problem to roll out the change everywhere, but there might be some data missing [17:30:11] razzi: is the link above correct? [17:30:14] hm that's weird razzi [17:30:22] it doesn't mention localhost, and it leads to 404 to me [17:30:29] how is the curl supposed to run? [17:31:22] oh huh [17:31:43] I copied the wrong curl :) [17:32:11] The right one is `curl http://localhost:7232/analytics.wikimedia.org/v1/edits/aggregate/all-projects/all-editor-types/all-page-types/monthly/$(date --date "last month" "+%Y%m0100")/$(date "+%Y%m0100")` [17:32:21] and that comes back with November :) [17:32:43] \o/ [17:33:46] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review, 10User-razzi: Increase Superset Timeout - https://phabricator.wikimedia.org/T294771 (10BTullis) Thanks @razzi. I think, but I'm not 100% sure, that it is ATS that is dropping the connection. I used the superset staging environment through a... [17:35:50] Oops forgot to log the cookbook [17:36:21] !log restart aqs to pick up new mediawiki snapshot: `razzi@cumin1001:~$ sudo cookbook sre.aqs.roll-restart aqs` [17:36:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:36:51] !log restart aqs-next to pick up new mediawiki snapshot: `razzi@cumin1001:~$ sudo cumin A:aqs-next 'systemctl restart aqs'` [17:36:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:37:04] Ok it is done [17:46:45] milimetric: Missed the notification on your message yesterday re: deploying wikistats, my bad! I tried to deploy it with ottomata but there was an error in npm run build [17:49:23] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review, 10User-razzi: Increase Superset Timeout - https://phabricator.wikimedia.org/T294771 (10razzi) Interesting @BTullis! I never got more than 65 seconds from the perspective of the client (dev tools) so I think ATS might be timing out at 120 se... 
[17:51:41] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review, 10User-razzi: Increase Superset Timeout - https://phabricator.wikimedia.org/T294771 (10razzi) On that same article you linked, I see a clue: {F34822783}
[17:55:10] Hey, the Wikimedia Stewards would like to know which blocks impact the most users (so we can be especially careful about them). Would it make sense to add a field to the x-analytics header that would contain the ID of a global block, if there's any?
[17:56:26] An alternative solution could be to emit an event when an edit is attempted and prevented by a block. This will not count users who don't click the "edit" (or maybe view source? Not sure how it looks for the blocked ones) button, but who would be affected when they try editing
[17:57:43] Yet another solution could be trying to rebuild the global block history, and calculate the field from the existing data, but I think this would be easier to realize.
[17:58:11] Hi urbanecm - I'm not sure I understand what you're trying to gather here
[17:58:36] urbanecm: by impacting, do you mean preventing someone from editing?
[17:59:58] joal: correct
[18:01:28] urbanecm: there might be a way to have this information from webrequest by looking at request parameters for edit actions, but this feels a lot less precise than having events generated on the special cases
[18:02:12] If we go for webrequest, we have 90 days of data at most (not much)
[18:02:13] the issue is I'd need to take https://meta.wikimedia.org/wiki/special:log/gblblock and convert it into something that tells me when all the blocks were active (taking expiration _and_ unblocks into account)
[18:02:55] this --^ is if you use webrequest, right?
[18:03:14] yeah
[18:03:47] hm - IIRC we try to do something like that in mediawiki_history, but it's not perfect and I think it doesn't include global blocks
[18:04:22] urbanecm: you can have a look at user-blocks data in mediawiki_history, maybe it's precise enough for you?
[18:04:50] if it doesn't include global blocks, it's not sufficient unfortunately
[18:05:18] urbanecm: nonetheless, I still suggest spending some development time in having events, it'll be precise and will surely be useful in the long run
[18:05:25] rig
[18:05:45] right - global blocks are stored in a table we don't yet process for the mediawiki-history :(
[18:06:15] so, to summarize the discussion, your recommendation would be to create an event that's fired when someone tries to edit from a gblocked IP?
[18:07:13] (and is it a good idea to try to reuse schemas, or is it better to just add a new one?)
[18:07:20] absolutely yes - I can even think that more broadly, any blocked edit could generate an event - this would help track them and possibly do some auditing
[18:08:51] urbanecm: side-note - I don't feel confident in you making the decision on me giving that advice :) having milimetric's opinion would comfort me ;)
[18:09:47] eh, I'm not going to just do it tomorrow :). I'm thinking out loud, more or less.
[18:10:08] About schemas, I assume it'd be worth creating a new one, but this would better be discussed with ottomata (off today) and possibly product people interested in events (there is a program to develop generic events, I don't know how advanced they are now)
[18:10:18] cool urbanecm :)
[18:11:15] Ok gone for tonight folks - Have a good weekend!
[18:11:21] thanks again razzi for the friday deploy :)
[18:11:33] see you later joal!
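If the custom-event route discussed above is taken, the payload might look roughly like this. This is a hypothetical sketch only: the schema path, stream name and fields are invented for illustration and are not an existing Event Platform schema, though the $schema/meta envelope follows the usual Event Platform convention.

```python
# Hypothetical "blocked edit attempt" event; every specific name below
# (schema path, stream, fields) is invented for illustration only.
import datetime
import json

blocked_edit_event = {
    "$schema": "/analytics/mediawiki/edit_attempt_blocked/1.0.0",   # hypothetical schema path
    "meta": {
        "stream": "mediawiki.edit_attempt_blocked",                 # hypothetical stream name
        "dt": datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "domain": "en.wikipedia.org",
    },
    "database": "enwiki",
    "page_id": 12345,
    "performer_is_anon": True,
    "block_id": 67890,            # id of the (local or global) block that prevented the edit
    "block_is_global": True,
    "block_expiry": "2022-01-01T00:00:00Z",
}

print(json.dumps(blocked_edit_event, indent=2))
```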
[18:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[18:36:38] razzi: ok, that's probably an npm version thing. There's a docker method, but I'll do the deploy later today, no worries
[18:36:51] Let's make the docker method better!
[18:37:05] There's a todo there to get rid of this tech debt, you're welcome to give it a shot and see what it needs
[18:37:16] (in theory it should just work, but... docker and theory don't get along)
[18:37:22] Ok yeah let me take a stab at formalizing the docker steps into a Dockerfile
[18:41:16] joal / urbanecm : I definitely think a custom event is ideal, as this is an important top level kind of interaction. Getting this kind of data from the mw database will be less and less how we do things, strategically speaking, in my opinion
[18:49:17] 10Analytics, 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Jclark-ctr)
[18:49:28] 10Analytics, 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Jclark-ctr) Servers added to netbox
[19:09:01] 10Analytics, 10Event-Platform, 10SRE, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10herron) p:05Triage→03Medium
[19:19:03] 10Data-Engineering, 10Data-Services, 10Patch-For-Review, 10User-bd808, 10cloud-services-team (Kanban): user_properties_anon view not being created/maintained consistently on wikireplicas due to lack of meta_p in all sections - https://phabricator.wikimedia.org/T294652 (10Andrew) a:05bd808→03AntiCompos...
[21:11:43] 10Analytics-EventLogging, 10Analytics-Radar, 10Product-Analytics, 10WikiEditor, and 2 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (10odimitrijevic)
[21:17:29] 10Analytics, 10Data-Engineering, 10Product-Analytics: A few alterblocks events have event_timestamps from before 2001 - https://phabricator.wikimedia.org/T218824 (10odimitrijevic)
[21:19:13] 10Analytics, 10Data-Engineering: Set entropy alarm in editors per country per wiki - https://phabricator.wikimedia.org/T227809 (10odimitrijevic)
[21:19:47] 10Analytics, 10Data-Engineering: Set entropy alarm in editors per country per wiki - https://phabricator.wikimedia.org/T227809 (10odimitrijevic) p:05Medium→03Low
[21:21:03] 10Analytics, 10Data-Engineering: page_id is null where it shouldn't be in mediawiki history - https://phabricator.wikimedia.org/T259823 (10odimitrijevic) p:05Medium→03Low
[21:25:24] 10Data-Engineering, 10Data-Services, 10User-bd808, 10cloud-services-team (Kanban): user_properties_anon view not being created/maintained consistently on wikireplicas due to lack of meta_p in all sections - https://phabricator.wikimedia.org/T294652 (10AntiCompositeNumber) 05Open→03Resolved a:05AntiCom...
[21:54:29] PROBLEM - Hadoop NodeManager on an-worker1128 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:58:43] 10Data-Engineering, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10odimitrijevic)
[22:06:03] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10odimitrijevic)
[22:11:39] 10Analytics-Clusters, 10Data-Engineering: Move the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10odimitrijevic)
[22:16:03] RECOVERY - Hadoop NodeManager on an-worker1128 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:17:51] 10Analytics-Clusters, 10Data-Engineering: Enforce authentication for Kafka Jumbo Topics - https://phabricator.wikimedia.org/T255543 (10odimitrijevic) p:05Triage→03Low
[22:18:11] 10Analytics-Clusters, 10Data-Engineering: Upgrade Druid to latest upstream (> 0.20.1) - https://phabricator.wikimedia.org/T278056 (10odimitrijevic)
[22:18:36] 10Analytics-Clusters, 10Data-Engineering: Upgrade Druid to latest upstream (> 0.20.1) - https://phabricator.wikimedia.org/T278056 (10odimitrijevic) p:05Triage→03Low
[22:22:21] 10Analytics-Clusters, 10Data-Engineering: Verify if Superset can authenticate to Druid via TLS/Kerberos - https://phabricator.wikimedia.org/T250487 (10odimitrijevic)
[22:22:25] 10Analytics-Clusters, 10Data-Engineering: Verify if Superset can authenticate to Druid via TLS/Kerberos - https://phabricator.wikimedia.org/T250487 (10odimitrijevic) p:05Medium→03Low
[22:22:53] 10Analytics-Clusters, 10Data-Engineering: Verify if Turnilo can pull data from Druid using Kerberos/TLS - https://phabricator.wikimedia.org/T250485 (10odimitrijevic)
[22:24:35] 10Analytics-Clusters, 10Data-Engineering: Set yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds - https://phabricator.wikimedia.org/T269616 (10odimitrijevic)
[22:25:12] 10Analytics-Clusters, 10Data-Engineering, 10Patch-For-Review: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (10odimitrijevic)
[22:26:36] 10Analytics-Clusters, 10Data-Engineering, 10Data-Persistence-Backup: Evaluate possible solutions to backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (10odimitrijevic)
[22:27:17] 10Analytics-Clusters, 10Data-Engineering: Enforce authentication for Druid datasources - https://phabricator.wikimedia.org/T255545 (10odimitrijevic)
[22:29:31] 10Analytics-Clusters, 10SRE, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10odimitrijevic) @razzi is this still relevant?
[22:29:44] 10Analytics-Clusters, 10Data-Engineering, 10SRE, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10odimitrijevic)
[22:31:58] 10Analytics-Clusters, 10Analytics-Radar, 10SRE: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10odimitrijevic)
[22:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[22:37:36] 10Analytics-Clusters, 10Data-Engineering, 10Infrastructure-Foundations, 10SRE, 10netops: Automate ingestion of netflow event stream - https://phabricator.wikimedia.org/T248865 (10odimitrijevic)
[22:43:09] 10Analytics-Clusters: Automate refinery jar cleanup - https://phabricator.wikimedia.org/T159337 (10odimitrijevic) p:05Low→03Lowest Unless the size of the repository is becoming an issue I am inclined to decline this task.
[22:43:52] 10Analytics-Clusters, 10Data-Engineering, 10SRE, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10razzi) Yes, superset still uses firejail: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/pro...
[22:52:19] 10Analytics-Clusters, 10Data-Engineering: hdfs password file for mysql should be re-generated when the password file is changed by puppet - https://phabricator.wikimedia.org/T170162 (10odimitrijevic)
[22:52:42] 10Analytics-Clusters, 10Data-Engineering: hdfs password file for mysql should be re-generated when the password file is changed by puppet - https://phabricator.wikimedia.org/T170162 (10odimitrijevic) p:05Low→03Lowest
[22:53:43] 10Analytics-Clusters, 10Data-Engineering: Review recurrent Hadoop worker disk saturation events - https://phabricator.wikimedia.org/T265487 (10odimitrijevic)
[22:54:13] 10Analytics-Clusters, 10Data-Engineering: Review recurrent Hadoop worker disk saturation events - https://phabricator.wikimedia.org/T265487 (10odimitrijevic) p:05Medium→03Low
[22:58:34] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10odimitrijevic)
[23:23:46] 10Data-Engineering, 10Data-Services, 10User-bd808, 10cloud-services-team (Kanban): user_properties_anon view not being created/maintained consistently on wikireplicas due to lack of meta_p in all sections - https://phabricator.wikimedia.org/T294652 (10bd808) Thanks for the work to actually roll this out @a...
[23:45:06] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Observability-Alerting: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911 (10odimitrijevic) p:05Triage→03Medium
[23:47:13] 10Data-Engineering, 10Airflow: Install spark3 - https://phabricator.wikimedia.org/T295072 (10odimitrijevic) p:05Triage→03High
[23:47:49] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: [Airflow] Set up scap deployment - https://phabricator.wikimedia.org/T295380 (10odimitrijevic) p:05Triage→03High
[23:48:13] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10odimitrijevic)
[23:50:24] 10Data-Engineering, 10Data-Engineering-Kanban, 10Phabricator: Herald rule for Data-Engineering - https://phabricator.wikimedia.org/T295397 (10odimitrijevic) Thanks @Milimetric. I moved the Analytics-Clusters tasks to the DE board and can be deprecated. I updated the task accordingly. Agreed on cleaning the #...