[03:08:35] Data-Engineering, Structured-Data-Backlog: Make HTML Dumps available in hadoop - https://phabricator.wikimedia.org/T305688#9652699 (Cpetrillo) Hi folks - chiming in here from the Enterprise side. @fkaelin is correct that our snapshots are not historical. We likely will not be able to support that piece o...
[05:57:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:07:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:58:56] Data-Engineering, Observability-Logging, Traffic, Event-Platform, Patch-For-Review: Remove extra fields currently sent to Kafka - https://phabricator.wikimedia.org/T360642#9653079 (gmodena)
[08:12:21] My team is planning to run a few (very) long jobs on stat1009 over the next several weeks. We intend to use c. 16 CPU cores.
[08:12:26] Data-Engineering, Observability-Logging, Traffic, Event-Platform, Patch-For-Review: Remove extra fields currently sent to Kafka - https://phabricator.wikimedia.org/T360642#9653099 (gmodena) > These are the fields that are sent from Benthos that aren't present in the current webrequest stream:...
[08:13:35] One thing I haven't figured out yet is whether there's a convenient local or remote volume where we can output ~3.4GB of data which is shared-writeable between several users (fine-grained access control not needed).
[08:14:16] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956#9653101 (gmodena) Tagging {https://phabricator.wikimedia.org/T360642}
[08:14:24] Maybe we'll write directly to a /srv/published subdir, since that's the final destination anyway.
[08:31:10] (PS7) Santiago Faci: Update the WikiLambda instrumentation to use core interaction events [schemas/event/secondary] - https://gerrit.wikimedia.org/r/992224 (https://phabricator.wikimedia.org/T350497)
[08:42:32] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14): Monitor the availability of the superset deployments - https://phabricator.wikimedia.org/T356484#9653129 (Gehel)
[08:44:06] Quarry, ChangeProp, collaboration-services, GitLab, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9653141 (Jelto)
[08:46:00] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Update the From: addresses of all email from DPE pipelines so that they use routable addresses - https://phabricator.wikimedia.org/T358675#9653154 (Gehel)
[08:46:40] Data-Engineering, Dumps-Generation, SRE, Data-Platform-SRE (2024.03.25 - 2024.04.14): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9653160 (Gehel)
[08:51:18] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14): Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895#9653232 (Gehel)
[08:52:05] (PS8) Phuedx: Update the WikiLambda instrumentation to use core interaction events [schemas/event/secondary] - https://gerrit.wikimedia.org/r/992224 (https://phabricator.wikimedia.org/T350497) (owner: Santiago Faci)
[08:55:26] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review: Cleanup superset related resources from puppet - https://phabricator.wikimedia.org/T358570#9653273 (Gehel)
[08:55:31] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710#9653271 (Gehel)
[08:56:09] Data-Engineering, Observability-Logging, Traffic, Event-Platform, Patch-For-Review: Remove extra fields currently sent to Kafka - https://phabricator.wikimedia.org/T360642#9653275 (Fabfur) >>! In T360642#9653099, @gmodena wrote: >> These are the fields that are sent from Benthos that aren't p...
[08:57:55] Data-Engineering, superset.wikimedia.org, Data-Platform-SRE (2024.03.25 - 2024.04.14): Superset Timeout Logging - https://phabricator.wikimedia.org/T294772#9653292 (Gehel)
[08:59:48] Data-Engineering, Data-Platform-SRE, serviceops, SRE, Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#9653302 (Gehel)
[09:34:36] Data-Engineering, Wikidata, Wmfdata-Python, Wikidata Analytics (Kanban): Add linter and formatter to wmfdata-python (and link check) - https://phabricator.wikimedia.org/T348999#9653388 (AndrewTavis_WMDE) @nshahquinn-wmf, @xcollazo: checking in on this one again. I would have some time in the next...
[09:53:20] (CR) Phuedx: "A couple of minor points inline." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/992224 (https://phabricator.wikimedia.org/T350497) (owner: Santiago Faci)
[10:11:28] Data-Engineering, Discovery-Search, Java-Scala-Standardization, Metrics Platform Backlog, and 3 others: Adapt gitlab pipelines for the new wmf-jvm-parent-pom - https://phabricator.wikimedia.org/T358841#9653451 (Gehel)
[10:12:51] Data-Engineering, Discovery-Search, Java-Scala-Standardization, Metrics Platform Backlog, and 3 others: Adapt gitlab pipelines for the new wmf-jvm-parent-pom - https://phabricator.wikimedia.org/T358841#9653462 (Gehel) a:Gehel
[10:35:20] brouberol, btullis: I was just looking at an-test-client1002 failed puppet run (alerting for a while in #wikimedia-data-platform-alerts). It seems that `userdel` cannot remove a user as there is a systemd/pam process still running.
[10:35:43] I'm not up to date enough on how that works. Is it safe for me to just kill it and restart puppet?
[10:40:28] gehel: Yes, you can just kill that process. It is something that the I/F team attempts to do during their offboarding scripts, but it has been known to leave some lingering processes around.
[10:40:47] Thanks for dealing with the alerts too :-)
[10:41:10] my pleasure!
[10:41:50] having a dedicated channel for alerts makes it slightly easier to parse for me.
[10:42:40] we've had a few alerts lingering for a bit too long lately, so I try to do my part to stay on top of those...
[10:43:03] * btullis Agree, thanks.
[10:44:07] !log shut down an-worker1168 to investigate disk controller failure for T360594
[10:44:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:44:11] T360594: an-worker1168 in a weird statue, possibly due to I/O errors - https://phabricator.wikimedia.org/T360594
[10:47:47] (PS9) Santiago Faci: Update the WikiLambda instrumentation to use core interaction events [schemas/event/secondary] - https://gerrit.wikimedia.org/r/992224 (https://phabricator.wikimedia.org/T350497)
[12:58:03] Data-Engineering, superset.wikimedia.org, Data-Platform-SRE (2024.03.25 - 2024.04.14): Superset Timeout Logging - https://phabricator.wikimedia.org/T294772#9653714 (BTullis) I'm not 100% sure that this ticket is necessary any more. For context, it was created at a time when we had only 5 presto worke...
[12:59:52] Data-Engineering, Data Products, Data-Platform, Movement-Insights: Wikistats "Active Editors by Country" does not follow definition for active editors - https://phabricator.wikimedia.org/T360073#9653756 (CMyrick-WMF) > * Does this relate to a current OKR? >> I'm not sure if this relates to a curr...
[13:00:56] (PS1) Cparle: sqoop the data from the machinevision tables before they're dropped [analytics/refinery] - https://gerrit.wikimedia.org/r/1013531 (https://phabricator.wikimedia.org/T352884)
[13:01:00] (PS2) Cparle: [DNM] sqoop the data from the machinevision tables before they're dropped [analytics/refinery] - https://gerrit.wikimedia.org/r/1013531 (https://phabricator.wikimedia.org/T352884)
[13:01:08] (CR) Cparle: [C:-1] "-1 because this isn't intended to be merged - it's a temporary patch for sqooping some data from tables that will be dropped soon" [analytics/refinery] - https://gerrit.wikimedia.org/r/1013531 (https://phabricator.wikimedia.org/T352884) (owner: Cparle)
[13:12:49] Data-Engineering, Data-Platform-SRE, Infrastructure-Foundations, SRE: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802#9653901 (BTullis)
[13:15:37] Data-Engineering, Data-Platform-SRE, Infrastructure-Foundations, SRE: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802#9653931 (BTullis) I'm planning to carry out {T358196} shortly, which I believe may have a beneficial impact on this ticket. I'll not merge them...
[13:17:44] !log `elukey@cumin1002:~$ sudo cumin 'stat100[4,5,8,9]*' 'kill `pgrep -u kcv-wikimf`'` to unblock puppet on various stat nodes
[13:17:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:18:06] folks puppet on various stat nodes was broken since days ago :)
[13:18:46] elukey: Oh sorry, I didn't spot it. This is my first day back in a while.
[13:18:52] helloooo
[13:19:02] no problem I figured, just mentioned in here :)
[13:20:40] Ah that running processes check. I think that m.oritz was working on that a while ago as part of improving the offboarding script.
[13:22:42] * elukey nods
[13:23:06] one question - who manages cassandra on aqs at the moment? Data Platform or Persistence?
[13:23:21] there are some warnings for cert expiring, I am wondering if we could migrate to PKI
[13:23:48] Data Persistence looks after all of these machines now.
[13:23:59] perfect will ping them thanks :)
[13:24:03] It seems like a good idea though.
[13:24:25] everything is already in place, I opened https://phabricator.wikimedia.org/T352647 a while ago and this seems to be a good occasion
[13:26:31] Nice. I know that there is also a push to remove any legacy cergen certificates and move them to PKI. T357750
[13:26:31] last one I forgot: IIUC aqs 1.0 (the nodejs on the aqs nodes) is not used anymore, and there is aqs 2.0 running somewhere (I guess on k8s?)
[13:26:32] T357750: Phase out cergen - https://phabricator.wikimedia.org/T357750
[13:26:57] TIL thanks!
[13:27:08] I know that's not the case here, these are just exposed puppet CA certificates on Cassandra, but worth knowing about.
[13:29:47] You're correct the nodejs based AQS 1.0 on the aqs* hosts is not used at all. There are some active cleanups happening at the moment, but I haven't been very involved.
[13:31:07] And Yes, AQS 2.0 is all k8s - Currently six separate services on wikikube: https://wikitech.wikimedia.org/wiki/AQS_2.0
[13:31:40] They still use the same Druid/Cassandra clusters for serving data, but are no longer co-located with Cassandra.
[13:45:42] elukey: This is the cleanup for AQS 1.0 - T358793
[13:45:43] T358793: Decommission AQS 1.0 - https://phabricator.wikimedia.org/T358793
[13:46:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:47:47] btullis: ack thanks!
[13:57:06] (CR) Phuedx: Update the WikiLambda instrumentation to use core interaction events (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/992224 (https://phabricator.wikimedia.org/T350497) (owner: Santiago Faci)
[14:14:29] btullis: brouberol and myself undeployed aqs1 yesterday, BTW
[14:14:44] there's only a few cleanups left, the service itself is gone
[14:16:23] Data-Engineering, Data-Platform-SRE, serviceops, SRE, Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#9654089 (brouberol) Starting today (at least for the `staging-codfw` and `dse-k8s-eqiad` clusters), apps running in Kubernetes we can use...
[14:17:43] speaking of moritzm, let's pair on Monday to fully decom the realservers, IPVS entry and VIP, if that's alright? I'm always nervous when interacting with PyBal
[14:19:53] brouberol: we can pair up for this! but I'm off next week, so we can do it the week after the Easter weekend. given these are just cleanups, there's no real rush anyway
[14:20:55] ack. I'm not sure everyone feels that way, as we have quite a few silenced alerts atm. If I can find someone to pair with, I'll try to cleanup next week. If not, I'd appreciate the help the week after!
[14:28:12] you can also simply ping the #wikimedia-traffic channel, they are always happy to review/look over LVS changes
[14:38:17] will do, thanks!
[14:40:21] brouberol: happy to help with that, ping us on Monday :)
[14:40:50] thanks sukhe, will do !
[14:40:52] https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service
[14:41:02] happy to help
[15:19:30] Data-Engineering, Data Products: project-title-country missing US data in recent data, and double quote escaping - https://phabricator.wikimedia.org/T341139#9654322 (VirginiaPoundstone) @Htriedman is this ticket ready to close?
[15:40:35] Data-Engineering, Data Products: NEW BUG REPORT - Pageviews Missing Hourly Partition - https://phabricator.wikimedia.org/T358142#9654410 (VirginiaPoundstone) @JEbe-WMF please rerun this job to see if it heals?
[15:41:59] Data-Engineering, Data Products: NEW BUG REPORT - Pageviews Missing Hourly Partition - https://phabricator.wikimedia.org/T358142#9654416 (VirginiaPoundstone) @lbowmaker is the fact that this keeps happening on this pipeline a platform issue that we should carve out time to resolve together?
[15:44:48] Data-Engineering, Data Products, Pageviews-API, RESTBase-API, and 2 others: There are anomalies in some of the mostread data on zhwiki for March 2024 - https://phabricator.wikimedia.org/T360499#9654421 (VirginiaPoundstone) Thanks for flagging this @Shizhao. It looks like a possible bot detection...
[16:01:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[16:18:15] Data-Engineering, Event-Platform: Implement stream of HTML content on mw.page_change event - https://phabricator.wikimedia.org/T360794 (lbowmaker) NEW
[16:21:42] Data-Engineering (Sprint 9), Event-Platform: ProduceCanaryEvents job should be scheduled by Airflow and/or a k8s service - https://phabricator.wikimedia.org/T341229#9654560 (lbowmaker)
[16:24:22] (CR) Milimetric: "nice job on the patch, I've run it with the following command from stat1004 (in a clone of the refinery repo with your patch applied):" [analytics/refinery] - https://gerrit.wikimedia.org/r/1013531 (https://phabricator.wikimedia.org/T352884) (owner: Cparle)
[17:10:57] Data-Engineering, Data-Platform-SRE: Alluxio for Improved Superset Query Performance - https://phabricator.wikimedia.org/T288252#9654790 (BTullis)
[17:14:16] Data-Engineering, Wikidata, Wmfdata-Python, Wikidata Analytics (Kanban): Add linter and formatter to wmfdata-python (and link check) - https://phabricator.wikimedia.org/T348999#9654821 (nshahquinn-wmf) @AndrewTavis_WMDE that sounds great to me! I personally have no preference about the package we...
[17:15:23] Data-Engineering, Movement-Insights, Wikidata, Wmfdata-Python, Wikidata Analytics (Kanban): Add linter and formatter to wmfdata-python (and link check) - https://phabricator.wikimedia.org/T348999#9654825 (nshahquinn-wmf)
[17:24:13] Data-Engineering, Movement-Insights, Wikidata, Wmfdata-Python, Wikidata Analytics (Kanban): Add linter and formatter to wmfdata-python (and link check) - https://phabricator.wikimedia.org/T348999#9654871 (AndrewTavis_WMDE) Exciting! I'll play around a bit towards the end of next week and send...
[17:30:51] Data-Engineering, Data-Platform-SRE (2024.03.25 - 2024.04.14), Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710#9654887 (BTullis) >>! In T347710#9645603, @brouberol wrote: > https://superset.wikimedia.org is now served by a servi...
[19:09:47] Data-Engineering, Data Pipelines, SRE, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9655162 (dr0ptp4kt) Okay, if I understand correctly, then the idea would be to... 1. Continue "allowing" tagging of wprov for non-200 HTTP responses. It'...
[19:41:57] Data-Engineering, Observability-Logging, Traffic, Event-Platform, Patch-For-Review: Remove extra fields currently sent to Kafka - https://phabricator.wikimedia.org/T360642#9655231 (Ottomata) > meta.id and meta.request_id `meta.id` is used to uniquely identify an event, and it is usually used...
[20:45:23] Data-Engineering, Structured-Data-Backlog: Make HTML Dumps available in hadoop - https://phabricator.wikimedia.org/T305688#9655410 (dr0ptp4kt) I'm interested as well, as I intend to look at some image dumping stuff, and the surrounding HTML will be important for understanding context. If it isn't too...
[21:28:13] (PS4) Aleksandar Mastilovic: Add HQL query files for the "pingback" report [analytics/refinery] - https://gerrit.wikimedia.org/r/1006970
[21:29:59] (PS5) Aleksandar Mastilovic: Add HQL query files for the "pingback" report [analytics/refinery] - https://gerrit.wikimedia.org/r/1006970
[23:09:03] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) firing: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[23:09:04] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=eqiad.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected