[00:40:09] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0: Page Analytics Service - https://phabricator.wikimedia.org/T288296 (JArguello-WMF)
[01:13:18] (PS1) Clare Ming: Update metrics_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891398 (https://phabricator.wikimedia.org/T330459)
[03:55:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:05:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:33:28] btullis: o/
[08:11:43] started the DSE upgrade to 1.23
[08:23:21] ack
[08:37:01] (CR) Gehel: "one minor comment, otherwise LGTM" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[09:22:43] (PS33) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[09:23:25] (CR) Aqu: "Annotation fixed." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[09:38:56] elukey: Thanks for this. I'm here now in case I can help with anything.
[09:42:11] btullis: np! I tried a shortcut, since the cookbook cannot do parallel reimages for the moment..
the upgrade cookbook is doing ctrl nodes and one worker (1001), the rest is done via manual parallel cookbook reimages (1002->1008)
[09:43:26] Ack, thanks.
[09:43:26] hope that all will finish in ~1h, then we'll need to deploy admin_ng stuff
[09:43:29] and see if all works
[09:44:12] Awesome. I'm still catching up on email after 4 days off, but otherwise around all day. Give us a shout if you'd like to look at anything together.
[10:00:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:09:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:24] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata-Python triggers a Pandas warning during mariadb.run and hive.run - https://phabricator.wikimedia.org/T324135 (nshahquinn-wmf)
[10:20:53] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata-Python triggers a Pandas warning during mariadb.run and hive.run - https://phabricator.wikimedia.org/T324135 (nshahquinn-wmf) Updated the description to recommend suppressing the warning over actually switching to SQLAlchemy.
[10:38:09] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (akosiaris) >>! In T325303#8641318, @Ottomata wrote: > @JMeybohm @akosiaris, we plan to deploy to wikikube by the end of this quart...
[10:45:26] Data-Engineering: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset - https://phabricator.wikimedia.org/T330436 (KinneretG) Thanks for creating this task, @mpopov! Could we also add Partnerships and Community Programs to the list of those who have access?
All three teams...
[10:45:37] (CR) Phuedx: [C: -1] "Since this is adding a property that supersedes another property, I'd argue that this requires incrementing the minor version rather than " [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891398 (https://phabricator.wikimedia.org/T330459) (owner: Clare Ming)
[10:47:57] (CR) Phuedx: [C: -1] Update metrics_event schema (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891398 (https://phabricator.wikimedia.org/T330459) (owner: Clare Ming)
[10:55:25] btullis: o/
[10:55:39] so the good news is that I deployed up to the eventrouter stuff, all good
[10:56:06] the bad news is that the reimage cookbook doesn't work (namely puppet fails) when running on 1005->1008, that are the hosts in row E/F
[10:56:11] very suspicious
[10:56:21] I can't really access any log atm
[11:07:24] elukey: OK, interesting. Thanks for the update.
[11:09:54] Is the host networking OK? Can they ping the core routers after boot etc?
[11:10:18] Would you like me to look at anything in particular?
[11:14:11] Yes, I see. `install_console` doesn't work and cannot log in as root over ipmi. Tricksy. We could boot in recovery mode and test the networking?
[11:14:59] btullis: we are talking about it in the IRC k8s channel, Arzhel/Alex are looking into it
[11:15:04] seems that only ipv6 works
[11:15:14] Ack, thanks.
[12:33:13] Data-Engineering-Planning, Voice & Tone: Rename geoeditors_blacklist_country - https://phabricator.wikimedia.org/T259804 (JArguello-WMF)
[12:33:24] Data-Engineering: Define a list of exactly which alerts should page the Analytics team in VictorOps - https://phabricator.wikimedia.org/T296552 (JArguello-WMF)
[12:33:36] Data-Engineering-Icebox, Observability-Logging, Wikimedia-Logstash: Evaluate storing logs from applications in yarn with the typical logging infrastructure - https://phabricator.wikimedia.org/T300937 (JArguello-WMF)
[12:35:46] Analytics-Kanban, Data-Engineering, Patch-For-Review: Deploy an-test-coord1002 to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (JArguello-WMF)
[12:36:25] Data-Engineering: Enforce authentication for Kafka Jumbo Topics - https://phabricator.wikimedia.org/T255543 (JArguello-WMF)
[12:36:43] Data-Engineering: Upgrade Druid to latest upstream (> 0.20.1) - https://phabricator.wikimedia.org/T278056 (JArguello-WMF)
[12:37:17] Data-Engineering: Review recurrent Hadoop worker disk saturation events - https://phabricator.wikimedia.org/T265487 (JArguello-WMF)
[12:37:56] Data-Engineering: Enforce authentication for Druid datasources - https://phabricator.wikimedia.org/T255545 (JArguello-WMF)
[12:38:02] Data-Engineering-Radar, Infrastructure-Foundations, SRE, User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (JArguello-WMF)
[12:38:06] Data-Engineering: Verify if Turnilo can pull data from Druid using Kerberos/TLS - https://phabricator.wikimedia.org/T250485 (JArguello-WMF)
[12:38:15] Data-Engineering, Infrastructure-Foundations, SRE, netops: Automate ingestion of netflow event stream - https://phabricator.wikimedia.org/T248865 (JArguello-WMF)
[12:38:45] Data-Engineering, Data-Persistence-Backup: Evaluate possible solutions to
backup Analytics Hadoop's HDFS data - https://phabricator.wikimedia.org/T277015 (JArguello-WMF)
[12:39:04] Analytics-Radar, Data-Engineering-Planning, Event-Platform Value Stream, SRE, serviceops-radar: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (JArguello-WMF)
[12:39:11] Data-Engineering-Icebox, Data-Persistence-Backup: Implement production zookeeper backups - https://phabricator.wikimedia.org/T274808 (JArguello-WMF)
[12:39:16] Data-Engineering: Set yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds - https://phabricator.wikimedia.org/T269616 (JArguello-WMF)
[12:39:20] Data-Engineering: Verify if Superset can authenticate to Druid via TLS/Kerberos - https://phabricator.wikimedia.org/T250487 (JArguello-WMF)
[12:39:24] Data-Engineering: hdfs password file for mysql should be re-generated when the password file is changed by puppet - https://phabricator.wikimedia.org/T170162 (JArguello-WMF)
[12:39:28] Data-Engineering-Icebox: Automate refinery jar cleanup - https://phabricator.wikimedia.org/T159337 (JArguello-WMF)
[12:47:16] Data-Engineering, VisualEditor, WMDE-TechWish, Editing-team (Tracking): Investigate missing dialog close events - https://phabricator.wikimedia.org/T272020 (JArguello-WMF)
[12:52:01] Data-Engineering, Project-Admins, PM: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (JArguello-WMF) Hi! Sorry for the delay. The team has discussed and decided that #analytics-data-quality should be archived, #analytics-clusters should be retagged as #shared-data-infrastructure,...
[14:12:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:17:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:18:19] (PS2) Clare Ming: Update metrics_event schema to 1.2.0 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891398 (https://phabricator.wikimedia.org/T330459)
[14:19:33] (CR) Clare Ming: Update metrics_event schema to 1.2.0 (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891398 (https://phabricator.wikimedia.org/T330459) (owner: Clare Ming)
[14:23:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:38:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:47:13] folks should we tune --^ with higher thresholds?
[14:47:17] like 90/95?
[14:57:55] Data-Engineering, Event-Platform Value Stream, SRE, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata)
[14:58:21] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (Ottomata) Done: {T330507}
[14:59:18] Data-Engineering, Event-Platform Value Stream, SRE, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata)
[15:30:21] Data-Engineering, Project-Admins, PM: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (Aklapper) Open→Resolved a:odimitrijevic→Aklapper Thank you! We're done, yay!
[15:38:23] elukey: I have the feeling that it will just delay the alert but will finally reach it looking at the slow memory increase happening on the old generation: https://grafana-rw.wikimedia.org/d/000000379/hive?orgId=1&from=now-90d&to=now&viewPanel=14
[15:38:23] Will add some GC logs there to see if there is some bad pattern that would lead to no mixed GC happening on it
[15:38:23] Perhaps adding the fact that old gc usage is also bigger than 70%
[15:48:03] elukey: nfraison: Open ticket for it https://phabricator.wikimedia.org/T303168 - would be great to fix but I haven't had time so far.
[15:48:49] btullis: ack
[15:59:22] nfraison: definitely yes, +1
[16:00:46] Data-Engineering, Patch-For-Review: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (nfraison) Another bad pattern to look at is the leak on MetaSpace (Non heap): {F36869496} Will need to add -XX:MaxMetaspaceSize JVM parameter as this space is not bo...
[16:29:57] Data-Engineering, Abstract Wikipedia team, DiscussionTools, Growth-Team, and 7 others: Update existing foreign-resources.yaml files to add extra fields - https://phabricator.wikimedia.org/T330432 (Reedy) In progress→Resolved a:Reedy
[17:29:26] (PS1) Neil P. Quinn-WMF: Update Wikipedia Preview ETL to extract instrumentation version [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/891866 (https://phabricator.wikimedia.org/T328703)
[17:38:57] (CR) Sbisson: [C: +1] Update Wikipedia Preview ETL to extract instrumentation version [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/891866 (https://phabricator.wikimedia.org/T328703) (owner: Neil P. Quinn-WMF)
[19:23:21] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate import_cirrus_indexes.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329873 (EBernhardson) a:EBernhardson
[19:25:55] Data-Engineering: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset - https://phabricator.wikimedia.org/T330436 (mpopov)
[19:29:31] Data-Engineering: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset - https://phabricator.wikimedia.org/T330436 (mpopov) @KinneretG: Once it's loaded into Druid, everyone with Turnilo/Superset access will be able to access this data without any additional permissions :) B...
[19:30:21] Data-Engineering: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset - https://phabricator.wikimedia.org/T330436 (mpopov)
[19:34:19] Analytics, Data-Engineering, Event-Platform Value Stream, EventStreams, and 2 others: Expose rdf-streaming-updater.mutation content through EventStreams - https://phabricator.wikimedia.org/T294133 (dcausse)