[00:20:15] Data-Engineering, Project-Admins: Create Project tag for Superset - https://phabricator.wikimedia.org/T298575 (odimitrijevic)
[00:23:42] Data-Engineering, Data-Engineering-Kanban, Data-Services, cloud-services-team (Kanban): Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 (odimitrijevic) p:Triage→Medium
[00:25:08] PROBLEM - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:35:45] Data-Engineering: Create Analytics Network Diagram & Documentation - https://phabricator.wikimedia.org/T298577 (odimitrijevic)
[00:37:16] Data-Engineering: Create Analytics Network Diagram & Documentation - https://phabricator.wikimedia.org/T298577 (odimitrijevic) @BTullis We discussed this a while back. Adding the task given the upcoming planning as well as https://phabricator.wikimedia.org/T298087
[00:39:14] Data-Engineering, Data-Engineering-Kanban: Create Analytics Network Diagram & Documentation - https://phabricator.wikimedia.org/T298577 (odimitrijevic) p:Triage→High
[00:56:43] Analytics, Data-Engineering, Readers-Web-Backlog (Needs Prioritization (Tech)), Wikimedia-production-error: eventgate_validation_error: '.web_session_id' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T297521 (odimitrijevic) The error starts on 12/01/2021 - were ther...
[00:57:06] Analytics-Radar, Readers-Web-Backlog (Needs Prioritization (Tech)), Wikimedia-production-error: eventgate_validation_error: '.web_session_id' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T297521 (odimitrijevic)
[01:06:23] Data-Engineering, Project-Admins: Create project tag for Data-Engineering-Radar - https://phabricator.wikimedia.org/T298580 (odimitrijevic)
[01:07:06] Data-Engineering, Project-Admins: Create project tag for Data-Engineering-Radar - https://phabricator.wikimedia.org/T298580 (odimitrijevic) Can the herald to remove the "Data-Engineering" tag when "Data-Engineering-Radar" be implemented with the same time?
[03:53:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[04:08:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[04:08:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[04:13:57] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[05:56:27] Data-Engineering, Project-Admins: Create project tag for Data-Engineering-Radar - https://phabricator.wikimedia.org/T298580 (Aklapper) Open→Resolved a:Aklapper Requested public project #data-engineering-radar has been created: https://phabricator.wikimedia.org/project/view/5682/ (In case you...
[06:00:49] Analytics-Kanban, Product-Analytics, Superset, Tracking-Neverending: Superset Updates - https://phabricator.wikimedia.org/T211706 (Aklapper)
[06:01:45] Data-Engineering, Project-Admins: Create Project tag for Superset - https://phabricator.wikimedia.org/T298575 (Aklapper) Open→Resolved a:Aklapper Requested public project #Superset has been created: https://phabricator.wikimedia.org/project/view/5683/ (In case you need to edit the project its...
[06:59:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[07:00:27] Data-Engineering: Users without any grants in dbstore - https://phabricator.wikimedia.org/T298589 (Ladsgroup)
[07:19:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[08:02:53] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:24:19] Data-Engineering, User-Ladsgroup: Users without any grants in dbstore - https://phabricator.wikimedia.org/T298589 (Ladsgroup) Open→Resolved a:Ladsgroup It looks needed.
[08:25:03] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:04:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[10:09:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[10:48:00] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:10:32] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:15:03] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) After the loading jobs that ended shortly after midnight UTC, the behaviour seems much impro...
[11:35:33] !log Upgrading hive packages on an-test-coord1001 to test log4j changes.
[11:35:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:37:06] !log Upgrading hive on an-test-client1001 in order to test log4j upgrade
[11:37:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:56:17] Data-Engineering, Data-Engineering-Kanban: Create Analytics Network Diagram & Documentation - https://phabricator.wikimedia.org/T298577 (BTullis) a:BTullis
[16:15:06] Data-Engineering, Generated Data Platform: Set up regular-repairs for AQS cassandra cluster tables - https://phabricator.wikimedia.org/T297944 (odimitrijevic) The cluster migration took about 2-3 months, and while little of that time was spent in active maintenance/work it was a long lasting background t...
[16:33:48] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (Eevans) Looking at the last 30days of heap usage for 1010-a... {F34908493} You can see that after t...
[17:01:21] Data-Engineering, Data-Engineering-Kanban: Create Analytics Network Diagram & Documentation - https://phabricator.wikimedia.org/T298577 (BTullis) I have begun working on this, starting with a physical layout diagram. https://drive.google.com/file/d/1G_ql8YQ0JPQVyebAk4YT7Z5ZE9RCWs8e/view?usp=sharing I w...
[17:09:41] joal: Hello! How were the holidays?
[17:09:56] Hi gehel :)
[17:10:05] holidays were great! how about you? had some?
[17:10:28] yep, a week and a bit. Just enough time to eat too much and reorganize the kids' rooms!
[17:10:50] David has an open patch on event-utilities: https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/751680
[17:11:16] looking at the access rights for that repo, it looks like analytics is the sole owner of that repo
[17:11:24] hm
[17:11:32] Will read :)
[17:11:45] would it make sense to give at least +2 access to the search platform team, so that we don't need to bother you for changes there?
[17:12:15] I'd very much agree gehel, and I think Andrew would too :)
[17:12:21] That specific change is super minor, but there might be larger changes where you might want to be in the loop.
[17:12:30] ack
[17:12:43] but you ofc trust us to let you know :)
[17:13:11] of course gehel, plus we could be automatic reviewers, in order to be mentioned :)
[17:13:20] yep
[17:14:20] joal: https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/751542 (not sure if you can merge or if we need someone from releng to +2)
[17:15:12] and thanks!
[17:15:34] gehel: I just +2ed on CR - Do I need to +2 verify and manually merge?
[17:15:56] I think so. I doubt we have jenkins checking the gerrit config changes
[17:16:04] ack will do
[17:16:32] gehel: merged!
[17:17:01] it works! I just merged dcausse's change
[17:17:05] \o/
[17:17:11] thanks :)
[17:17:17] you're welcome gehel :)
[17:17:25] thanks!
[17:17:54] Hey dcausse! hello :)
[17:18:01] hello!
[17:18:12] and obviously gehel and dcausse: Happy new year :)
[17:18:18] happy new year!
[17:18:57] happy Easter!
[17:19:10] you know lapinou, of course?
[17:19:46] lapinou?
[17:19:52] lapinou year!
[17:20:09] #illshowmyselfout
[17:20:17] ouch :/
[17:21:27] :)
[17:36:32] (CR) Razzi: [V: +2 C: +2] Upgrade Superset to version 1.3.2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/745212 (https://phabricator.wikimedia.org/T295983) (owner: Btullis)
[17:46:11] (PS2) DCausse: rdf_streaming_updater: add a reconcile event schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/740109 (https://phabricator.wikimedia.org/T279541)
[18:03:11] Data-Engineering, Generated Data Platform: Set up regular-repairs for AQS cassandra cluster tables - https://phabricator.wikimedia.org/T297944 (odimitrijevic) Cassandra cluster table size info: https://grafana.wikimedia.org/d/kUVKEvaWz/cassandra-storage?viewPanel=51&orgId=1&from=now-24h&to=now&var-dataso...
[18:09:05] Data-Engineering, Generated Data Platform: Set up regular-repairs for AQS cassandra cluster tables - https://phabricator.wikimedia.org/T297944 (odimitrijevic) p:Triage→High Marking as high given the timeliness of the decision as we switch to the new cluster. @JAllemandou can you please share inf...
[18:15:10] Analytics-Clusters, Data-Engineering, Superset, Patch-For-Review: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (odimitrijevic)
[18:16:03] Analytics, Data-Engineering, Data-Engineering-Kanban, Superset: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (odimitrijevic)
[18:17:38] Data-Engineering, Data-Engineering-Kanban, Superset, Epic: Presto/Superset User Experience Improvement - https://phabricator.wikimedia.org/T294259 (odimitrijevic)
[18:26:53] Analytics, Data-Engineering, Inuka-Team, Product-Analytics, Superset: Superset timeouts for KaiOS dashboard - https://phabricator.wikimedia.org/T277320 (odimitrijevic)
[18:27:24] Data-Engineering, Superset: Superset annotation text overlaps illegibly - https://phabricator.wikimedia.org/T279738 (odimitrijevic)
[18:30:37] Data-Engineering, Superset: Fix the LDAP integration and Superset user account creation. - https://phabricator.wikimedia.org/T298647 (odimitrijevic)
[18:31:13] Data-Engineering, Superset: Fix the LDAP integration and Superset user account creation. - https://phabricator.wikimedia.org/T298647 (odimitrijevic) p:Triage→Medium
[18:32:17] Analytics, Data-Engineering, Data-Engineering-Kanban, User-razzi: Presto error in Superset - https://phabricator.wikimedia.org/T292879 (odimitrijevic) Open→Resolved
[18:32:26] Analytics, Data-Engineering, Data-Engineering-Kanban, User-razzi: Presto error in Superset - https://phabricator.wikimedia.org/T292879 (odimitrijevic) Opened https://phabricator.wikimedia.org/T298647 as followup task.
[18:33:32] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Update refinery gobblin jars to use new gobblin-wmf jars and update puppet gobblin jobs - https://phabricator.wikimedia.org/T297939 (odimitrijevic) p:Triage→Medium
[18:34:59] (CR) Andrew Bogott: [C: +1] Add prometheus metrics [analytics/quarry/web] - https://gerrit.wikimedia.org/r/750448 (owner: Majavah)
[18:44:01] (CR) Majavah: [C: +2] Add prometheus metrics [analytics/quarry/web] - https://gerrit.wikimedia.org/r/750448 (owner: Majavah)
[18:47:43] (Merged) jenkins-bot: Add prometheus metrics [analytics/quarry/web] - https://gerrit.wikimedia.org/r/750448 (owner: Majavah)
[19:15:32] Data-Engineering, Data-Services, cloud-services-team (Kanban): Some wikibase tables not available in commonswiki_p - https://phabricator.wikimedia.org/T298452 (nskaggs)
[19:16:06] !log Rerun failed refine_eventlogging_analytics for mobilewikiappiosuserhistory schema, hours 2022-01-04T1[5789]:00:00, dropping malformed rows as discussed with schema owner
[19:16:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:21:32] Data-Engineering, Data-Engineering-Kanban: Send cassandra3 (new hosts) logs to logstash - https://phabricator.wikimedia.org/T297460 (odimitrijevic)
[20:03:56] ok team, logging off for today - see you tomorrow!
[20:53:13] Analytics-Radar, Data-Engineering-Radar, Event-Platform, Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (odimitrijevic)
[21:40:40] Analytics-Radar, Readers-Web-Backlog (Needs Prioritization (Tech)), Wikimedia-production-error: eventgate_validation_error: '.web_session_id' should NOT be shorter than 20 characters - https://phabricator.wikimedia.org/T297521 (Jdlrobson) > The error starts on 12/01/2021 - were there any changes depl...
[21:43:57] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (Eevans) A heap dump is likely to be the best means of identifying what is holding up all of this memo...
[22:56:09] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (odimitrijevic) Could we give the upgrade a try to see if it resolves the memory leak, and if not only...
[23:13:06] Data-Engineering, Data-Engineering-Kanban, User-razzi: Apache atlas build fails due to expired certificate (https://maven.restlet.com) - https://phabricator.wikimedia.org/T297841 (odimitrijevic) > You might be able to workaround this by manually downloading and then uploading this dependency to our a...
[23:15:22] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Add Jenkins job for gobblin-wmf jar release to archiva - https://phabricator.wikimedia.org/T297938 (odimitrijevic) p:Triage→Medium
[23:34:09] Data-Engineering, Data-Engineering-Kanban: Requesting Kerberos Identity - https://phabricator.wikimedia.org/T297114 (odimitrijevic) Open→Resolved a:odimitrijevic