[09:29:32] (CR) Gmodena: [C:+2] "Self merging. This change only affects the `development` workspace." [schemas/event/primary] - https://gerrit.wikimedia.org/r/1021902 (https://phabricator.wikimedia.org/T351117) (owner: Gmodena)
[09:51:58] (PS8) Gmodena: hql: webrequest: add webrequest_frontend. [analytics/refinery] - https://gerrit.wikimedia.org/r/1012656 (https://phabricator.wikimedia.org/T314956)
[10:06:30] We're running a longer job on stat1009 this time, could end up taking several weeks. Please feel free to kill if needed, or write us.
[10:08:53] (CR) Btullis: [C:+2] Drop vestiges of git-fat [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/887000 (https://phabricator.wikimedia.org/T328473) (owner: Chad)
[10:08:56] (CR) Btullis: [V:+2 C:+2] Drop vestiges of git-fat [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/887000 (https://phabricator.wikimedia.org/T328473) (owner: Chad)
[10:09:27] (CR) Btullis: [V:+2 C:+2] "Done" [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/887000 (https://phabricator.wikimedia.org/T328473) (owner: Chad)
[10:14:36] (PS1) Btullis: Remove old hadoop coordinators from scap targets [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/1023393 (https://phabricator.wikimedia.org/T353774)
[10:20:45] (CR) Btullis: [V:+2 C:+2] Remove old hadoop coordinators from scap targets [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/1023393 (https://phabricator.wikimedia.org/T353774) (owner: Btullis)
[10:44:25] Data-Engineering, Cassandra, Data-Persistence, Data-Platform-SRE: Encrypt Airflow connections to AQS Cassandra - https://phabricator.wikimedia.org/T362181#9734941 (JAllemandou) I'm not sure if the spark-cassandra-connector can read a Java Truststore on HDFS! I'd go for an automated deployment of...
[10:55:07] Data-Engineering (Q4 2024 April 1st - June 30th), Data Products: Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes - https://phabricator.wikimedia.org/T355588#9734991 (JAllemandou) a:JAllemandou
[10:55:37] (PS1) Joal: Update sqoop and schema of pagelink table [analytics/refinery] - https://gerrit.wikimedia.org/r/1023399 (https://phabricator.wikimedia.org/T345771)
[10:55:38] awight: Thanks for the heads-up.
[11:38:22] btullis: I'm sure you have a long review backlog but I wanted to highlight this patch which is connected to my running job: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023085
[11:41:52] awight: Looks good to me. Merged. If you could follow it up with a patch to remove it, but mark it WIP or something, that will make it easier to remember to tidy it up. Thanks.
[11:44:22] Good idea, done!
[11:45:00] Thanks.
[12:13:38] !log deploy conda-analytics v 0.0.29 to hadoop test cluster T362648
[12:13:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:13:41] T362648: Rebuild conda-analytics container on Bullseye - https://phabricator.wikimedia.org/T362648
[12:30:34] btullis: How long do you think it should take until the new prometheus metrics land on thanos?
[12:34:35] awight: Hmm. The file exists on prometheus1006 already.
[12:34:38] https://www.irccloud.com/pastebin/Hy0ot1vZ/
[12:34:45] Oh, maybe it's a firewall issue.
[12:36:02] !log deploy conda-analytics v 0.0.29 to analytics hadoop workers T362648
[12:36:04] There isn't a script to get prometheus1006 to reload its targets?
[12:36:05] Interesting, IPv6 failed. IPv4 v4 succeeded.
[12:36:09] https://www.irccloud.com/pastebin/zzqmeneT/
[12:36:46] ouch, probably an app-level issue. but hopefully prom will fallback/default to ipv4
[12:37:35] And yes, possibly it doesn't auto-reload targets.
[12:37:38] https://www.irccloud.com/pastebin/On7rE5wq/
[12:37:46] I'll check now.
[12:41:10] very nice, TIL https://prometheus-eqiad.wikimedia.org/analytics/ . but it's down, maybe because of a service reload...
[12:43:10] Uh oh. Maybe I broke it.
[12:44:06] !log deploy conda-analytics v 0.0.29 to analytics hadoop coordinator T362648
[12:46:12] I have asked in #wikimedia-observability
[12:50:03] !log deploy conda-analytics v 0.0.29 to analytics stat hosts T362648
[12:50:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:50:06] T362648: Rebuild conda-analytics container on Bullseye - https://phabricator.wikimedia.org/T362648
[12:59:52] !log deploy conda-analytics v 0.0.29 to analytics-airflow hosts T362648
[12:59:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:59:57] T362648: Rebuild conda-analytics container on Bullseye - https://phabricator.wikimedia.org/T362648
[12:59:59] hey folks!
[13:00:07] if you are ok I'd move aqs-eqiad to PKI
[13:21:36] proceeding :)
[13:31:32] Data-Engineering, Data-Platform, Wikidata, Wikidata-Query-Service: WDQS updater missed some updates - https://phabricator.wikimedia.org/T362977#9735577 (lbowmaker)
[13:45:03] roll restart in progress, so far all good
[14:07:01] Data-Engineering, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 9 others: Upgrade mobileapps to node 18 - https://phabricator.wikimedia.org/T363168 (Jgiannelos) NEW
[14:07:16] Data-Engineering, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 9 others: Upgrade mobileapps to node 18 - https://phabricator.wikimedia.org/T363168#9735740 (Jgiannelos) a:Jgiannelos
[14:08:06] Data-Engineering, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 9 others: Upgrade mobileapps to node 18 - https://phabricator.wikimedia.org/T363168#9735744 (Jgiannelos)
[14:33:41] (CR) Xcollazo: "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/1023399 (https://phabricator.wikimedia.org/T345771) (owner: Joal)
[14:36:07] elukey: Sorry I missed the ping. Please feel free to go ahead, retrospectively :-)
[14:36:49] thanks!! Almost done, hopefully reporting good news in ~30 mins
[15:05:48] aqs on pki!
[15:10:24] woohoo nicely done
[15:10:33] sorry. I missed all the daily pings
[15:10:49] Awesome.
[15:10:55] Thanks elukey.
[15:11:01] +!
[15:11:05] *+1
[15:15:50] last change for the final cleanup is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023453, but we can do it anytime
[15:22:50] Data-Engineering, Cassandra, Data-Persistence, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Encrypt Airflow connections to AQS Cassandra - https://phabricator.wikimedia.org/T362181#9736088 (Gehel)
[15:24:17] Data-Engineering, Cassandra, Data-Persistence, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Encrypt Airflow connections to AQS Cassandra - https://phabricator.wikimedia.org/T362181#9736106 (elukey) Filed a change for the stat nodes, the hadoop worker nodes already have the t...
[15:24:52] Data-Engineering, Cassandra, Data-Persistence, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Encrypt Airflow connections to AQS Cassandra - https://phabricator.wikimedia.org/T362181#9736108 (elukey) Also I confirm that AQS Cassandra runs now with PKI TLS certs, so we can star...
[15:45:30] Data-Engineering, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Migrate the matomo host to bookworm - https://phabricator.wikimedia.org/T349397#9736183 (jcrespo) Resolved→Open
[15:46:58] Data-Engineering, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Migrate the matomo host to bookworm - https://phabricator.wikimedia.org/T349397#9736182 (jcrespo) hi, backups of matomo database failed with: ` 2024-04-23 04:13:04 [ERROR] - Error connecting to database: Access denied...
[15:57:57] (PS1) Xcollazo: WIP: SQL queries that format the base Commons Impact Metrics datasets into the expected shape for Cassandra. [analytics/refinery] - https://gerrit.wikimedia.org/r/1023461 (https://phabricator.wikimedia.org/T358707)
[15:59:06] (KafkaReplicationFactorTooLow) firing: (958) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[16:00:53] (HdfsDataNodeHeapUsage) firing: Datanode heap usage on an-worker1154:51010 is above 90%. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Datanode_JVM_Heap_Usage - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&panelId=1&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DHdfsDataNodeHeapUsage
[16:04:06] (KafkaReplicationFactorTooLow) resolved: (958) Kafka topic DataHubUpgradeHistory_v1 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[16:05:53] (HdfsDataNodeHeapUsage) resolved: (2) Datanode heap usage on an-worker1133:51010 is above 90%. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Datanode_JVM_Heap_Usage - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&panelId=1&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DHdfsDataNodeHeapUsage
[16:07:23] (HdfsDataNodeHeapUsage) firing: (2) Datanode heap usage on an-worker1154:51010 is above 90%. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Datanode_JVM_Heap_Usage - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&panelId=1&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DHdfsDataNodeHeapUsage
[16:12:23] (HdfsDataNodeHeapUsage) resolved: (3) Datanode heap usage on an-worker1133:51010 is above 90%. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Datanode_JVM_Heap_Usage - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&panelId=1&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DHdfsDataNodeHeapUsage
[18:11:11] Data-Engineering, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Migrate the matomo host to bookworm - https://phabricator.wikimedia.org/T349397#9736967 (jcrespo) Open→Resolved Looking good now: {F48315862}
[18:19:36] Data-Engineering, Movement-Insights, Research-Freezer: Investigate relation of Prefetch feature to increase in automated traffic and impact on unique devices - https://phabricator.wikimedia.org/T336715#9736996 (Mayakp.wiki)
[18:24:21] Data-Engineering, Movement-Insights, Research-Freezer: Investigate relation of Prefetch feature to increase in automated traffic and impact on unique devices - https://phabricator.wikimedia.org/T336715#9737002 (Mayakp.wiki) We are leaning towards switching off the Prefetch feature since the fraction...
[19:23:19] (PS1) Mforns: Modify Commons Impact Metrics queries to ignore ancestor categories [analytics/refinery] - https://gerrit.wikimedia.org/r/1023491 (https://phabricator.wikimedia.org/T358699)
[19:37:39] (PS1) Mforns: Correctly apply distanceToPrimary in CommonsCategoryGraphBuilder [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1023492 (https://phabricator.wikimedia.org/T358699)
[19:46:36] (PS1) Xcollazo: Calculate deep counts for primary categories only. [analytics/refinery] - https://gerrit.wikimedia.org/r/1023494 (https://phabricator.wikimedia.org/T358681)
[21:03:40] (PS10) Aqu: Add CLI to create or update Iceberg tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762)
[21:03:50] (CR) CI reject: [V:-1] Add CLI to create or update Iceberg tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1016808 (https://phabricator.wikimedia.org/T356762) (owner: Aqu)
[22:01:24] (PS1) Gerrit maintenance bot: Add ka.wikisource to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/1022537 (https://phabricator.wikimedia.org/T363243)
[22:02:07] (PS1) Gerrit maintenance bot: Add kaa.wiktionary to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/1022540 (https://phabricator.wikimedia.org/T363256)
[22:02:52] (PS1) Gerrit maintenance bot: Add igl.wikipedia to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/1022543 (https://phabricator.wikimedia.org/T363263)
[22:03:18] (PS1) Gerrit maintenance bot: Add my.wikisource to pageview allowlist [analytics/refinery] - https://gerrit.wikimedia.org/r/1022545 (https://phabricator.wikimedia.org/T363270)
[22:44:33] Data-Engineering, EventStreams, stewardbots, Event-Platform: Frequent `429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange` errors in SULWatcher - https://phabricator.wikimedia.org/T329327#9738587 (bd808) >>! In T308931#7950582, @Ottomata wrote: > This...
[22:44:55] Data-Engineering, EventStreams, stewardbots, Toolforge, and 2 others: Frequent `429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange` errors in SULWatcher - https://phabricator.wikimedia.org/T329327#9738589 (bd808)