[00:00:54] Data-Engineering, Cassandra, Research: Create keyspace and table for Knowledge Gaps - https://phabricator.wikimedia.org/T340494 (Eevans)
[00:46:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:51:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:37:28] Analytics-Radar, Growth-Team, Browser-Support-Apple-Safari: Investigate how Safari in iOS 17 and macOS Sonoma will impact URLs generated in Wikimedia sites - https://phabricator.wikimedia.org/T338571 (DLynch) I saw [a news article today](https://www.macrumors.com/guide/ios-17-privacy-security/) which...
[04:46:46] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:51:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:56:50] Data-Engineering, Data-Platform-SRE, Discovery-Search, Event-Platform Value Stream: Set up multi DC Kafka stretch cluster - https://phabricator.wikimedia.org/T340492 (elukey) A very cool name could be "Octopus" (tentacles spreading in multiple dcs at once).
[07:14:56] !log `sudo kill `pgrep -u paramd`` on stat1005 to unblock puppet
[07:14:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:53:42] (CR) Aqu: "Hello, thanks for the patch. The configuration is now here:" [analytics/refinery] - https://gerrit.wikimedia.org/r/930895 (https://phabricator.wikimedia.org/T339805) (owner: Urbanecm)
[07:55:35] Data-Engineering-Planning, Data Pipelines, Privacy Engineering, Research, Patch-For-Review: Add cswiki to clickstream - https://phabricator.wikimedia.org/T339805 (Antoine_Quhen) Hello @lbowmaker , for this ticket, a patch has already been proposed.
[08:01:20] (CR) Aqu: [V: +2 C: +2] "Thanks all." [analytics/refinery] - https://gerrit.wikimedia.org/r/929723 (https://phabricator.wikimedia.org/T338033) (owner: Aqu)
[08:06:38] (PS3) Aqu: Fix doc of create_disallowed_cassandra_articles_table [analytics/refinery] - https://gerrit.wikimedia.org/r/906707 (https://phabricator.wikimedia.org/T333950)
[08:29:01] I'm planning on trying to upgrade an-test-worker100[2-3] to bullseye today, in sequence.
[08:37:26] ack btullis - thanks for the heads up
[08:38:45] !log revoked puppet cert for 'varnishkafka' and cleaned up its cergen's files in puppet private
[08:38:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:39:42] Data-Engineering, Data-Platform-SRE, SRE, Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (elukey) Open→Resolved a:elukey
[08:41:51] Analytics-Radar, Patch-For-Review, User-TheDJ: Remove old origin-when-crossorigin Safari misspelling of referrer policy - https://phabricator.wikimedia.org/T338183 (TheDJ) p:Triage→Low a:TheDJ
[08:46:02] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (BTullis) I've made some good progress, but I'm currently stuck trying to get the `datahub-main-nocode-migration-job` to talk to `datahub-gms-main-tls-service.datahub.svc.cluster.local` on port 8501...
[08:48:22] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (Stevemunene) >>! In T317861#8963868, @Stevemunene wrote: > During the decommissioning of analytics106[1-3], we noticed that even after Excluding the hosts from...
[08:51:46] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:52] Data-Engineering: Check home/HDFS leftovers of neilpquinn-wmf - https://phabricator.wikimedia.org/T340524 (MoritzMuehlenhoff)
[09:13:13] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 14 B): Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (gmodena) a:gmodena→None
[09:15:05] Analytics-Radar, Growth-Team, Browser-Support-Apple-Safari: Investigate how Safari in iOS 17 and macOS Sonoma will impact URLs generated in Wikimedia sites - https://phabricator.wikimedia.org/T338571 (TheDJ) I'm really curious how this will pan out, because the difference between a tracking id and an...
[09:21:03] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-test-worker1002.eqiad.wmnet with OS bullseye
[09:21:20] !log upgrading an-test-worker1002 to bullseye, keeping `/srv/hadoop` intact
[09:21:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:36:20] Hi team - I plan to deploy refinery, airflow, and some Puppet (with btullis or steve's help) today - Please review/update https://etherpad.wikimedia.org/p/analytics-weekly-train to add your stuff if you want me to deploy it :)
[09:37:32] Ack joal thanks.
[09:38:05] !log Exclude analytics1061_1069 from HDFS and YARN
[09:38:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:40:11] Ack, thanks joal.
I still had one question on https://gerrit.wikimedia.org/r/c/operations/puppet/+/919019 but I'm happy to merge (or for Steve to merge) and deploy if you like.
[09:40:43] aqu: Hi - I think the question from btullis is for you :) --^
[09:41:40] Also, there is a question about when we should upgrade the `analytics` Airflow instance to version 2.6.1 - It doesn't have to be today, but it's an option: https://gerrit.wikimedia.org/r/c/operations/puppet/+/933087
[09:42:07] !log !log run puppet on hadoop-masters this does a refresh of the hdfs nodes
[09:42:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:42:29] btullis: I'll follow you on the Airflow upgrade - Does it require any change from an airflow-dags perspective?
[09:44:11] joal: I believe that this is required: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/434 It is a workaround to ensure that airflow-dags works with both 2.5 and 2.6, without requiring a feature branch.
[09:45:54] ack btullis - I just read the change, and I'll support you if you wish we migrate today
[09:46:46] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1064:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:45] (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:56:46] (SystemdUnitFailed) firing: (7) hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:02:02] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-test-worker1002.eqiad.wmnet with OS bullseye completed...
[10:06:46] (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:21:18] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (JMeybohm) My understanding from the SLO of mw-page-content-change-enrich was that re-deploying the application (e.g. loosing the con...
[10:25:22] The reimage of an-test-worker1002 went well it seems.
stevemunene do you want to do the reimage of the final hadoop worker in the test cluster, or would you rather I do it?
[10:26:14] May I do it?
[10:26:28] cool
[10:35:02] stevemunene: Yep, feel free.
[10:39:41] joal: I have added two airflow-related MRs to the etherpad. I think that we can merge these and do a normal airflow deploy, to make sure that the compatibility works and that no hot-fixes are in place.
[10:39:59] ack btullis
[10:40:07] Then we can think about upgrading the instance(s) tomorrow or later still.
[10:41:35] btullis: do we really wish to deploy https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/385
[10:41:39] ?
[10:46:51] joal: Maybe not. Maybe we should just merge https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/434/diffs and then rebase !385 from it. That should make !385 a noop, as far as the DAGs are concerned, I think.
[10:48:06] btullis: I indeed think !385 is to be merged/deployed during the upgrade, while !434 is noop normally, and helps the code to work before AND after upgrade
[10:49:01] btullis: I just moved the paragraph about the Airflow upgrade above in the deploy etherpad, to make them separate
[10:50:19] OK. I originally thought that if we merge the two together we would be best-prepared for the upgrade, but I'm happy to defer to your judgement. Most of the changes in !385 are about building the package, which is already done.
[10:51:49] In my understanding, !385 is about applying package-changes to the Airflow instance - This should be done at upgrade time IMO
[10:51:52] :)
[10:52:02] So, let's go with what I currently have!
[10:52:12] Deploy starts now with refinery
[10:53:55] joal: No, applying package changes to the airflow instance is done entirely with a puppet change. e.g.: https://gerrit.wikimedia.org/r/c/operations/puppet/+/933087/ - There is one puppet change like this per instance, ready to go.
[10:54:40] Anyway, all good.
Happy to go with your suggestion :-)
[10:54:46] Right btullis - But the !385 MR applies such "package-change" in the docs and conda-env description used by the host
[10:55:28] If I merge it now, we'll have a discrepancy between the airflow version actually in use and the documented version
[10:55:50] !log Deploy refinery using scap
[10:55:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:57:29] Yes, I see what you mean. +1
[10:57:48] ack btullis :)
[11:01:00] !log upgrading an-test-worker1003 to bullseye, keeping `/srv/hadoop` intact
[11:01:01] PROBLEM - Hadoop NodeManager on analytics1067 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:01:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:01:07] PROBLEM - Check systemd state on analytics1067 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:45] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye
[11:10:03] !log Deploy refinery onto HDFS
[11:10:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:22:06] btullis: Would you mind merging/deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/919019 please
[11:28:36] joal: Done. Would you like it pushed out immediately to all airflow instances, or is the 30 minute staggered deployment fine?
[11:34:03] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena) > My understanding from the SLO of mw-page-content-change-enrich was that re-deploying the application (e.g. loosing the co...
[11:52:18] RECOVERY - Hadoop NodeManager on analytics1067 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:56:46] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:21:41] Heya btullis - the 30 minutes staggered deploy has done its job I guess :)
[12:36:18] Data-Platform-SRE, SRE, ops-eqiad: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (Jclark-ctr) Open→Resolved @BTullis Replaced Battery host is booting now. Thanks for your assistance
[12:39:26] Data-Engineering-Planning: Event Utilities partially downloads schemas - https://phabricator.wikimedia.org/T309717 (Ottomata) Alright! We are finally on Spark 3, and deployed the [[ https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/808978/1/refinery-spark/src/main/scala/org/wikimedia/analytics/ref...
[12:46:04] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (CodeReviewBot) gmodena merged https://gitlab.wiki...
[12:46:08] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mediawiki-event-enrichment: changes to test image seem to be ignored in CI - https://phabricator.wikimedia.org/T340195 (CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_request...
[12:47:10] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (Ottomata) >> The wikitech doc now says that we (as in service ops) are required to save and restore the state. > This would only be...
[12:51:22] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): mw-page-content-change-enrich should enable HA with k8s ConfigMaps - https://phabricator.wikimedia.org/T338233 (gmodena) > By cluster updates, you mean kubernetes cluster, right? Correct; k8s clusters. Flink (cluster) updates should not requi...
[12:53:07] Data-Engineering-Planning, Data Pipelines: Event Platform canary events job occasionally fails to retrieve stream config settings - https://phabricator.wikimedia.org/T326002 (Ottomata) I think we can close this? In {T330236} we merged and deployed https://gerrit.wikimedia.org/r/c/analytics/refinery/sour...
[12:53:36] Data-Engineering-Planning, Data Pipelines: Event Platform canary events job occasionally fails to retrieve stream config settings - https://phabricator.wikimedia.org/T326002 (Ottomata) BTW, I don't think this is related to {T309717} anymore.
[12:54:12] aqu: I confirm the datahub-error in airflow is gone since the puppet patch got released - Thank you so much for this :)
[12:54:37] Ok, now, deploying airflow (this is today's big beast)
[12:55:26] Data-Engineering-Planning: Event Utilities partially downloads schemas - https://phabricator.wikimedia.org/T309717 (Ottomata) > The conditional checks if the type is Java null (meaning not present), or if it is set to JSONSchema "null". Wait, no, the check that throws this error is: `lang=java if (schemaTyp...
[12:55:33] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (Gehel)
[12:56:00] Data-Engineering-Planning, Data-Platform-SRE, Event-Platform Value Stream, Discovery-Search (Current work), Epic: Flink Operations - https://phabricator.wikimedia.org/T328561 (Gehel)
[12:56:04] Data-Engineering: Event Utilities partially downloads schemas - https://phabricator.wikimedia.org/T309717 (Ottomata)
[12:56:20] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Find why the wdqs-updater service can't start after data-transfer and fix it - https://phabricator.wikimedia.org/T339368 (Gehel)
[12:57:27] Data-Platform-SRE, Prod-Kubernetes, Wikidata, Wikidata-Query-Service, and 3 others: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes - https://phabricator.wikimedia.org/T293063 (Gehel)
[12:57:35] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Make WCQS/WDQS data transfer cookbook more reliable - https://phabricator.wikimedia.org/T321605 (Gehel)
[12:57:47] Data-Platform-SRE, Elasticsearch, Discovery-Search (Current work), Sustainability (Incident Followup): Switching search traffic between datacenters should be faster -
https://phabricator.wikimedia.org/T143553 (Gehel)
[12:58:26] Data-Engineering: Event Utilities partially downloads schemas - https://phabricator.wikimedia.org/T309717 (JAllemandou) /me would love to see aunit-test replicating this :)
[12:58:36] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Gehel)
[12:58:55] Data-Platform-SRE, Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (Gehel)
[13:05:03] (PS1) Ottomata: JsonSchemaConverter - log full JSONSchema when converting to Spark fails [analytics/refinery/source] - https://gerrit.wikimedia.org/r/933456 (https://phabricator.wikimedia.org/T309717)
[13:06:20] aqu: Hi! I need your help please :)
[13:07:33] Yes
[13:07:46] aqu: batcave quickly?
[13:16:02] Data-Engineering, Event-Platform Value Stream (Sprint 14 B): Update eventgate and eventstreams helm chart to use automatic kafka egress networkpolicies and envoy service mesh - https://phabricator.wikimedia.org/T335024 (JArguello-WMF)
[13:25:01] !log Deploy Airflow
[13:25:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:25:10] Data-Platform-SRE, Discovery-Search: Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (Gehel)
[13:25:24] Data-Platform-SRE, Discovery-Search: Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (Gehel)
[13:25:26] Data-Engineering-Planning, Data-Platform-SRE, Event-Platform Value Stream, Discovery-Search (Current work), Epic: Flink Operations - https://phabricator.wikimedia.org/T328561 (Gehel)
[13:25:31] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to
flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (Gehel)
[13:26:36] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye execu...
[13:27:57] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-test-worker1003.eqiad.wmnet with OS bullseye
[13:32:06] !log druid_load_pageviews_hourly_aggregated_dailyRerun
[13:32:06] Schedule: @daily info Next Run: 2023-06-27, 00:00:00
[13:32:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:32:22] !log Rerun druid_load_pageviews_hourly_aggregated_daily after deploy
[13:32:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:34:28] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (Ottomata)
[13:39:40] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (Ottomata)
[13:39:55] trying here: ping mforns?
[13:42:40] (CR) Joal: [C: +1] JsonSchemaConverter - log full JSONSchema when converting to Spark fails [analytics/refinery/source] - https://gerrit.wikimedia.org/r/933456 (https://phabricator.wikimedia.org/T309717) (owner: Ottomata)
[13:46:35] (CR) Joal: fix the metric query. Strip TLDs from domain projects.
(3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[13:47:55] aqu: Should I merge https://gerrit.wikimedia.org/r/c/analytics/refinery/+/906707 ?
[13:48:31] Yes please.
[13:48:37] ack, doing so
[13:48:44] (CR) Joal: [V: +2 C: +2] "LGTM! Merging" [analytics/refinery] - https://gerrit.wikimedia.org/r/906707 (https://phabricator.wikimedia.org/T333950) (owner: Aqu)
[13:54:59] joal: hey!
[13:55:56] (CR) Ottomata: [C: +2] JsonSchemaConverter - log full JSONSchema when converting to Spark fails [analytics/refinery/source] - https://gerrit.wikimedia.org/r/933456 (https://phabricator.wikimedia.org/T309717) (owner: Ottomata)
[14:03:02] Data-Engineering, Patch-For-Review: Event Utilities partially downloads schemas - https://phabricator.wikimedia.org/T309717 (Ottomata) Added more logging, and scheduled it for next week's train: https://etherpad.wikimedia.org/p/analytics-weekly-train > would love to see aunit-test replicating this :) I...
[14:43:11] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (Ottomata)
[14:53:48] !log deployed airflow analytics to unbreak DataHub's Druid ingestion
[14:53:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:05:32] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (Ottomata)
[15:06:11] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B): Improve Event Platform and MediaWiki Event Enrichment wikitech documentation - https://phabricator.wikimedia.org/T329629 (Ottomata) - Moved decision log to wikitech: https://wikitech.wikimedia.org/wiki/Event_Platform/Decision_Log - Ad...
[15:08:34] Data-Platform-SRE, Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (bking) Looking at [[ https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?from=1687816800000&orgId=1&to=1687823999000&var-clust...
[15:29:55] Data-Platform-SRE: Restart buster query service hosts (wdqs/wcqs) to apply java8 sec upgrades - https://phabricator.wikimedia.org/T340482 (RKemper)
[15:30:18] Data-Platform-SRE: Restart buster query service hosts (wdqs/wcqs) to apply java8 sec upgrades - https://phabricator.wikimedia.org/T340482 (RKemper) This should be done, but I haven't yet ran a validation command to sanity check that the correct version is in place.
[15:37:51] Data-Platform-SRE, Discovery-Search: Determine whether or not to change CPU frequency governor on Search Platform-owned hosts - https://phabricator.wikimedia.org/T340554 (bking)
[15:43:25] Data-Platform-SRE, Discovery-Search (Current work), Sustainability (Incident Followup): Create Turnilo/Superset dashboards for identifying users w/ excessive WDQS queries - https://phabricator.wikimedia.org/T338159 (Gehel)
[15:46:05] Data-Platform-SRE, Discovery-Search (Current work), Sustainability (Incident Followup): Create Turnilo/Superset dashboards for identifying users w/ excessive WDQS queries - https://phabricator.wikimedia.org/T338159 (Gehel) p:Triage→High
[15:46:07] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (Gehel) p:Triage→Low
[15:46:26] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Gehel) p:Triage→High
[15:46:58] Data-Platform-SRE, SRE-OnFire, Wikidata, Wikidata-Query-Service, and 3 others: Update WDQS
Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (Gehel) p:Triage→High
[15:56:46] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:58:34] Data-Platform-SRE, Data Pipelines (Sprint 14): Upgrade Presto to release that aligns with Iceberg 1.2.1 - https://phabricator.wikimedia.org/T337335 (BTullis) Open→Resolved
[15:59:53] Data-Engineering-Planning, Data Pipelines (Sprint 14), Patch-For-Review: Add Python Linter Checks to CI - https://phabricator.wikimedia.org/T318346 (Antoine_Quhen) Documentation added here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Developer_guide
[16:50:04] Data-Engineering, Event-Platform Value Stream, SRE, serviceops, Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (Ottomata)
[16:50:33] Data-Engineering, Event-Platform Value Stream, SRE, serviceops, Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (Ottomata)
[17:00:41] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines (Sprint 14), Patch-For-Review: Wikistats Bug: Small countries not displayed on the map - https://phabricator.wikimedia.org/T338033 (Antoine_Quhen) Open→Resolved 3 of our dataset are now going to use `canonical.countries.is_prote...
[17:10:14] (PS12) Nick Ifeajika: Minor fix: Remove the `_del` field from the query.
It is automatically added in cassandra [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059)
[17:17:09] (CR) Joal: "Still two things (commit message, and commented code)" [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[17:27:56] (PS13) Nick Ifeajika: Minor fix: Remove the `_del` field from the query. It is automatically added in cassandra [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059)
[17:29:58] (PS14) Nick Ifeajika: Minor fix: Remove the `_del` field from the query. It is automatically added in cassandra [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059)
[17:37:45] (PS15) Nick Ifeajika: Add query to load data from knowledge_gap.content_gap_metrics to cassandra [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059)
[17:39:39] (CR) Joal: [C: +1] "LGTM!
Letting Dan close his comment and merge as needed" [analytics/refinery] - https://gerrit.wikimedia.org/r/914799 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[18:04:09] Data-Engineering, Cassandra, Research: Create keyspace and table for Knowledge Gaps - https://phabricator.wikimedia.org/T340494 (leila)
[18:04:20] Data-Engineering, Cassandra: Create keyspace and table for Knowledge Gaps - https://phabricator.wikimedia.org/T340494 (leila)
[18:07:27] (PS1) Nick Ifeajika: Rebase on AQS to accomodate a few last minute changes to AQS [analytics/aqs] - https://gerrit.wikimedia.org/r/933603 (https://phabricator.wikimedia.org/T337059)
[18:12:06] (CR) CI reject: [V: -1] Rebase on AQS to accomodate a few last minute changes to AQS [analytics/aqs] - https://gerrit.wikimedia.org/r/933603 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[19:09:56] Data-Engineering-Planning, Data Pipelines, Privacy Engineering, Epic: Add more languages to Wikipedia Clickstream - https://phabricator.wikimedia.org/T289532 (leila) I'm going to remove this task from the Backlog lane of the #research board given that there is no task for Research here, yet. Once...
[19:09:59] Data-Engineering-Planning, Data Pipelines, Privacy Engineering, Epic: Add more languages to Wikipedia Clickstream - https://phabricator.wikimedia.org/T289532 (leila)
[19:13:16] Data-Engineering, Infrastructure-Foundations, Product-Analytics, SRE, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (leila)
[19:13:25] Data-Engineering, Infrastructure-Foundations, Product-Analytics, SRE, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (leila) I'm going to remove this task from the Backlog lane of the #Research board given that there is no task for Research...
[19:13:59] Data-Engineering: Update clickstream code to support more languages - https://phabricator.wikimedia.org/T292476 (leila)
[19:14:25] Data-Engineering: Update clickstream code to support more languages - https://phabricator.wikimedia.org/T292476 (leila) I'm going to remove this task from the Backlog lane of the #Research board given that there is no task for Research here, yet. Once prioritized, please reach out to us with a subtask and ad...
[19:32:03] Data-Engineering: Re-examine how internal search referrals are handled by Clickstream - https://phabricator.wikimedia.org/T292435 (leila)
[19:32:05] Data-Engineering: Re-examine how internal search referrals are handled by Clickstream - https://phabricator.wikimedia.org/T292435 (leila) I'm going to remove this task from the Backlog lane of the #Research board given that there is no task for Research here, yet. Please reach out to us with a subtask and ad...
[19:40:47] Analytics-Radar, Data-Engineering-Icebox, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (leila) @fkaelin as mentioned by cscott earlier, Enterprise now [[ https://dumps.wikimedia.org/other/enterprise_html/ | publishes ]] HTML dumps and we have...
[19:42:50] Data-Engineering, Research: Adding data from centralauth to the lake and the mediawiki_history dataset - https://phabricator.wikimedia.org/T282657 (leila) @Pablo can you give an update here about whether you need the output of this task for your work? (I'm reviewing tasks on our board and I'm not sure gi...
[19:45:08] Quarry, Security: Session cookie should have Secure and Domain flags - https://phabricator.wikimedia.org/T214636 (Armfns001) p:Low→Medium This is treated has a Medium vulnerability that needs remediation by 30 days.
[19:52:41] Quarry, Security: Session cookie should have Secure and Domain flags - https://phabricator.wikimedia.org/T214636 (Reedy) p:Medium→Low
[19:56:46] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:58:08] Quarry, Security: Session cookie should have Secure and Domain flags - https://phabricator.wikimedia.org/T214636 (Armfns001) Reedy- Is there an update on the timeline this ticket will be looked into? Thanks Balaji
[19:59:32] Quarry, Security: Session cookie should have Secure and Domain flags - https://phabricator.wikimedia.org/T214636 (rook) Am I the only one who finds humor in the suggestion that I 1615 day old ticket needs to be finished by 30 days?
[20:03:14] Quarry, Security: Session cookie should have Secure and Domain flags - https://phabricator.wikimedia.org/T214636 (Armfns001) Lol.. Let me provide some quick fire fuel to dust off. Thanks Balaji
[20:29:01] Data-Engineering-Planning, Event-Platform Value Stream, Patch-For-Review: Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854 (Ottomata) WIP patch for doing this ^. This does not implement conversion to the Ti...
[20:37:46] Data-Engineering, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: EventBus should set dt fields with greater precision than second - https://phabricator.wikimedia.org/T340067 (Ottomata)
[21:55:47] Data-Engineering, Data Pipelines, Patch-For-Review: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (nshahquinn-wmf) Thank you for working on this, @BTullis. I would love to see this completed!
For me, the Mamba solver has not only sped up install...
[22:48:39] Data-Platform-SRE, Discovery-Search (Current work), Sustainability (Incident Followup): Create Turnilo/Superset dashboards for identifying users w/ excessive WDQS queries - https://phabricator.wikimedia.org/T338159 (bking) We (as in @RKemper , @EBernhardson and myself ) added a dashboard during our p...
[23:57:01] (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on analytics1062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed