[00:30:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:39:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[06:34:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[06:35:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:33:25] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): [Flink Operation] How to handle app upgrades - https://phabricator.wikimedia.org/T328569 (gmodena) a:gmodena
[07:35:58] Data-Engineering, Machine-Learning-Team, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (elukey) @jbond I ran `systemctl reset-failed kafkatee.service` since the unit is marked as masked, IIRC we use only the `kafkatee-webrequest` unit in t...
[08:22:57] !log Failover hive servers to standby server: https://gerrit.wikimedia.org/r/c/operations/dns/+/892460
[08:22:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:17:08] Data-Engineering, Event-Platform Value Stream, serviceops, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (JMeybohm) @Ottomata could you please add an estimation of the compute resources this will require from...
[09:31:59] Data-Engineering, Patch-For-Review: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (nfraison) coord1001 hive metastore and hiveserver2 restarted.
[09:32:22] !log restarted hive-metastore and hiveserver2 on an-coord1001 (non-active hive server)
[09:32:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:38:42] !log Failover hive servers to active server: an-coord1001
[09:38:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:42:49] !log restart presto prod coordinator to take into account heap size change
[09:42:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:48:02] (CR) Gehel: [C: +1] "LGTM" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[09:52:18] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Product-Analytics, and 2 others: Analyze possible bot traffic for frwiki article Cookie (informatique) - https://phabricator.wikimedia.org/T313114 (hashar) Good news, the page `Cookie (informatique)` no more shows up the top reads of t...
[10:32:46] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Product-Analytics, and 2 others: Analyze possible bot traffic for frwiki article Cookie (informatique) - https://phabricator.wikimedia.org/T313114 (Urbanecm) >>! In T313114#8651820, @hashar wrote: > Good news, the page `Cookie (informa...
[10:39:27] We are about to start the upgrade of airflow on an-test-client1001
[10:39:51] We have disabled puppet on all airflow instances for now to make sure that they are all noops.
[10:43:22] btullis: o/ dse on k8s 1.23 :)
[10:43:37] elukey: Fabulous
[10:44:45] 👏
[10:44:53] gmodena: o/ The flink operator is now running on DSE, not sure what else needs to be deployed for the demo
[10:46:06] elukey ack! We need to redeploy our app; I'll take a look.
[10:46:23] cc ^ ottomata
[10:48:31] gmodena: ahh I see now, I can take care of it if you want
[10:49:47] done :)
[10:50:20] the flink pod is being created
[10:50:29] aaand running gmodena
[10:50:45] elukey ahaha. You are too fast :D. I was checking doc on how to deploy
[10:51:02] elukey thanks!
[10:51:06] np!
[10:51:11] I see a taskmanager and another pod
[10:52:23] elukey this sprint I would like to get a bit more hands on with deployments. Would it be ok if I try start/stop deploy the app myself? I'd like to do it in a time window where there are SREs or other responsible adults around :)
[10:52:44] elukey that tracks. The application should start a taskmanager and a jobmanager pod
[10:53:09] gmodena: of course! I am not sure how you folks do stop/start, I guess setting some values.yaml flags and deploy?
[10:53:15] if so the procedure is very simple
[10:53:20] 1) ssh to deploy1002
[10:54:05] 2) cd /srv/deployment-charts/helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment/
[10:54:07] elukey I started to document the lifecycle management process at https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_Value_Stream/Pyflink_Enrichment_Service_Deployment. Will move to wikitech this sprint.
[10:54:17] 3) helmfile -e dse-k8s-eqiad diff/sync
[10:54:45] ack super
[10:55:06] you can ping me/Tobias/Ben/etc.. if you need any support
[10:55:18] (we are sharing the support for DSE)
[10:55:23] elukey will do! Thanks
[10:56:25] q: what is best practice for altering a key in a value file? Do you usually tamper with configs locally, or go through CRs? E.g. one CR to change state from running to suspended, and a followup to set suspended -> running ?
[10:57:18] gmodena: in theory we use a gitops-like approach, so cr/merge/deploy.. but we can do things manually if there is an immediate need or stuff on fire
[10:59:17] elukey ack. gitops it is; just wanted to make sure I was following / documenting the right process.
[11:03:41] Data-Engineering, Machine-Learning-Team, Observability-Logging: centrallog1002: failed to start kafkatee - https://phabricator.wikimedia.org/T330654 (jbond) In progress→Resolved > Is there a reason to change it for this particular use case? (To better understand what's happening) no i think...
[11:15:42] Data-Engineering, Research-Backlog: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (Ladsgroup) @leila I totally agreed that it's a complex problem but I hope we can move forward with improving quality of our data and metric in some shape or form. Regardin...
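Editor's note on the three-step deploy procedure elukey outlined above (ssh to deploy1002, cd into the service's helmfile directory, then `helmfile -e dse-k8s-eqiad diff/sync`): the sketch below is one possible way to script steps 2 and 3, assuming it is run on the deployment host. The directory and environment name are copied from the chat; the function names, the `apply` and `dry_run` parameters, and the choice to always diff before syncing are illustrative assumptions, not the team's actual tooling.

```python
import shlex
import subprocess
from pathlib import Path

# Both values are quoted from the conversation above.
SERVICE_DIR = Path(
    "/srv/deployment-charts/helmfile.d/dse-k8s-services/"
    "mediawiki-page-content-change-enrichment"
)
ENVIRONMENT = "dse-k8s-eqiad"


def helmfile_command(action: str, environment: str = ENVIRONMENT) -> list[str]:
    """Build the helmfile invocation for one step ("diff" or "sync")."""
    if action not in ("diff", "sync"):
        raise ValueError(f"unsupported action: {action!r}")
    return ["helmfile", "-e", environment, action]


def deploy(apply: bool = False, dry_run: bool = True) -> list[str]:
    """Return the planned command lines in order; run them unless dry_run.

    Hypothetical helper: always diffs first and only syncs when apply=True.
    """
    actions = ["diff"] + (["sync"] if apply else [])
    planned = []
    for action in actions:
        cmd = helmfile_command(action)
        planned.append(shlex.join(cmd))
        if not dry_run:
            # Assumes we are already on the deployment host (step 1 in the chat).
            subprocess.run(cmd, cwd=SERVICE_DIR, check=True)
    return planned


if __name__ == "__main__":
    # Dry run: only print what would be executed.
    for line in deploy(apply=True, dry_run=True):
        print(f"(cd {SERVICE_DIR} && {line})")
```

In keeping with the gitops-like approach mentioned in the chat, a values change would normally be merged through a CR first; a dry run of this sketch only prints the diff and sync commands so they can be reviewed before anything is applied.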
[11:32:39] !log merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/878128
[11:32:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:38:25] !log cancelled merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/878128
[11:38:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:46:17] Quarry, Patch-For-Review: Make available more options for number of shown rows of resultset (Quarry) - https://phabricator.wikimedia.org/T126540 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/4
[11:47:11] Quarry, Patch-For-Review: Make available more options for number of shown rows of resultset (Quarry) - https://phabricator.wikimedia.org/T126540 (rook) @samuelguebo This works great, thanks!
[11:47:40] Quarry, Patch-For-Review: Make available more options for number of shown rows of resultset (Quarry) - https://phabricator.wikimedia.org/T126540 (rook) Open→Resolved a:samuelguebo
[12:15:41] Quarry, Patch-For-Review: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/16
[12:16:09] Quarry, Patch-For-Review: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464 (rook) pr 16 works great! Thanks @taavi !
[12:16:18] Quarry, Patch-For-Review: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464 (rook) Open→Resolved
[13:12:54] Data-Engineering, Event-Platform Value Stream, serviceops, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata)
[13:12:58] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (Ottomata)
[13:16:05] btullis: airflow upgrade no good? :)
[13:33:54] ottomata: o/ is the value stream meeting going to be recorded?
[13:33:59] i think so?
[13:34:17] okok because I have some conflicting meetings
[13:34:23] so I was wondering :)
[13:38:58] i don't know but chances are very good :)
[13:39:39] (PS2) Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/889512
[13:46:47] ottomata: no there is an issue with the new airflow package on some missing python packages/dependency loop in conda
[13:47:02] ah hm
[14:21:54] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Product-Analytics, and 2 others: Analyze possible bot traffic for frwiki article Cookie (informatique) - https://phabricator.wikimedia.org/T313114 (hashar) I don't get it from the Wikipedia Android app (r/2.7.50426-r-2022-12-08 built b...
[14:29:30] Data-Engineering, Event-Platform Value Stream, serviceops, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) @gmodena you had some estimations somewhere, right?
[14:49:06] Data-Engineering, Event-Platform Value Stream, serviceops, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (gmodena) @JMeybohm @Ottomata this page contains metrics re compute and memory resources for the applica...
[15:15:16] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-WikimediaEvents, Product-Analytics, Technical-Debt: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (phuedx)
[15:22:12] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (bking) I added another downtime for the next 28 days. Sorry for the disruption.
[15:51:25] Data-Engineering, Event-Platform Value Stream: EventStreamCatalog removes 'topic' table option if connector = upsert-kafka - https://phabricator.wikimedia.org/T330769 (Ottomata)
[17:46:36] Data-Engineering, Foundational Technology Requests, Product-Analytics: "Source of truth" dataset for pageviews - https://phabricator.wikimedia.org/T310732 (Mayakp.wiki) p:High→Low Changing this to Medium, as we have workarounds in place and have been using the corrected pageviews in several r...
[17:48:19] Data-Engineering, Research-Backlog: [Open question] Improve bot identification at scale - https://phabricator.wikimedia.org/T138207 (SNowick_WMF)
[17:48:22] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Wikipedia-iOS-App-Backlog, and 6 others: Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage - https://phabricator.wikimedia.org/T328127 (SNowick_WMF) Open→Resolved
[17:51:47] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 74 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdforrester-WMF)
[17:55:39] Hey mforns - would you have some time for me before our next meeting?
[17:57:33] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 73 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[17:57:55] Data-Engineering, Advanced-Search, All-and-every-Wikisource, ArticlePlaceholder, and 73 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[18:11:39] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-WikimediaEvents, Product-Analytics, Technical-Debt: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (nettrom_WMF) This task came up in the Product Analytics team's board refinement mee...
[18:25:13] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate export_queries_to_relforge.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329871 (EBernhardson) a:EBernhardson
[19:11:07] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): [Flink Operation] How to handle app upgrades - https://phabricator.wikimedia.org/T328569 (gmodena)
[19:12:43] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): [Flink Operation] How to handle app upgrades - https://phabricator.wikimedia.org/T328569 (gmodena) > Should this ticket cover upgrades to other things as well (Flink, Python?) I think those are orthogonal to application upgrades. Changi...
[19:13:14] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): [Flink Operations] How to handle restarting a Flink application - https://phabricator.wikimedia.org/T328563 (gmodena)
[19:23:15] mforns: I updated my edit_hourly airflow MR for when you have time: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/237
[19:24:11] mforns: I think it is ready - we need both refinery+airflow deploy for this - let me know if you plan on deploying this week (shouldn't it be SandraEbele ?)
[19:31:20] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (Ottomata) > It seems Flink Kafka sources emit KafkaConsumer metrics, but Flink Kafka sinks do not emit KafkaProducer metrics? Oh, I got a [[ https://lists.apache....
[20:21:25] (CR) Joal: "Comments about comments! Thanks a lot Antoine for our persistence on this patch <3" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[21:33:09] !log Deploying section_image_recommendations DAG to platform_eng Airflow instance
[21:33:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log