[00:17:56] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:11] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye co...
[00:34:52] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye
[00:43:20] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:11] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1009.eqiad.wmnet with OS bullseye co...
[01:05:06] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye
[01:35:16] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye co...
[01:36:17] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (Cmjohnson) Open→Resolved @btullis these have been fixed, I updated the nic firmware and re-ran the image script.
[06:50:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1081%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:55:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp1081 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1081%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:56:31] hello folks
[06:56:46] one qs about search-drop-query-clicks - it has been failing for a long time, is it still needed?
[06:56:50] (on stat1007)
[08:42:38] Good question elukey. I know nothing about this job, nor in fact why it would run on stat1007. I can ask around a bit too.
[08:44:49] btullis: I recall that we added it years ago since we didn't know exactly where to put it, IIRC it is owned by product analytics
[08:56:08] Ah, yes I see. Looks like notifications go here: https://lists.wikimedia.org/hyperkitty/list/discovery-alerts@lists.wikimedia.org/message/OGGV4IHN4YAIF3NMD3KLQXBHLKQLYOXE/ and Aug 17 was the last one for this service.
[08:57:30] https://www.irccloud.com/pastebin/bsVsY7vH/
[09:08:15] I've asked in product analytics on Slack: https://wikimedia.slack.com/archives/CLKDS4MG9/p1666170467429389
[09:46:09] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, EventStreams: Old events in the stream - https://phabricator.wikimedia.org/T320558 (Iluvatar) Thanks. Today this bug does not exist. Even when reconnect through disable/enable network controller. You fixed it. :) I will test for a...
[10:11:20] Data-Engineering-Planning, Data Pipelines (Sprint 03), Technical-Debt: Prepare the fsimage - https://phabricator.wikimedia.org/T321167 (EChetty)
[10:22:51] Data-Engineering-Planning, Data Pipelines (Sprint 03), Technical-Debt: Create and deploy the fsimage job. - https://phabricator.wikimedia.org/T321168 (EChetty)
[10:29:09] Data-Engineering-Planning, Data Pipelines (Sprint 03), Technical-Debt: Extract the analysis and make it available on superset. - https://phabricator.wikimedia.org/T321169 (EChetty)
[10:29:35] Data-Engineering-Planning, Data Pipelines (Sprint 03), Technical-Debt: Extract the analysis and make it available on superset. - https://phabricator.wikimedia.org/T321169 (EChetty)
[10:29:48] Data-Engineering-Planning, Data Pipelines (Sprint 03), Technical-Debt: Prepare the fsimage - https://phabricator.wikimedia.org/T321167 (EChetty)
[10:29:56] Data-Engineering-Planning, Data Pipelines (Sprint 03), Technical-Debt: Create and deploy the fsimage job. - https://phabricator.wikimedia.org/T321168 (EChetty)
[10:30:35] Data-Engineering-Planning, Patch-For-Review, Technical-Debt: Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (EChetty)
[10:31:20] Data-Engineering-Planning, Data Pipelines, Patch-For-Review, Technical-Debt: Productionize HDFS fsimage data analysis job - https://phabricator.wikimedia.org/T261283 (EChetty)
[12:40:58] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, EventStreams: Old events in the stream - https://phabricator.wikimedia.org/T320558 (Ottomata) Open→Resolved a:Ottomata Haha, "I" fixed it! :p Interesting. I wonder if there were indeed somehow some old events at late...
[13:04:32] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03): Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (JArguello-WMF)
[13:04:41] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03): Prototype Flink job for content Dumps - https://phabricator.wikimedia.org/T320966 (JArguello-WMF)
[13:38:04] Data-Engineering-Planning, Event-Platform Value Stream: [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (EChetty)
[13:38:22] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03): [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (EChetty)
[14:03:26] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (BTullis) @Cmjohnson many thanks indeed, that's great. :+1:
[14:05:27] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Patch-For-Review: Change puppet jupyterhub module to point to conda-analytics - https://phabricator.wikimedia.org/T321088 (xcollazo) Once we are ready, I think we can also deploy in just one of the stat boxes for testing on prod dat...
[14:06:12] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:11:12] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:43:06] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Patch-For-Review: Change puppet jupyterhub module to point to conda-analytics - https://phabricator.wikimedia.org/T321088 (Ottomata) Yes I think so!
[16:22:33] Data-Engineering-Planning, Data Pipelines (Sprint 03): [airflow] Normalize the use of timeouts in Airflow DAGs - https://phabricator.wikimedia.org/T317549 (JAllemandou) >>! In T317549#8325450, @mforns wrote: > My initial conclusions would be: > > * Do not use DAG-wide `dagrun_timeout`, we already have S...
[16:40:09] PROBLEM - SSH on analytics1075.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:14:52] !log reset the BMC on analytics1075
[17:14:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:41:04] RECOVERY - SSH on analytics1075.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:00:54] (PS4) Milimetric: [WIP] Collaborate on a new editors dataset [analytics/refinery] - https://gerrit.wikimedia.org/r/838256
[18:42:11] (PS5) Milimetric: [WIP] Collaborate on a new editors dataset [analytics/refinery] - https://gerrit.wikimedia.org/r/838256
[18:47:55] (PS6) Milimetric: [WIP] Collaborate on a new editors dataset [analytics/refinery] - https://gerrit.wikimedia.org/r/838256
[18:50:29] (PS7) Milimetric: [WIP] Collaborate on a new editors dataset [analytics/refinery] - https://gerrit.wikimedia.org/r/838256
[19:09:13] Data-Engineering: Bug: User History has mismatching order of fields in Parquet vs. Hive - https://phabricator.wikimedia.org/T321231 (Milimetric)
[20:51:23] Data-Engineering, Product-Analytics: Identify imported revisions in mediawiki_history - https://phabricator.wikimedia.org/T221482 (Milimetric) I think we should do this. We can limit the pages we look at with the import log as Neil says, and then just mark all the revisions that have much larger revisio...