[01:02:40] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0:Wikistats 2 service - https://phabricator.wikimedia.org/T288301 (BPirkle)
[01:08:09] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0:Wikistats 2 service - https://phabricator.wikimedia.org/T288301 (BPirkle)
[01:12:20] Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0:Wikistats 2 service - https://phabricator.wikimedia.org/T288301 (BPirkle)
[01:30:58] (PS1) Gerrit maintenance bot: Add gur.wikipedia to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/882784 (https://phabricator.wikimedia.org/T327842)
[06:17:11] Data-Engineering-Planning, Event-Platform Value Stream, Wikidata: Realtime Wikibase editing UI and API - https://phabricator.wikimedia.org/T298305 (Lectrician1)
[08:47:52] Data-Engineering, Event-Platform Value Stream: eventutilities-python source and destination stream must be versioned - https://phabricator.wikimedia.org/T327866 (gmodena)
[10:13:06] Analytics: jmx_presto prometheus job down for some an-presto hosts - https://phabricator.wikimedia.org/T327753 (fgiunchedi) Thank you @Stevemunene ! I have acked the alert
[10:48:33] Data-Engineering, Data-Catalog, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (BTullis)
[10:52:00] Data-Engineering, Data-Catalog, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (BTullis) p:Triage→High Expediting this into the current sprint, since it is currently blocking newer sta...
[11:26:54] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr)
[11:32:57] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr) Open→Resolved This is complete @BTullis
[11:33:28] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (BTullis) Great. Many thanks.
[12:03:51] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (BTullis) Thanks @Stevemunene - that's good research. However, I'm not sure that it's totally applicable in our situation as they're...
[12:19:52] (PS2) Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/881824
[13:21:57] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/882784 (https://phabricator.wikimedia.org/T327842) (owner: Gerrit maintenance bot)
[13:30:15] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] eventutilities-python should support row state schema evolution - https://phabricator.wikimedia.org/T327900 (gmodena)
[13:38:52] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (EChetty)
[13:40:57] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (EChetty) TB: End of this sprint. (End of next week)
[13:41:12] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (EChetty) p:Medium→High
[13:44:56] Data-Engineering, Pageviews-Anomaly, Security-Team, Wikipedia-Android-App-Backlog, and 3 others: Add en.wiki articles to wikifeeds topviewed exemption list - https://phabricator.wikimedia.org/T327904 (Seddon)
[13:46:17] joal: steve_munene: Should we find some time for a call to discuss presto? T323783
[13:46:18] T323783: Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783
[13:46:39] I would like to! I have time now - how about you folks?
[13:47:15] joal: Sure thing.
[13:47:19] same here
[13:47:29] batcave!
[13:47:35] Data-Engineering, Event-Platform Value Stream: eventutilities-python source and destination stream must be versioned - https://phabricator.wikimedia.org/T327866 (Ottomata) I think a user specifying latest for sources is okay. Or, perhaps a major version compatibility (although that would be annoying to...
[13:48:00] Data-Engineering-Planning, Pageviews-Anomaly: Massive spike in pageviews for a few enwiki pages beginning with "Index" - https://phabricator.wikimedia.org/T327027 (EChetty)
[13:49:18] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly: Massive spike in pageviews for a few enwiki pages beginning with "Index" - https://phabricator.wikimedia.org/T327027 (EChetty)
[13:49:42] Data-Engineering-Planning, Event-Platform Value Stream: Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (EChetty)
[13:49:55] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 07): Spark Streaming Dumps POC: Backfill metadata table - https://phabricator.wikimedia.org/T323642 (EChetty)
[13:51:04] Data-Engineering-Planning, Platform Engineering: Modify HiveToDruid Job - https://phabricator.wikimedia.org/T302514 (EChetty) Open→Declined
[13:51:08] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: The network_internal druid load job fails if data is not present - https://phabricator.wikimedia.org/T302263 (EChetty)
[13:55:29] Data-Engineering, Event-Platform Value Stream: eventutilities-python source and destination stream must be versioned - https://phabricator.wikimedia.org/T327866 (gmodena) > I think a user specifying latest for sources is okay. Or, perhaps a major version compatibility (although that would be annoying to...
[14:01:10] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (BTullis) Here are some configuration parameters that we believe are going to be useful for testing this: From this page: https://a...
[14:07:11] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (JAllemandou) > * `node-scheduler.max-splits-per-node` 150 (default 100) Let's try with `500` as shown in the code example instead...
[14:17:02] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (BTullis) Also * `query.max-total-memory-per-node` increase from 24GB to 40 GB * `query.max-memory-per-node`: increase 12GB to 20 G...
[14:20:32] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (BTullis) >>! In T323783#8557390, @JAllemandou wrote: >> * `node-scheduler.max-splits-per-node` 150 (default 100) > Let's try with...
[14:49:42] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] eventutilities-python should support row state schema evolution - https://phabricator.wikimedia.org/T327900 (JArguello-WMF)
[14:50:02] a-team: I'm about to kick off a rolling reboot of all hadoop workers to pick up a new kernel. I'll be on the lookout for any failed jobs that result.
[14:52:11] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] Set PYTHONPATH and FLINK_CLASSPATH in Flink docker images. - https://phabricator.wikimedia.org/T327494 (JArguello-WMF)
[14:52:25] (CR) Milimetric: GDI Equity Landscape Tables/Scripts (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/881824 (owner: Nmaphophe)
[14:53:14] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] eventutilities-python should bundle java deps. - https://phabricator.wikimedia.org/T327251 (JArguello-WMF)
[14:54:44] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (JArguello-WMF)
[14:54:50] !log started a rolling-reboot of the hadoop workers via `sre.hadoop.reboot-workers` cookbook.
[14:54:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:56:05] Data-Engineering, Event-Platform Value Stream: eventutilities-python source and destination stream must be versioned - https://phabricator.wikimedia.org/T327866 (JArguello-WMF)
[14:57:36] Data-Engineering-Planning, Data-Catalog, Event-Platform Value Stream: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (JArguello-WMF)
[14:58:07] Data-Engineering-Planning, Data-Catalog, Event-Platform Value Stream: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (JArguello-WMF)
[14:58:29] thx btullis, lemme know if I can help kick the tires
[14:58:48] milimetric: Will do 👍
[14:59:13] Data-Engineering-Planning, Data-Catalog, Event-Platform Value Stream: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (JArguello-WMF)
[15:03:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:03:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:07:10] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] eventutilities-python should support nested row type info - https://phabricator.wikimedia.org/T327900 (JArguello-WMF)
[15:08:41] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:08:42] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:08:45] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] eventutilities-python should support nested row type info - https://phabricator.wikimedia.org/T327900 (JArguello-WMF)
[15:11:01] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] Set PYTHONPATH and FLINK_CLASSPATH in Flink docker images. - https://phabricator.wikimedia.org/T327494 (Ottomata)
[15:11:30] Data-Engineering, Event-Platform Value Stream: Set PYTHONPATH and FLINK_CLASSPATH in Flink docker images. - https://phabricator.wikimedia.org/T327494 (JArguello-WMF)
[15:12:29] Data-Engineering, Event-Platform Value Stream: Set PYTHONPATH and FLINK_CLASSPATH in Flink docker images. - https://phabricator.wikimedia.org/T327494 (JArguello-WMF)
[15:12:56] Data-Engineering, Event-Platform Value Stream: Flink docker image should work with pyflink - https://phabricator.wikimedia.org/T327494 (Ottomata)
[15:14:11] !log rebooting an-conf1003 for new kernel
[15:14:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:15:12] Data-Engineering, Event-Platform Value Stream: Q4- eventutilities-python should bundle java deps. - https://phabricator.wikimedia.org/T327251 (JArguello-WMF)
[15:15:34] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] eventutilities-python should bundle java deps. - https://phabricator.wikimedia.org/T327251 (Ottomata)
[15:22:18] (PS6) Peter Fischer: Provide internal schema for CirrusSearch update-pipeline updates. [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202)
[15:22:40] Data-Engineering, Event-Platform Value Stream: Q3 Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (JArguello-WMF) p:Triage→Medium
[15:22:59] (CR) CI reject: [V: -1] Provide internal schema for CirrusSearch update-pipeline updates. [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202) (owner: Peter Fischer)
[15:23:14] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (Ottomata) p:Medium→Triage
[15:24:46] Data-Engineering, Event-Platform Value Stream: eventutilities-python source and destination stream must be versioned - https://phabricator.wikimedia.org/T327866 (JArguello-WMF)
[15:25:17] Data-Engineering, Event-Platform Value Stream (Sprint 08): eventutilities-python source and destination stream must be versioned - https://phabricator.wikimedia.org/T327866 (lbowmaker)
[15:25:43] Data-Engineering, Event-Platform Value Stream (Sprint 08): eventutilities-python source and destination stream must be versioned - https://phabricator.wikimedia.org/T327866 (JArguello-WMF)
[15:31:10] Data-Engineering, Event-Platform Value Stream: eventutilities-python should support nested row type info - https://phabricator.wikimedia.org/T327900 (JArguello-WMF)
[15:37:51] Data-Engineering, Event-Platform Value Stream (Sprint 08): eventutilities-python source and destination stream must be versioned - https://phabricator.wikimedia.org/T327866 (Ottomata) > I would not want source / destination to have different behaviour though. That can get confusing for end users and ops....
[15:41:04] Data-Engineering, Event-Platform Value Stream: Q4 eventutilities-python should bundle java deps. - https://phabricator.wikimedia.org/T327251 (JArguello-WMF)
[15:41:54] Data-Engineering, Event-Platform Value Stream: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (JArguello-WMF)
[15:45:27] Data-Engineering, Event-Platform Value Stream (Sprint 08): eventutilities-python should support nested row type info - https://phabricator.wikimedia.org/T327900 (Ottomata)
[15:45:45] Data-Engineering, Event-Platform Value Stream (Sprint 08): Flink docker image should work with pyflink - https://phabricator.wikimedia.org/T327494 (Ottomata)
[15:45:49] Data-Engineering, Event-Platform Value Stream (Sprint 08): Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (Ottomata)
[15:46:47] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 9 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi)
[15:48:11] (CR) Thiemo Kreuz (WMDE): [C: -1] Update event schema for Kartographer external data (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/882631 (https://phabricator.wikimedia.org/T326637) (owner: Awight)
[15:50:09] Data-Engineering, Equity-Landscape: Milestone: Dashboard Template Complete - https://phabricator.wikimedia.org/T305479 (CMacholan) a:okwiri_oduor
[15:55:53] Data-Engineering, DBA, Data-Persistence, Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (Marostegui) I'll check our db-related hosts and I'll get back to you tomorrow
[16:04:30] Data-Engineering, SRE, ops-eqiad: Check BBU on an-worker1080, an-worker1084, and an-worker1086 - https://phabricator.wikimedia.org/T325984 (BTullis) Open→Resolved
[16:23:59] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Datahub errors in staging-codfw - https://phabricator.wikimedia.org/T327799 (BTullis) I tried adding reverse DNS entries for the staging-codfw cluster, since this was a difference between it and the staging-e...
[16:32:33] PROBLEM - IPMI Sensor Status on an-worker1080 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:36:22] Data-Engineering, Data-Catalog, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (BTullis) I have asked DataHub themselves about this and I am currently awaiting a reply.
[16:48:24] Data-Engineering-Planning, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783 (Stevemunene) Updated the Presto server acting as coordinator and the presto servers acting as worker nodes configs with the tuning...
[16:53:57] !log kicked off a rolling reboot of kafka-jumbo as part of T325132
[16:53:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:54:57] !log Restarting presto-server.service on presto coordinator an-coord1001 for T323783
[16:54:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:54:59] T323783: Add an-presto10[06-15] to the presto cluster - https://phabricator.wikimedia.org/T323783
[16:58:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[16:59:03] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:03:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2031 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2031%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:04:41] Data-Engineering, Event-Platform Value Stream (Sprint 08): Flink docker image should work with pyflink - https://phabricator.wikimedia.org/T327494 (Ottomata) No response from mailing list yet, but really, the pip installed flink just works better. I'm going to submit a patch to production-images.
[17:15:11] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:45] (PS3) Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/881824
[17:47:27] RECOVERY - Presto Server on an-presto1009 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[17:47:29] RECOVERY - Presto Server on an-presto1007 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[17:47:31] RECOVERY - Check systemd state on an-presto1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:35] RECOVERY - Presto Server on an-presto1012 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[17:47:49] RECOVERY - Presto Server on an-presto1010 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[17:47:53] RECOVERY - Check systemd state on an-presto1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:55] RECOVERY - Check systemd state on an-presto1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:57] RECOVERY - Check systemd state on an-presto1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:59] RECOVERY - Presto Server on an-presto1008 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[17:48:07] RECOVERY - Presto Server on an-presto1015 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[17:48:25] RECOVERY - Check systemd state on an-presto1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:27] RECOVERY - Check systemd state on an-presto1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:31] RECOVERY - Check systemd state on an-presto1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:37] RECOVERY - Presto Server on an-presto1011 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[17:48:45] RECOVERY - Presto Server on an-presto1013 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[17:48:49] RECOVERY - Check systemd state on an-presto1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:11:02] (CR) Milimetric: [V: +2 C: +2] GDI Equity Landscape Tables/Scripts [analytics/refinery] - https://gerrit.wikimedia.org/r/881824 (owner: Nmaphophe)
[18:42:09] Data-Engineering, DBA, Data-Persistence, Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (RKemper)
[18:45:12] Data-Engineering, DBA, Data-Persistence, Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (RKemper)
[18:48:01] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:50:35] Data-Engineering, DBA, Data-Persistence, Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (LSobanski)
[19:04:39] RECOVERY - IPMI Sensor Status on an-worker1080 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:29:47] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:02:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp6011 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6011%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:02:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp6011 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6011%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:07:41] (VarnishkafkaNoMessages) resolved: varnishkafka on cp6011 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6011%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:07:41] (VarnishkafkaNoMessages) resolved: varnishkafka on cp6011 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6011%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:20:52] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:51:20] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:23:06] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:23:16] (PS1) Milimetric: Get comment from comment table [analytics/refinery] - https://gerrit.wikimedia.org/r/883659 (https://phabricator.wikimedia.org/T326330)
[21:54:56] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:00:13] Data-Engineering-Planning: Emit lineage information about Airflow jobs to DataHub - https://phabricator.wikimedia.org/T312566 (Milimetric) >>! In T312566#8542940, @Antoine_Quhen wrote: > lmkwyt First of all, thank you!! > * using the datahub operator in place of the complete automation at first because thi...
[22:47:54] Heya, does anyone know if the graph image exported from https://w.wiki/6GBq (superset, 'Wikipedia web edits by interface', last year by month) is considered private (or can it be shared/uploaded to commons)?
[23:38:17] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (EBernhardson)
[23:46:12] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (EBernhardson)
[23:47:18] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (EBernhardson)
[23:47:58] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (EBernhardson) @Ottomata @bking I've tried to get most of the pieces in place that I could...