[00:00:01] Data-Engineering, Patch-For-Review: requesting Kerbos password for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313316 (RKemper) This should be all done. @MRaishWMF you should have received an e-mail with a temporary password; please change to a new password when you can (and also be sure t...
[00:00:31] Data-Engineering, Patch-For-Review: requesting Kerberos password for mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313316 (RKemper)
[00:00:53] Data-Engineering: requesting Kerberos password for mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313316 (RKemper) Open→In progress
[00:14:40] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:38:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:12] PROBLEM - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:48:54] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:55:18] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-c1505.scope,session-c1506.scope,session-c1507.scope,session-c1508.scope,session-c1510.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:40] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:54:12] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:28:26] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:51:14] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:25:32] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:33:56] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:48:06] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:59:26] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:41:16] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:44:52] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:52:38] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:07:34] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:41:38] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:22:40] (PS1) Lucas Werkmeister (WMDE): Remove instanceof.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/818438
[10:27:47] (PS2) Lucas Werkmeister (WMDE): Remove instanceof.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/818438 (https://phabricator.wikimedia.org/T314130)
[10:33:27] (CR) Michael Große: [C: +2] Remove instanceof.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/818438 (https://phabricator.wikimedia.org/T314130) (owner: Lucas Werkmeister (WMDE))
[10:34:06] (Merged) jenkins-bot: Remove instanceof.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/818438 (https://phabricator.wikimedia.org/T314130) (owner: Lucas Werkmeister (WMDE))
[10:34:49] (PS1) Lucas Werkmeister (WMDE): Remove instanceof.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/818407 (https://phabricator.wikimedia.org/T314130)
[10:35:42] (CR) Lucas Werkmeister (WMDE): [C: +2] Remove instanceof.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/818407 (https://phabricator.wikimedia.org/T314130) (owner: Lucas Werkmeister (WMDE))
[10:36:21] (Merged) jenkins-bot: Remove instanceof.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/818407 (https://phabricator.wikimedia.org/T314130) (owner: Lucas Werkmeister (WMDE))
[11:10:38] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:23:45] Data-Engineering, Data Pipelines, Wikidata, Wikidata Analytics: Some reliability metrics missing since June 20th '22 - https://phabricator.wikimedia.org/T314131 (Michael)
[12:26:45] Data-Engineering, Event-Platform, SRE, serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (Ottomata) Yar, no sorry, I have had zero time to work on this. @JArguello-WMF we should find a sprint to put this into.
[13:41:50] what a nasty bug: https://phabricator.wikimedia.org/T313955#8115296
[13:48:22] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:28:05] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:12:34] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (RobH)
[15:12:59] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (RobH)
[15:20:28] Data-Engineering, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (RobH)
[16:13:19] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:34:29] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:37:54] Data-Engineering, Community-Tech, Event Metrics, EventStreams, and 3 others: editgrowthconfig schema: '' should NOT have additional properties, - https://phabricator.wikimedia.org/T314173 (cjming)
[18:48:58] hey team :] I'm looking for grafana ops monitoring stats for an-launcher1002, but failing to find them, does anybody know where they are?
[19:01:35] aqu: hi! do you know when we freed up some space in an-launcher by deleting some airflow scheduler logs?
[19:03:55] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:05:07] ok, I found them :] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-launcher1002&var-datasource=thanos&var-cluster=analytics&from=now-30d&to=now
[19:05:23] It seems we deleted the logs on Jul 13th
[19:15:17] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:49:20] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:15:16] Data-Engineering-Kanban, Data Engineering Planning, Data Pipelines: Investigate why airflow sensor tasks fail without sending errors - https://phabricator.wikimedia.org/T311976 (mforns) Since we deleted some airflow logs under `an-launcher1002:/srv/analytics-airflow/logs` this issue has not happened....
[20:34:41] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:51:01] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:01:02] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:07:38] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:18:28] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:02:36] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring