[01:02:58] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:51:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2028%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [01:56:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2028 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp2028%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [02:15:10] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:46:44] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:47:54] 10Analytics: Add cawiki to clickstream dataset - https://phabricator.wikimedia.org/T327982 (10Robertgarrigos) [03:07:39] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Wmfdata-Python's CSV loading cannot handle standard 
quoted CSV fields - https://phabricator.wikimedia.org/T327983 (10nshahquinn-wmf) [03:09:53] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Wmfdata-Python's CSV loading cannot handle standard quoted CSV fields - https://phabricator.wikimedia.org/T327983 (10nshahquinn-wmf) [03:28:07] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:59:47] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:10:19] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:11:34] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:42:21] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:13:51] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:24:25] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, 
WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:30:12] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) [06:43:06] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) [06:49:19] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [06:51:40] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [06:53:03] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) Adding Jaime for the backup related hosts [07:16:34] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:27:06] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:28:31] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [07:35:24] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [07:45:26] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 
10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:04:06] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:21:51] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:22:51] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:23:41] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:25:30] Hi btullis and steve_munene - Let me know when you have time to talk about the presto scalability issue [08:37:23] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:38:25] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:40:14] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:44:35] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [08:49:16] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:05:17] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [09:05:55] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [09:17:03] Hi team - I'm gonna deploy refinery now [09:17:42] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/875951 (https://phabricator.wikimedia.org/T326330) (owner: 10Milimetric) [09:24:28] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/883659 (https://phabricator.wikimedia.org/T326330) (owner: 10Milimetric) [09:30:25] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Vgutierrez) [09:30:32] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:35:03] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [09:37:03] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) [09:38:06] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jcrespo) [09:39:06] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jcrespo) [09:39:33] (03PS1) 10Joal: Update cu_changes sqoop and hive table 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/883842 [09:43:26] !log Rerun failed 'cassandra_daily_load.load_mediarequest_per_file_to_cassandra 2023-01-25T00:00:00+00:00' task [09:43:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:44:08] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/883842 (owner: 10Joal) [09:44:56] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:48:16] ACKNOWLEDGEMENT - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis About to be decommed https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:48:22] !log deploying refinery using scap (no refinery-source deploy) [09:48:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:50:49] joal: apologies, I'm on a course for a couple of hours. [09:51:03] np btullis - we'll talk when you're back :) [09:51:33] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 08): Flink docker image should work with pyflink - https://phabricator.wikimedia.org/T327494 (10dcausse) I'm a bit hesitant to use the distribution packaged for pyflink as I'm not sure it's meant to be used that way for submitting production jobs, and I... 
[10:01:25] steve_munene, btullis: I had a failed deploy for an-test-client1001 due to no space left on device - I'll need help with that when you have time please [10:01:36] !log deploy refinery onto hdfs [10:01:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:05:31] joal: Looking now. It's mainly users' files, so e.g. lots of conda envs in people's homes. We can remove them, but might be better for people to clear their own. [10:05:36] joal: available to check on an-test-client1001 [10:05:54] https://usercontent.irccloud-cdn.com/file/JxgBwAiL/image.png [10:06:15] https://usercontent.irccloud-cdn.com/file/2Xscvw36/image.png [10:06:26] I will clear stuff from my own home directory now. [10:06:31] thanks btullis [10:07:29] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MoritzMuehlenhoff) We can't migrate the puppetdb2002 VM (it's being moved to baremetal, but that is unlikely completed by then), so we'll need to dis... [10:08:53] ack btullis - thanks for doing your own cleanup [10:09:49] I'll also need a reset of the /srv/deployment/analytics/refinery folder to deploy it anew (otherwise the jars don't get downloaded correctly) - please [10:16:47] I've cleared my own space [10:16:51] https://usercontent.irccloud-cdn.com/file/2zUKFJqG/image.png [10:20:26] joal: I think that the reset is done [10:21:16] btullis: ack - I'll try a new deploy [10:22:06] btullis: it succeeded, but very fast - too fast to be ok IMO [10:23:38] OK, still stuck on course at the moment. 
[10:23:53] btullis: size of the `artifacts` directory on an-test-client1001: 293M - while on stat1008 it is 3.6G [10:24:49] Expected issue: when redeploying, scap finds an existing directory with the correct sha and doesn't pull anew - Solution should be to drop that latest folder for scap to redeploy it entirely [10:27:32] https://www.irccloud.com/pastebin/0h8XkRfZ/ [10:27:47] How about 1.2 GB? Can you try again? [10:28:33] btullis: it's not about disk-space anymore, it's about scap folders :) [10:29:04] ah sorry - didn't understand your message [10:29:58] btullis: I think easiest is to delete /srv/deployment/analytics/refinery-cache/revs/8ed8435413afe947e1a7b9a1463c296b2d9a3ab2 [10:30:48] Done. [10:31:08] ack - will redeploy now [10:31:40] Yes, I meant after I did a git status it refreshed the indexes and the size of the directory went to 1.2 GB. But I get your point about the scap folders :-) [10:31:51] it's taking the expected long time now btullis - I think we're on a good path [10:32:33] 👍 [10:33:27] solved btullis - thanks a lot :) [10:36:03] !log drop/recreate wmf_raw.mediawiki_private_cu_changes hive table to have new fields [10:36:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:36:11] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:42:29] !log deploying airflow analytics for GDI dags [10:42:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:45:38] (03PS2) 10Awight: Update event schema for Kartographer external data [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/882631 (https://phabricator.wikimedia.org/T326637) [10:46:10] (03CR) 10Awight: Update event schema for Kartographer external data (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/882631 (https://phabricator.wikimedia.org/T326637) (owner: 10Awight) [10:52:21] PROBLEM - 
MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:10:26] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Update event schema for Kartographer external data (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/882631 (https://phabricator.wikimedia.org/T326637) (owner: 10Awight) [11:12:53] PROBLEM - Host analytics1076 is DOWN: PING CRITICAL - Packet loss = 100% [11:30:17] RECOVERY - Host analytics1076 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [12:16:25] 10Data-Engineering, 10Event-Platform Value Stream: Q4 eventutilities-python should bundle java deps. - https://phabricator.wikimedia.org/T327251 (10gmodena) [12:17:01] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:21:58] 10Data-Engineering, 10Event-Platform Value Stream: [NEEDS GROOMING] Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10gmodena) [12:23:50] 10Data-Engineering, 10Event-Platform Value Stream: [NEEDS GROOMING] Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10gmodena) [12:35:12] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [12:35:54] ^- this is me, furud didn't come back from a managed restart, so I'm trying again. 
[12:37:12] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [12:44:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3051%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [12:45:10] (03PS3) 10Aqu: Remove Guava from dependency [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) [12:49:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp3051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp3051%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [13:15:30] (03CR) 10CI reject: [V: 04-1] Remove Guava from dependency [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu) [13:16:55] FYI, I'm upgrading nodejs on an-tool1007/turnilo.wikimedia.org in a bit, which will involve a Turnilo restart, I tested the new nodejs on the an-tool1011 staging host and seems all fine there [13:25:02] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 08): Flink docker image should work with pyflink - https://phabricator.wikimedia.org/T327494 (10Ottomata) > why they provide part of the flink distribution again in the python package tbh Do you mean in the example Dockerfile? `FROM flink` + `pip inst... 
[13:26:54] nodejs update/turnilo restart is complete [13:39:30] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:46:00] (03PS4) 10Aqu: Remove Guava from dependency [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) [14:00:22] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:04:21] (03CR) 10Ottomata: Remove Guava from dependency (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu) [14:04:34] I think I understand our issue aqu [14:05:01] 10Data-Engineering: ***New Tasks Above*** - https://phabricator.wikimedia.org/T328026 (10EChetty) [14:05:31] 10Data-Engineering: ***New Tasks Above*** - https://phabricator.wikimedia.org/T328026 (10EChetty) 05Open→03Stalled [14:05:51] 10Data-Engineering-Planning: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (10EChetty) [14:09:43] 10Data-Engineering, 10Equity-Landscape: Add country_meta_data - https://phabricator.wikimedia.org/T324681 (10ntsako) a:05ntsako→03JAnstee_WMF [14:16:16] (03CR) 10CI reject: [V: 04-1] Remove Guava from dependency [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu) [14:17:16] 10Data-Engineering, 10CheckUser, 10MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (10Dreamy_Jazz) [14:24:31] 
(03CR) 10Awight: Update event schema for Kartographer external data (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/882631 (https://phabricator.wikimedia.org/T326637) (owner: 10Awight) [14:53:43] joal, I read and responded to your latest comment on the Airflow Dataset change. Do you want to chat more about that before merging? Or did you mention it to have it in mind for the future? [14:53:53] hi! BTW :P [14:56:28] Hi mforns :) [14:56:32] We can talk :) [14:56:59] mforns: quick batcave? [14:57:43] joal: yes! [14:58:02] joal: can you pass me the link, please? I lost it [14:58:11] https://meet.google.com/rxb-bjxn-nip [14:59:14] mforns: --6 [15:13:32] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:27:20] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10xcollazo) [15:37:11] 10Analytics, 10Data-Engineering: Add cawiki to clickstream dataset - https://phabricator.wikimedia.org/T327982 (10Milimetric) @EChetty: The old Analytics tag should auto-tag Data-Engineering or be archived/deleted so folks can't use it. I've heard a lot of confusion around the team name lately, and I think th... 
[16:41:48] (03PS1) 10Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/883525 [16:46:48] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:48:11] (03PS2) 10Nmaphophe: GDI Equity Landscape Tables/Scripts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/883525 [16:56:27] (03CR) 10Milimetric: [V: 03+2 C: 03+2] GDI Equity Landscape Tables/Scripts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/883525 (owner: 10Nmaphophe) [17:02:14] 10Data-Engineering-Planning, 10Data Pipelines, 10Product-Analytics: PySpark warning messages - https://phabricator.wikimedia.org/T315024 (10Mayakp.wiki) @nettrom_WMF will add to this task about the warning messages we get when using wmfdata-python for querying MariaDB in the new conda-analytics environment. [17:05:51] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 07): Build Druid Operator - https://phabricator.wikimedia.org/T309996 (10JArguello-WMF) [17:06:07] 10Data-Engineering-Planning, 10Data Pipelines (Sprint 07): Build Druid Operator - https://phabricator.wikimedia.org/T309996 (10JArguello-WMF) [17:06:41] (VarnishkafkaNoMessages) firing: varnishkafka on cp6007 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp6007%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages 
[17:11:41] (VarnishkafkaNoMessages) resolved: varnishkafka on cp6007 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp6007%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages [17:14:51] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Eevans) [17:18:05] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:20:27] 10Data-Engineering, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Datahub errors in staging-codfw - https://phabricator.wikimedia.org/T327799 (10JMeybohm) I did not find any real clues as well. What I do see is that GMS does get killed at random stages during startup and the logs do not differ on... 
[17:26:11] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 08): Flink docker image should work with pyflink - https://phabricator.wikimedia.org/T327494 (10Ottomata) Talked to David in IRC, we are going to give the pyflink based image a go, as long as we include some plugin .jars we need in opt/. Namely the s3... [17:34:01] 10Data-Engineering, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Datahub errors in staging-codfw - https://phabricator.wikimedia.org/T327799 (10BTullis) Is this the last blocker to upgrading staging-eqiad to 1.23 @JMeybohm ? If so, I wonder whether we should proceed with the upgrade? This would... [17:38:45] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:03:47] (03PS5) 10Joal: Remove Guava from dependency [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: 10Aqu) [18:10:47] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:11:11] btullis, steve_munene : maybe tomorrow we'll find some time to investigate presto a bit more? [18:14:45] joal: yes we can set aside some time. 
[18:16:46] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10herron) [18:17:23] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10herron) [18:49:04] 10Data-Engineering-Planning, 10Data Pipelines, 10Product-Analytics: PySpark warning messages - https://phabricator.wikimedia.org/T315024 (10nshahquinn-wmf) @Mayakp.wiki oh, we already have a task for that: T324135. I think that's useful since the source and potential responses to the warnings are different. [18:50:44] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10EBernhardson) [18:50:58] 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10EBernhardson) [19:04:23] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:35:56] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:46:18] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:54:28] (03CR) 10Dbrant: [C: 03+2] Android schema migration: MobileWikiAppCreateAccount Legacy schema: 
https://meta.wikimedia.org/wiki/Schema:MobileWikiAppCreateAccount [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/881912 (https://phabricator.wikimedia.org/T324167) (owner: 10Sharvaniharan) [19:55:03] (03Merged) 10jenkins-bot: Android schema migration: MobileWikiAppCreateAccount Legacy schema: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppCreateAccount [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/881912 (https://phabricator.wikimedia.org/T324167) (owner: 10Sharvaniharan) [20:04:59] (03CR) 10Dbrant: [C: 03+2] Android schema migration: MobileWikiAppInstallReferrer [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/881915 (https://phabricator.wikimedia.org/T324167) (owner: 10Sharvaniharan) [20:05:44] (03Merged) 10jenkins-bot: Android schema migration: MobileWikiAppInstallReferrer [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/881915 (https://phabricator.wikimedia.org/T324167) (owner: 10Sharvaniharan) [20:14:26] (03CR) 10Dbrant: [C: 03+2] Android schema migration: MobileWikiAppReadingLists Legacy schema: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppReadingLists [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/881917 (https://phabricator.wikimedia.org/T324167) (owner: 10Sharvaniharan) [20:14:59] (03Merged) 10jenkins-bot: Android schema migration: MobileWikiAppReadingLists Legacy schema: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppReadingLists [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/881917 (https://phabricator.wikimedia.org/T324167) (owner: 10Sharvaniharan) [20:18:22] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:37:51] 10Data-Engineering, 
10Event-Platform Value Stream (Sprint 08): Flink docker image should work with pyflink - https://phabricator.wikimedia.org/T327494 (10Ottomata) Okay @gmodena @dcausse pyflink based flink-1.16.0-wmf4 image published. flink-fs-s3-presto is in /opt/flink/opt. [20:40:31] 10Data-Engineering, 10DBA, 10Data-Persistence, 10Discovery-Search, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10RKemper) [20:49:16] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10ci-test-error: Use a fake timer in EventBus unit test for PageChangeEventSerializerTest::testCreatePageChangeVisibilityEvent - https://phabricator.wikimedia.org/T325341 (10Ottomata) @Ladsgroup @Umherirrender did my change help? [21:03:03] (03PS1) 10Ottomata: development/mediawiki/page/chage - 2.0.0 - remove comment_html [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/884088 (https://phabricator.wikimedia.org/T327065) [21:21:16] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:43:11] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 07): Deployment pipeline docker image of flink mediawiki stream enrichment pyhon - https://phabricator.wikimedia.org/T326731 (10gmodena) [21:44:12] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 07): Deployment pipeline docker image of flink mediawiki stream enrichment pyhon - https://phabricator.wikimedia.org/T326731 (10gmodena) [22:02:44] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:18:38] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 07), 10Patch-For-Review: Design 
Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) TODO: - Removing comment_html: [[ https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/8...
[22:23:31] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:37:25] (PS1) Sharvaniharan: Android schema migration: MobileWikiAppSessions Legacy schema: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppSessions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/884105 (https://phabricator.wikimedia.org/T324167)
[22:37:53] (CR) CI reject: [V: -1] Android schema migration: MobileWikiAppSessions Legacy schema: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppSessions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/884105 (https://phabricator.wikimedia.org/T324167) (owner: Sharvaniharan)
[22:39:26] (PS2) Sharvaniharan: Android schema migration: MobileWikiAppSessions Legacy schema: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppSessions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/884105 (https://phabricator.wikimedia.org/T324167)
[22:39:56] (CR) CI reject: [V: -1] Android schema migration: MobileWikiAppSessions Legacy schema: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppSessions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/884105 (https://phabricator.wikimedia.org/T324167) (owner: Sharvaniharan)
[22:45:14] (PS3) Sharvaniharan: Android schema migration: MobileWikiAppSessions Legacy schema: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppSessions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/884105 (https://phabricator.wikimedia.org/T324167)
[22:49:19] (Abandoned) Sharvaniharan: Android schema migration: MobileWikiAppSessions Legacy schema: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppSessions
[schemas/event/secondary] - https://gerrit.wikimedia.org/r/881750 (https://phabricator.wikimedia.org/T324167) (owner: Sharvaniharan)
[22:51:15] (CR) Sharvaniharan: "Made a minor change from this : https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/881750" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/884105 (https://phabricator.wikimedia.org/T324167) (owner: Sharvaniharan)
[23:25:54] PROBLEM - MegaRAID on an-worker1087 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:38:37] Data-Engineering-Planning, Data Pipelines, Product-Analytics: Creating a Spark session causes a torrent of log spam - https://phabricator.wikimedia.org/T315024 (nshahquinn-wmf)
[23:41:34] Data-Engineering, Product-Analytics, Wmfdata-Python: Wmfdata-Python triggers a Pandas warning during mariadb.run and hive.run - https://phabricator.wikimedia.org/T324135 (nshahquinn-wmf)
[23:46:25] RECOVERY - MegaRAID on an-worker1087 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring