[01:21:44] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:56:06] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:01:58] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:10:38] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:22:26] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:23:12] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye
[09:54:17] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:43:21] (PS24) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[10:48:53] (PS25) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[11:02:11] (PS26) Nmaphophe: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566
[11:59:09] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:10:06] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye exec...
[12:17:05] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye
[12:31:55] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:32:54] (CR) Gehel: Fix Array UDFs (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[12:38:52] (PS27) Gehel: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[12:47:43] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye exec...
[12:48:38] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye
[12:51:22] (PS28) Gehel: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[12:56:17] PROBLEM - Check systemd state on datahubsearch1003 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:58:31] RECOVERY - Check systemd state on datahubsearch1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:09:58] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (BTullis) @Cmjohnson - sorry to trouble you with this old ticket, but I'm having an issue with three of these new an-presto hosts. * an-presto...
[13:13:07] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye exec...
[13:16:44] (CR) Joal: "One nit before merging" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[13:19:27] btullis: o/ here when you want
[13:19:33] (PS29) Gehel: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[13:19:49] (CR) Gehel: Fix Array UDFs (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[13:20:51] (CR) Joal: [C: +2] "Awesome refactoring :) Thank you Ntsako and Guillaume :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[13:20:53] (CR) Gehel: Fix Array UDFs (8 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[13:31:57] (Merged) jenkins-bot: Fix Array UDFs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/828566 (owner: Nmaphophe)
[13:40:20] Data-Engineering: Fix `refinery-drop-older-than` script for end-of-month/end-of-year - https://phabricator.wikimedia.org/T316746 (mforns) I think it does belong to data pipelines. Probably infrastructure, since it's the tool (and not any particular job) that is failing. It's not tech debt, it's a bug that wi...
[13:53:31] Data-Engineering, Event-Platform Value Stream (Sprint 01), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) > "this revision has a slot that will be updated later, check <> for it" This soun...
[14:31:32] Data-Engineering: geoeditors public version is not available for non-Wikipedia projects - https://phabricator.wikimedia.org/T317040 (Urbanecm)
[14:59:47] Data-Engineering, Discovery-Search, Platform Engineering: Coordinate with Platform Engineering / Data Value Stream Team about a rework of the Search Update Pipeline - https://phabricator.wikimedia.org/T317046 (Gehel)
[15:17:24] Data-Engineering, SRE, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (pfischer) a:Jelto→pfischer
[16:05:31] milimetric: I have created this: https://phabricator.wikimedia.org/T317053 to track the DataHub issue. It's unbreak-now, so I'll work on it with a high priority.
[16:06:23] btullis: I'll sync with milimetric on this, but my assumption is that the daily reindexes need an airflow deploy that has not happened
[16:07:31] Data-Engineering, Data Pipelines: Fix `refinery-drop-older-than` script for end-of-month/end-of-year - https://phabricator.wikimedia.org/T316746 (JArguello-WMF)
[16:08:28] joal: Thanks. I think that it's more of an intrinsic problem with version 0.8.43. The MAE (metadata audit event) consumer seems to be having trouble reading from the GMS component.
[16:09:11] Data-Engineering, Data Pipelines: Fix `refinery-drop-older-than` script for end-of-month/end-of-year - https://phabricator.wikimedia.org/T316746 (JArguello-WMF) p:Triage→High
[16:09:50] Data-Engineering, Data Pipelines: Fix `refinery-drop-older-than` script for end-of-month/end-of-year - https://phabricator.wikimedia.org/T316746 (JArguello-WMF) @EChetty Let's discuss this one during next planning, please.
[16:12:45] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:13:56] Data-Engineering, Discovery-Search, Platform Engineering: Coordinate with Platform Engineering / Data Value Stream Team about a rework of the Search Update Pipeline - https://phabricator.wikimedia.org/T317046 (JArguello-WMF) @EChetty In what value stream does this one go?
[16:16:37] (CR) Phuedx: [C: +1] mediawiki/client/error: rename 'tags' to 'error_context' [schemas/event/primary] - https://gerrit.wikimedia.org/r/829295 (https://phabricator.wikimedia.org/T316992) (owner: Gergő Tisza)
[16:17:36] Data-Engineering: geoeditors public version is not available for non-Wikipedia projects - https://phabricator.wikimedia.org/T317040 (JArguello-WMF) Thanks for submitting your request. Let me check with our Product Manager @EChetty s and I'll write back with the next steps.
[16:26:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_navigationtiming_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:23] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:32:20] hi joal, yeah, I think the airflow deploy with the lineage plugin and all that stuff is a separate issue
[16:32:37] this is just a regression from the 0.8.34 version (I think that's what we were running before)
[16:33:05] there might just be a configuration change or something like that, but I couldn't see anything in the release notes
[16:34:25] weird milimetric
[16:34:55] milimetric: IIRC the airflow change has not been deployed last week is my assumption - can you confirm it has?
[16:35:20] mforns: Heya - Would you take a look at Nj review please?
[16:35:35] joal: sure!
[16:35:40] mforns: I did what I could but your eagle airflow eye would be greatly appreciated here :)
[16:35:57] no problem joal! will look
[16:35:58] mforns: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/134
[16:36:06] thanks
[16:37:19] joal: which airflow change specifically are you talking about?
[16:37:48] the one that would make the airflow job use a new artifact to load the data?
[16:37:51] milimetric:
[16:38:15] milimetric: you have built/published a new artifact for the new datahub version
[16:38:32] milimetric: but has the airflow job been instructed to use it?
[16:38:47] joal: right, I'm actually not 100% sure what state that's in, but the new version of DataHub is failing with manual ingestions, using the correct new version of the CLI
[16:39:09] OoooOOOOOOoooh ! ok - not good then
[16:39:22] yep, so I'll sort out the airflow job after we figure out what's going on with this too
[16:39:25] I get it now - sorry about the misunderstanding milimetric
[16:39:34] thanks for the clarification
[16:47:11] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:00:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:45] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:21:11] There is a tentative fix for the datahub issue: https://github.com/datahub-project/datahub/pull/5827
[17:21:11] I can try to backport this to our branch tomorrow, whilst we wait for them to release it.
[17:23:02] The indices will sort themselves out once the mae consumer starts properly, I'm pretty sure. It's not just manual ingestions that are affected, it'll be any metadata changes.
[17:32:59] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:07:23] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:08:45] Data-Engineering: geoeditors public version is not available for non-Wikipedia projects - https://phabricator.wikimedia.org/T317040 (Urbanecm)
[18:09:59] Data-Engineering: geoeditors public version is not available for non-Wikipedia projects - https://phabricator.wikimedia.org/T317040 (Urbanecm) >>! In T317040#8212105, @JArguello-WMF wrote: > Thanks for submitting your request. Let me check with our Product Manager @EChetty s and I'll write back with the nex...
[19:50:29] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:56:18] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (codebug)
[20:24:53] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:59:09] (PS1) Mforns: Migrate unique devices queries to SparkSql and move to /hql [analytics/refinery] - https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841)
[21:00:50] (CR) Mforns: "I've checked that all queries work in Spark3," [analytics/refinery] - https://gerrit.wikimedia.org/r/829862 (https://phabricator.wikimedia.org/T305841) (owner: Mforns)