[01:08:27] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:17:05] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:34:13] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:45:39] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:19:57] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:17:05] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:25:37] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:11:01] 10Analytics: Wikipedia in Chromium constantly throws exception due to the Kaspersky browser extension - https://phabricator.wikimedia.org/T314274 (10Sunpriat2)
[08:45:35] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:42:49] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:17:05] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:02:54] 10Analytics: Wikipedia in Chromium constantly throws exception due to the Kaspersky browser extension - https://phabricator.wikimedia.org/T314274 (10Aklapper) See https://support.google.com/chrome/thread/2047906/unchecked-runtime-lasterror-the-message-port-closed-before-a-response-was-received - please point to...
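The MegaRAID check above expects every logical drive to report a WriteBack cache policy; controllers configured as "WriteBack with BBU" often fall back to WriteThrough while the battery is charging or relearning, which is what makes this alert flap. As a minimal sketch of how one might inspect this by hand on the host, assuming the megacli utility from the wikitech MegaCli page is installed (the binary is sometimes named MegaCli64 and flag casing may vary):
# Show the current cache policy of every logical drive on every adapter
$ sudo megacli -LDGetProp -Cache -LAll -aAll
# Check the battery, since a BBU relearn cycle commonly forces a temporary WriteThrough fallback
$ sudo megacli -AdpBbuCmd -GetBbuStatus -aAll
# Only if the policy is genuinely misconfigured (not a BBU fallback), set WriteBack explicitly
$ sudo megacli -LDSetProp WB -LAll -aAll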
[12:11:11] (03PS1) 10Phuedx: mediawiki/client/metrics_event: Make custom_data a map type [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/819043 (https://phabricator.wikimedia.org/T314151)
[13:12:11] 10Data-Engineering, 10Community-Tech, 10Event Metrics, 10EventStreams, and 3 others: editgrowthconfig schema: '' should NOT have additional properties, - https://phabricator.wikimedia.org/T314173 (10EChetty)
[13:12:47] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering: "Pages to date" not loading with "daily" metric - https://phabricator.wikimedia.org/T312717 (10EChetty)
[13:22:02] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Fix turnilo after upgrade - https://phabricator.wikimedia.org/T308778 (10ayounsi) @BTullis could it be possible to fix https://turnilo.wikimedia.org/#network_flows_internal/ until the overall issue is fixed?
[14:09:04] 10Data-Engineering, 10Event Metrics, 10EventStreams, 10Growth-Team, and 2 others: editgrowthconfig schema: '' should NOT have additional properties, - https://phabricator.wikimedia.org/T314173 (10dmaza)
[14:13:13] hiiiii! quick question... if I have a long-running process in a Jupyter notebook on a stat machine, is it expected that this will prevent others' tasks from running? should I re-nice the job, and if so, any guidance on what to set it to?
[14:14:12] 10Analytics: Wikipedia in Chromium constantly throws exception due to the Kaspersky browser extension - https://phabricator.wikimedia.org/T314274 (10Sunpriat2) >>! In T314274#8119041, @Aklapper wrote: > What happens when using https://en.wikipedia.org/wiki/Main_Page?debug=true ? all the same {F35375423}
[14:15:01] Hi AndyRussG: - I don't believe that this will prevent any other users' tasks from running. It should be completely fair scheduling. Which stat machine are you on, out of interest? I can see if it's a particularly heavy process if you like.
[14:15:27] btullis: thanks!!! it's stat1005
[14:16:05] I did get a request to renice the job running in the notebook... maybe something about the library I'm using in the notebook makes it hog all the cores or something?
[14:17:11] process has been running since last night and I'd love to keep it going during today, so definitely if some action is appropriate for resource sharing, I should do that..... :)
[14:17:22] Oh I see it. Yep, it's certainly big. :-)
[14:18:55] If you just use `nice 9406` it will by default add 10 to the niceness value. It's currently on the default of 0 niceness, as with almost all other processes.
[14:19:06] Just to clarify, I was the one who made the "renice" request. As a general practice, any job that uses all cores and runs for more than 1 hour should be "reniced", to ensure the priority is appropriately set and other jobs can run in harmony. I hope this is something that would help everyone using the stat machines!
[14:19:32] hugely appreciate u reaching out about this btw aarora :) :)
[14:19:51] :+1
[14:19:56] Yep, many thanks both.
[14:21:42] @Ben, there's another issue as well. "Long-running" jobs from another user "jiawang" are consuming a lot of RAM, and the current usage is almost 300 GB. It would be great if there was a way to reach out to the user and request to release the resources if possible. Thanks!
[14:21:59] btullis: oh btw 9406 isn't my job... the pid for my conda environment running the jupyter server is 9908 I think... my username is andyrussg
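A quick note on the nice/renice usage discussed above: `nice` applies its default adjustment of +10 only when launching a new command, whereas an already-running process has to be adjusted with `renice` and an explicit value. A minimal sketch, where the script name is a placeholder and the PID is the Jupyter server PID mentioned above:
# Start a new command at the default niceness adjustment of +10 (placeholder script name)
$ nice python long_running_job.py
# Lower the priority of an already-running process, e.g. the Jupyter server PID above
$ renice -n 10 -p 9908
# Confirm the new niceness value
$ ps -o pid,ni,comm -p 9908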
[14:23:26] K just did this:
[14:23:29] $ renice +10 9908
[14:23:31] 9908 (process ID) old priority 0, new priority 10
[14:23:46] aarora: OK, I'll reach out to the user. Thanks for the heads-up.
[14:24:07] thanks again both!
[14:24:09] thank u btullis! :)
[14:25:08] hmmm right it looks like sometimes my process goes up to 3300% CPU usage (so I guess that means, yeah, many cores) and other times not
[14:25:18] Andy, I guess you need to renice another job, I don't think it properly reniced. It seems you used the pid of a child process, and not the main process that launched other threads.
[14:25:23] I wonder if for future occasions there's a way to limit how many cores it'll use
[14:25:59] If I see it correctly, the parent process pid is 35993
[14:27:51] aarora: yes one sec
[14:28:47] aarora: that's actually a child process, in that the parent process would be my jupyter server, I think? still yes I'll try renicing that one
[14:29:31] for the curious, here's the notebook I'm running: https://gitlab.wikimedia.org/andyrussg/jupyter_notebooks/-/blob/master/prophet_trends.ipynb
[14:29:58] $ renice +10 35993
[14:30:01] 35993 (process ID) old priority 0, new priority 10
[14:30:49] 10Analytics: Wikipedia in Chromium constantly throws exception due to the Kaspersky browser extension - https://phabricator.wikimedia.org/T314274 (10Sunpriat2) 05Open→03Invalid
[14:30:49] aarora: are you now able to run your processes better?
[14:31:44] yes, I think it should be better now. I will launch my process in a while, but this should do the trick. Thanks again!
[14:32:22] aarora: thank u, lmk if u still see issues... in my notebook I can see that things are still chugging along
[14:33:33] aaand wmf gitlab is down....
[14:35:06] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.807 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[14:46:56] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.7399 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[14:51:47] 10Analytics: Wikipedia in Chromium constantly throws exception due to the Kaspersky browser extension - https://phabricator.wikimedia.org/T314274 (10Sunpriat2)
[14:53:01] 10Data-Engineering, 10Event-Platform, 10SRE, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10JArguello-WMF)
[14:53:47] 10Data-Engineering, 10API Platform: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (10Milimetric) @BPirkle: good question, multi-part answer: First, everything in the [[ https://gerrit.wikimedia.org/g/analytics/aqs/+/9c062f255237170c1fd7c568dbfb825076b11258/s...
[15:19:08] 10Data-Engineering-Kanban, 10Data Engineering Planning (Sprint 02): Migrate the projectview jobs - https://phabricator.wikimedia.org/T305844 (10EChetty)
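Following up on the parent-process and core-count questions in the conversation above, a minimal sketch under a few assumptions: the PID is the one quoted above, pgrep/taskset are available (they usually are on Debian hosts), and `jupyter notebook` stands in for however the notebook server is actually launched.
# Renice the parent process and its direct children together; pgrep -P lists direct child PIDs
# (deeper descendants would need to be walked recursively)
$ renice -n 10 -p 35993 $(pgrep -P 35993)
# For future runs, cap core usage by pinning the server to a CPU set at launch time
$ taskset -c 0-15 jupyter notebook
# Many numerical libraries also honour thread-count environment variables
$ export OMP_NUM_THREADS=8 MKL_NUM_THREADS=8 OPENBLAS_NUM_THREADS=8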
[15:50:44] (03CR) 10Milimetric: [C: 03+2] "This change looks fine. For deployment, I'm not 100% clear on whether we need to upload UAParser 1.5.3 separately to Archiva in order for" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/818083 (https://phabricator.wikimedia.org/T306829) (owner: 10Aqu)
[16:01:13] (03Merged) 10jenkins-bot: Update ua-parser library [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/818083 (https://phabricator.wikimedia.org/T306829) (owner: 10Aqu)
[16:18:08] 10Data-Engineering, 10Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (10EChetty) Im happy for you to rename it :)
[16:40:40] 10Data-Engineering, 10API Platform: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (10BPirkle) >>! In T311190#8119683, @Milimetric wrote: > In some cases, the link is obvious but in other cases it's indirect. Like, for example, the **Bytes difference** endpoi...
[20:10:57] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10RobH)
[20:12:05] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10RobH)
[20:12:08] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10RobH)
[21:27:48] 10Analytics-Radar, 10Discovery, 10Discovery-Analysis, 10Discovery-Search: UDF for language detection - https://phabricator.wikimedia.org/T182352 (10MPhamWMF) 05Open→03Declined Closing out low/est priority tasks over 6 months old with no activity within last 6 months in order to clean out the backlog of...