[00:52:55] (CR) MNeisler: talk_page_event schema (3 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[01:59:05] Analytics: Kerberos identity for cicalese - https://phabricator.wikimedia.org/T293850 (CCicalese_WMF)
[04:24:27] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:08:25] PROBLEM - Hadoop DataNode on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:14:11] very weird, on analytics1066 there are a ton of processes, all for https://yarn.wikimedia.org/cluster/app/application_1633985963344_37203
[06:14:23] I killed the main java apps
[06:14:34] and I see Kernel soft lockups messages
[06:14:42] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=analytics1066&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics
[06:25:13] RECOVERY - Hadoop DataNode on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[06:27:39] !log reboot analytics1066 - OS showing CPU soft lockups, tons of defunct processes (including node manager) and high CPU usage
[06:27:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:28:23] so I got the CPU down by killing processes, but for some reason the yarn node manager (shown as ) was at 100% usage in top and afaics accepting jobs
[06:28:56] I've seen the soft lockup issue sporadically for hdfs datanodes in the past, but not this mess
[06:38:37] (of course a simple `shutdown -r` didn't work, had to powercycle)
[07:15:43] !log rerun webrequest-load-wf-upload-2021-10-20-1 after node issue
[07:15:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:18:44] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Upgrade Superset to 1.3.1 or higher - https://phabricator.wikimedia.org/T288115 (elukey) I got it working, it is a simple move in the configuration panel for the Presto Database. I moved the "Other" add...
[07:22:51] Analytics, SRE, SRE Observability (FY2021/2022-Q2): statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761 (fgiunchedi) Thank you for the quick followup everyone! Please note that this work isn't super urgent on our (o11y) end, although graphite/statsd are in "support mo...
[09:09:05] elukey: Thanks for dealing with this worker node. I'll see if I can see any pattern to the CPU soft lockups, or any other reasons why the yarn node manager might have got into this state.
[09:10:24] btullis: np! I saved the output of ps auxff into my home dir if you want to check it
[09:10:44] never seen a mess like that
[09:11:06] hopefully this will be a one-time thing, if it rehappens some more investigation will be probably needed
[10:31:26] (CR) Michael Große: [C: +1] "Looks like it would do the trick. Though I'm not sure if we still need this script in the first place? Even if we add some statistics back" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/731351 (https://phabricator.wikimedia.org/T293329) (owner: Lucas Werkmeister (WMDE))
[10:34:14] (CR) Lucas Werkmeister (WMDE): Check that change dispatch statistics are present (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/731351 (https://phabricator.wikimedia.org/T293329) (owner: Lucas Werkmeister (WMDE))
[10:37:41] (CR) Lucas Werkmeister (WMDE): [C: +2] Don't crash if wb_changes is empty [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732005 (owner: Michael Große)
[10:38:17] (Merged) jenkins-bot: Don't crash if wb_changes is empty [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732005 (owner: Michael Große)
[10:49:16] (PS1) Lucas Werkmeister (WMDE): Remove dispatch.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732277 (https://phabricator.wikimedia.org/T292604)
[10:49:36] (CR) Lucas Werkmeister (WMDE): Check that change dispatch statistics are present (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/731351 (https://phabricator.wikimedia.org/T293329) (owner: Lucas Werkmeister (WMDE))
[11:02:43] (CR) Michael Große: [C: +2] Check that change dispatch statistics are present (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/731351 (https://phabricator.wikimedia.org/T293329) (owner: Lucas Werkmeister (WMDE))
[11:03:41] (Merged) jenkins-bot: Check that change dispatch statistics are present [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/731351 (https://phabricator.wikimedia.org/T293329) (owner: Lucas Werkmeister (WMDE))
[11:04:34] (PS2) Lucas Werkmeister (WMDE): Remove dispatch.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732277 (https://phabricator.wikimedia.org/T292604)
[11:04:42] (CR) jerkins-bot: [V: -1] Remove dispatch.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732277 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[11:04:51] (CR) Lucas Werkmeister (WMDE): Check that change dispatch statistics are present (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/731351 (https://phabricator.wikimedia.org/T293329) (owner: Lucas Werkmeister (WMDE))
[11:05:34] (PS3) Lucas Werkmeister (WMDE): Remove dispatch.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732277 (https://phabricator.wikimedia.org/T292604)
[11:08:16] Analytics, Analytics-Kanban, Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (BTullis) The sixth and final snapshot that we still need to load has now been transferred to an-presto1001. ` btullis@cumin1...
[11:34:41] (PS1) Awight: Don't crash if wb_changes is empty [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732078
[11:35:20] (PS1) Awight: Check that change dispatch statistics are present [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732079 (https://phabricator.wikimedia.org/T293329)
[11:39:50] (CR) Awight: [C: +2] Don't crash if wb_changes is empty [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732078 (owner: Awight)
[11:40:14] (CR) Awight: [C: +2] Check that change dispatch statistics are present [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732079 (https://phabricator.wikimedia.org/T293329) (owner: Awight)
[11:40:26] (Merged) jenkins-bot: Don't crash if wb_changes is empty [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732078 (owner: Awight)
[11:40:54] (Merged) jenkins-bot: Check that change dispatch statistics are present [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732079 (https://phabricator.wikimedia.org/T293329) (owner: Awight)
[11:56:42] Analytics, Analytics-Kanban: Update mediawiki-history jobs spark settings - https://phabricator.wikimedia.org/T290469 (JAllemandou) Open→Resolved
[11:56:55] Analytics, Analytics-Kanban: Create monthly job for canonical pageviews - https://phabricator.wikimedia.org/T265732 (JAllemandou) Open→Resolved
[11:56:58] Analytics, Analytics-Kanban, Patch-For-Review: Migrate pagecounts-ez generation to hadoop - https://phabricator.wikimedia.org/T192474 (JAllemandou)
[11:58:35] Analytics, Product-Analytics, Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (JAllemandou) a:JAllemandou→None Removing myself as assignee as the task has no traction currently. Will revisit as needed.
[11:59:24] btullis: Heya - cassandra compactions are moving slowly, but they are moving :)
[11:59:34] btullis: I think we'll be able to restart loading tomorrow
[12:12:31] joal: Yes I agree.
[12:16:12] I've just finished removing all of the remaining snapshots from aqs101[0-5] and will be ready to restart the loading from an-presto1001 tomorrow.
[12:22:03] Awesome :)
[12:54:44] PROBLEM - Webrequests Varnishkafka log producer on cp3062 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:55:56] aouch --^ :S
[12:57:07] Looking --^ I've not yet had anything to do with varnishkafka, but I'll check now.
[12:57:10] PROBLEM - statsv Varnishkafka log producer on cp3062 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:57:44] Ah, ema is on it. Mentioned in #wikimedia-operations.
[12:57:52] It is depooled.
[12:58:04] RECOVERY - Webrequests Varnishkafka log producer on cp3062 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:58:20] RECOVERY - statsv Varnishkafka log producer on cp3062 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[13:30:42] Analytics, Analytics-Kanban, Data-Engineering, Patch-For-Review, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (BTullis)
[13:59:54] (CR) Michael Große: [C: +2] Remove dispatch.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732277 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[14:01:07] (Merged) jenkins-bot: Remove dispatch.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732277 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[14:54:06] Analytics: Refactor analytics-meta MariaDB layout to multi instance with failover - https://phabricator.wikimedia.org/T284150 (Ottomata) Had a [[ https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-data-persistence/20211020.txt | chat ]] with @Marostegui and @Kormat in #wikimedia-data-persistence today, in w...
[14:57:00] (PS1) Lucas Werkmeister (WMDE): Remove dispatch.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732329 (https://phabricator.wikimedia.org/T292604)
[15:02:00] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Upgrade Superset to 1.3.1 or higher - https://phabricator.wikimedia.org/T288115 (razzi) Yay @elukey!!!
[15:02:58] (CR) Lucas Werkmeister (WMDE): [C: +2] Remove dispatch.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732329 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[15:03:38] (Merged) jenkins-bot: Remove dispatch.php [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/732329 (https://phabricator.wikimedia.org/T292604) (owner: Lucas Werkmeister (WMDE))
[15:04:47] joal: I'm ready to deploy https://gerrit.wikimedia.org/r/c/analytics/refinery/+/724412, what do you think?
[15:31:30] Analytics-Radar, MobileFrontend, XAnalytics: MobileFrontend should use XAnalytics extension - https://phabricator.wikimedia.org/T217859 (Jdlrobson)
[15:36:26] "This is the vote for release 3.0.0 of Apache Bigtop."
[15:36:28] wowwww
[15:39:54] it would be super nice to import the packages and test a quick upgrade in hadoop test, and see if we can report back any issue (the previous cookbook to upgrade hadoop test should work in theory)
[15:40:09] but it may disrupt hadoop test for a bit
[15:40:24] and also there are the hadoop 2 -> 3 config changes
[15:41:16] weooowww
[15:45:29] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Upgrade Superset to 1.3.1 or higher - https://phabricator.wikimedia.org/T288115 (elukey) As follow up I'd check random dashboards and try to play with the sql lab before deciding to upgrade, there may b...
[15:58:36] \o/! Hadoop 3 !
[15:59:23] Hi razzi - you can proceed with the refinery deploy - sorry for being late in my answer
[16:00:19] ok cool
[16:06:12] Analytics, Analytics-Kanban: Kerberos identity for cicalese - https://phabricator.wikimedia.org/T293850 (odimitrijevic) p:Triage→High a:razzi
[16:06:47] Analytics, Analytics-Kanban: Kerberos request ticket for Naray-ctr - https://phabricator.wikimedia.org/T293814 (odimitrijevic) p:Triage→High a:razzi
[16:07:09] Analytics, Data-Services, Privacy Engineering, cloud-services-team (Kanban): Raw IPs of logged-out users disclosed in wiki-replicas - https://phabricator.wikimedia.org/T284948 (nskaggs)
[16:08:33] Analytics, Data-Services: Expose more properties to the user_properties_anon table on Wiki Replicas - https://phabricator.wikimedia.org/T226162 (nskaggs) a:odimitrijevic
[16:26:53] (CR) Razzi: [V: +2 C: +2] Update hdfs-cleaner jar for disallowlist change [analytics/refinery] - https://gerrit.wikimedia.org/r/724412 (https://phabricator.wikimedia.org/T287084) (owner: Joal)
[16:35:42] Hi #wikimedia-analytics deploying refinery as per https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery
[16:36:19] !log deploy refinery change for https://phabricator.wikimedia.org/T287084
[16:36:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:47:23] (PS3) Clare Ming: Add new web A/B test schema to track bucketing of users for a given experiment. [schemas/event/secondary] - https://gerrit.wikimedia.org/r/732089 (https://phabricator.wikimedia.org/T292587)
[16:48:26] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, and 2 others: Upgrade Superset to 1.3.1 or higher - https://phabricator.wikimedia.org/T288115 (razzi) Sounds good @elukey. Us on Data Engineering will test for the next week, then announce a "release candidate" when...
[16:49:48] (CR) DLynch: talk_page_event schema (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[16:50:08] (PS4) DLynch: talk_page_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076)
[17:54:53] Deploying refinery to HDFS
[18:10:04] (CR) MNeisler: [C: +1] talk_page_event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[18:11:52] !log Deployed refinery using scap, then deployed onto hdfs
[18:11:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:12:00] \o/ thanks razzi :)
[18:12:04] joal: the latest is on hdfs
[18:12:12] Do you have any specific asks for job restarts etc?
[18:12:15] np!
[18:12:21] razzi: would you mind merging/deploying the puppet patch that goes with the change?
[18:12:27] Oh right :)
[18:12:38] razzi: no other thing to do after that
[18:14:52] Puppet merged!
[18:15:56] awesome - I'm gonna check the next run (if you don't mind to stay nearby for a few minutes, I'd feel more comfortable - new timer first run - maybe something wrong could happen :)
[18:16:05] Analytics, Analytics-Kanban, Product-Analytics, wmfdata-python: wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (nshahquinn-wmf) @Milimetric any progress to report? 😊
[18:16:13] razzi: --^
[18:16:15] sorry
[18:20:11] (CR) Ottomata: talk_page_event schema (6 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[18:21:42] Analytics, Event-Platform, MediaWiki-extensions-CentralAuth, Unstewarded-production-error, Wikimedia-production-error: Error: Call to undefined method MediaWiki\Extension\EventStreamConfig\StreamConfig::get() - https://phabricator.wikimedia.org/T293919 (Majavah)
[18:26:10] Ah actually razzi the timer runs for the first time in a few hours
[18:26:21] Could we force a manual run of it now?
[18:35:33] Analytics, DC-Ops, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (RobH)
[18:38:46] Analytics, DC-Ops, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (RobH)
[18:39:07] Analytics, DC-Ops, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (RobH) a:Jclark-ctr
[18:39:46] razzi: quick ping - the timer
[18:39:49] arf again
[18:40:14] Analytics, Event-Platform, MediaWiki-extensions-CentralAuth, Unstewarded-production-error, Wikimedia-production-error: Error: Call to undefined method MediaWiki\Extension\EventStreamConfig\StreamConfig::get() - https://phabricator.wikimedia.org/T293919 (Majavah) Not at all sure what's going o...
[18:40:16] razzi: quick ping - the new timer created by the patch is not yet visible on an-launcher1002 - I wonder if that's expected :(
[18:41:02] Analytics, Event-Platform, MediaWiki-extensions-CentralAuth, Unstewarded-production-error, Wikimedia-production-error: Error: Call to undefined method MediaWiki\Extension\EventStreamConfig\StreamConfig::get() - https://phabricator.wikimedia.org/T293919 (Majavah)
[18:42:39] Analytics, DC-Ops, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (RobH)
[18:44:33] Analytics, Analytics-Kanban, Product-Analytics, wmfdata-python: wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (Milimetric) @nshahquinn-wmf I'm sad to report that I haven't even gotten a chance to start working on this yet. All meetings and code j...
[18:56:30] Done for today - let's revisit the timer problem tomorrow if it has not run
[19:07:23] (CR) DLynch: talk_page_event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[19:09:43] (CR) DLynch: talk_page_event schema (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[19:29:46] Analytics, Event-Platform, MediaWiki-extensions-CentralAuth, Unstewarded-production-error, Wikimedia-production-error: Error: Call to undefined method MediaWiki\Extension\EventStreamConfig\StreamConfig::get() - https://phabricator.wikimedia.org/T293919 (Krinkle) Open→Declined Agreed....
[20:12:30] Analytics, Analytics-Kanban, Product-Analytics, wmfdata-python: wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (nshahquinn-wmf) Thanks for the update, and no worries! I just wanted to get a sense since I'm planning my own wmfdata work for the quart...
[20:22:36] Sorry joal I was on lunch break, should have mentioned
[20:23:03] I had signed out of irccloud so I didn't see any of the pings :(
[21:14:57] Analytics, DC-Ops, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (RobH)
[21:15:59] Analytics-Clusters, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Deploy an-test-coord1002 to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (RobH)
[21:16:06] Analytics, DC-Ops, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (RobH)
[21:19:02] Analytics, DC-Ops, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (RobH)
[21:19:23] Analytics, DC-Ops, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (RobH) @Jclark-ctr This is a spare system we already have in netbox. It just needs to relocate from row D, as its being allocated into service as redundant to a server in D3...