[05:22:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp1085 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1085%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[05:27:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp1085 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1085%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:36:35] !log clean up my old home dir on matomo1002, ran `apt-get clean` + some other cleanup steps on matomo1002 to free space on the root partition
[06:36:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:37:37] folks there is the possibility to trim more apache2 logs etc., but I'll wait for your feedback before proceeding
[06:37:48] now matomo1002 has less than 90% of space used, no hurry
[06:38:07] mysql is growing a lot, maybe we could add another vdisk and mount a partition on it?
[06:40:59] Hi elukey - thanks for the cleaning this morning <3
[06:41:08] <3
[06:48:22] RECOVERY - Disk space on matomo1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=matomo1002&var-datasource=eqiad+prometheus/ops
[08:15:26] Oh, thanks ever so much elukey. I'll probably add the second disk as you suggest. The growth is caused by the new sound logo mini-site, I believe.
[08:42:29] I have created this task and will expedite it to add a new virtual disk to matomo: https://phabricator.wikimedia.org/T318515
[09:14:51] joal: I'm looking at working with aqu today to deploy spark3 across production: https://gerrit.wikimedia.org/r/c/operations/puppet/+/834500
[09:15:50] Do you have any particular concerns or suggestions?
[09:21:10] Could the 2GB of disk consumption on each worker have consequences on HDFS? Or is it isolated?
[09:25:32] Good check, but it is all isolated and there is enough capacity on the root volume for an extra 2GB per worker.
[09:26:03] I ran `sudo cumin --no-progress A:hadoop-worker 'df -h /'` on a cumin host and it shows that each hadoop worker has about 34 GB free on the root volume.
[09:26:48] HDFS is composed of 12 dedicated 4 TB hard drives in each worker, so they don't get used up by any normal O/S files.
[09:29:22] Should we pause airflow *before* rolling out the new conda-analytics environment to any hosts? Or is it best to wait until the new package is installed on all hadoop workers before doing that, as you have written?
[09:33:43] We don't have to pause Airflow yet. It's the first step of the airflow-dags code upgrade.
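For reference, the pausing discussed above can be done from the Airflow 2.x CLI; a minimal sketch, to be run on the relevant Airflow host as the service user (the DAG id below is a placeholder, not a real analytics DAG):

    # List DAGs, then pause one around a deployment and unpause it afterwards.
    airflow dags list
    airflow dags pause example_dag_id
    # ... deploy the new code ...
    airflow dags unpause example_dag_id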
[09:35:17] Hi btullis - Nothing specific about spark3 - I assume we'll manually test after the deploy before going for the airflow change
[09:36:17] Normally adding spark3 will not change anything for the existing airflow jobs
[09:36:22] (03PS3) 10Phuedx: mediawiki/client/metrics_event: Add mediawiki.database property [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/777844 (https://phabricator.wikimedia.org/T304689)
[09:36:28] To be clearer, we have 2 big steps today: 1/ finish the deployment of conda-analytics+Spark3 conf on the whole cluster, 2/ deploy the airflow-dags code to use Spark 3 from deb.
[09:36:54] (03CR) 10CI reject: [V: 04-1] mediawiki/client/metrics_event: Add mediawiki.database property [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/777844 (https://phabricator.wikimedia.org/T304689) (owner: 10Phuedx)
[09:37:52] OK, thanks both. I'm happy to proceed to merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/834500 if you are.
[09:39:01] OK for me
[09:39:04] all good for me
[09:39:50] question for you aqu - I assume we have the archive file in place on HDFS?
[09:40:50] !log merged the spark3 patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/834500
[09:40:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:41:15] I just checked, we have it :)
[09:41:20] !log rebooted matomo1002 at the VM level to pick up new disk
[09:41:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:41:50] wow spark 3 everywhere?
[09:42:23] Yeah!
[09:42:32] \o/
[09:43:12] Keeping an eye on https://debmonitor.wikimedia.org/packages/conda-analytics - it's already being reported on three more an-worker hosts, so that's good.
[09:46:21] joal: yes the spark-assembly is already on hdfs.
[10:01:36] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:32] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:47] btullis: from my machine I can't run an apt update on http://mirrors.wikimedia.org/debian - I get a GPG error "InRelease: At least one invalid signature was encountered." (I am building the airflow deb locally on my machine with docker)
[10:06:36] aqu: o/ https://wikitech.wikimedia.org/wiki/APT_repository#External_Access has some info (related to old Debian versions but you should be able to fix the config easily)
[10:07:04] (see also the Security section below)
[10:08:12] Thx
[10:08:17] elukey: You're much faster than I am :-)
[10:09:18] btullis: hahahahah I had it bookmarked :D
[10:20:22] 10Analytics-Radar, 10Dumps-Generation: Sample HTML Dumps - Request for feedback - https://phabricator.wikimedia.org/T257480 (10Aklapper) a:05RBrounley_WMF→03None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on...
[10:21:00] aqu: The conda-analytics package is now installed on 105 hosts. Are you intending to merge and deploy the airflow-dags change yourself?
[10:21:47] btullis: let's first do some manual testing :)
[10:21:50] btullis: Yes, I will do it. But first some more tests :) Thx!
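A fleet-wide check along the lines of the earlier `df` run is one way to confirm a rollout like this from a cumin host; a minimal sketch (the dpkg output format will vary, and a host where the package is missing will simply report a failure):

    # Confirm the conda-analytics package is installed across the hadoop workers.
    sudo cumin --no-progress A:hadoop-worker 'dpkg -l conda-analytics | tail -n 1'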
[10:22:34] Oh yes, sure. I was only trying to understand the level to which you required my help. Is there anything I can do to help right now?
[10:23:15] btullis: I don't think so, we're gonna test, and then for airflow we can do without you
[10:27:48] 10Data-Engineering, 10Event-Platform Value Stream, 10Product-Analytics: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (10Aklapper) a:05Ottomata→03None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See...
[10:33:43] 10Quarry, 10Patch-Needs-Improvement: Add rate limiting on queries execution - https://phabricator.wikimedia.org/T225869 (10Aklapper) a:05Framawiki→03None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 2...
[10:34:51] 10Analytics-Radar, 10Privacy Engineering, 10SRE, 10Traffic: Publishing project anomaly data for censorship researchers. Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (10Aklapper) a:05ssingh→03None Removing task assignee due to inactivity as this open task has been assigned for mo...
[10:36:39] 10Analytics-Wikistats, 10Data-Engineering: Implement inequality metrics for WikiStats - https://phabricator.wikimedia.org/T248964 (10Aklapper) a:05Quasipodo→03None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee o...
[10:38:16] (03PS4) 10Phuedx: mediawiki/client/metrics_event: Add mediawiki.database property [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/777844 (https://phabricator.wikimedia.org/T304689)
[10:43:23] OK, the matomo issue is now resolved. I've added 80 GB of virtual disk for use with `/var/lib/mysql` and it's now in use.
[11:03:18] !log failing back hive to an-coord1001 using DNS https://gerrit.wikimedia.org/r/c/operations/dns/+/832294
[11:03:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:48:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[11:55:46] hi dcausse - Quick note: we have 2 mjolnir jobs running currently on the cluster - one for 20220908 and the other for 20220915 - They take almost no resources, so no issue, but I thought I'd rather tell you
[11:56:55] aqu: let me know if you'd like help when updating airflow - I have quickly tested a spark-shell and it worked :)
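A quick, non-destructive smoke test along those lines, assuming the conda-analytics package ships a `spark3-shell` wrapper on the analytics clients (the wrapper name is an assumption and may differ on a given host):

    # Print the installed Spark version, then run a trivial job on YARN.
    spark3-shell --version
    echo 'spark.range(10).count()' | spark3-shell --master yarn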
[11:59:16] 10Analytics, 10Infrastructure-Foundations, 10Mail: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (10MoritzMuehlenhoff) Not sure what best to do here since we have no real insight why Gmail flagged it as such. We could maybe send these with a dedicated @wikimedia...
[12:05:59] joal: hi! thanks for the heads up, taking a look
[12:11:35] 10Analytics, 10Infrastructure-Foundations, 10Mail: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (10BTullis) I have seen other reports of this from end-users. e.g. T317545#8240269 so I think it would be a nice one to address. I don't see any reference to DKIM...
[12:57:02] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 02): Create cassandra loading HQL files from their oozie definition - https://phabricator.wikimedia.org/T311507 (10EChetty)
[12:57:04] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 02), 10Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10EChetty)
[13:18:53] joal: I've checked my MR again https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/141
[13:18:53] And you may be interested in this commit https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/141/diffs?commit_id=809d2d1376e372c2824168bdedcb5d9e0e4452dd
[13:21:10] Indeed aqu :) Thanks for this
[13:22:18] aqu: There will probably be a need for some adjustments to the skein logs, I assume
[13:23:26] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:24:10] aqu: currently reviewing the whole patch
[13:25:35] Thx
[13:28:31] aqu: 1 change requested (actually not related to spark3 - for the rest, all good)
[13:46:16] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:47:51] ! Deploying airflow-dags on analytics & analytics_test
[13:48:07] !log Deploying airflow-dags on analytics & analytics_test
[13:48:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:59:33] aqu - I'd have liked to pause the dags before deploying
[13:59:45] aqu: just to be on the safer side
[13:59:54] Done now, so let's monitor :)
[14:04:39] I did pause 26/26 of them.
[14:04:55] Ah! And you restarted all of them already!
[14:05:38] I'd have restarted them in waves: first a single hourly job, wait for success, then all hourly jobs, then all the others
[14:05:45] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 01): Create conda-base-env with last pyspark - https://phabricator.wikimedia.org/T309227 (10Antoine_Quhen)
[14:05:54] No worries though aqu, let's monitor :)
[14:22:53] mforns: I finally reviewed your patch! sorry for the delay :S
[14:23:13] no problemo joal!
will look :]
[14:28:19] 10Data-Engineering, 10API Platform, 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Editors service - https://phabricator.wikimedia.org/T288305 (10FGoodwin)
[14:34:45] aqu: mediarequest hourly ran successfully \o/
[14:36:53] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 02): Create cassandra loading HQL files from their oozie definition - https://phabricator.wikimedia.org/T311507 (10EChetty)
[14:37:04] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 02): Create cassandra loading HQL files from their oozie definition - https://phabricator.wikimedia.org/T311507 (10EChetty) 05Open→03Resolved
[14:45:25] aqu: generate_and_send_apis_metrics_to_graphite failed with an interesting error
[14:45:57] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:49:48] I think I understand the problem aqu
[14:49:53] hm
[14:53:54] ?
[14:54:16] parameter passing in skein doesn't work exactly as in airflow
[14:55:45] And, as one parameter is empty, it isn't passed as expected in skein
[14:59:00] ohhhh
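A hypothetical shell illustration of that class of bug (not the actual airflow-dags/skein code): an empty value expanded unquoted into a command line disappears entirely, so the flag before it ends up consuming the next token as its value.

    # "some_job" and its flags are placeholders for illustration only.
    FILTER=""
    some_job --since 2022-09-26 --filter $FILTER --out /tmp/x    # empty $FILTER vanishes: --filter swallows "--out"
    some_job --since 2022-09-26 --filter "$FILTER" --out /tmp/x  # quoted: an explicit empty argument is passed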
[15:06:54] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 02), 10Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10EChetty) 05Open→03Resolved
[15:08:45] (03CR) 10Xcollazo: "Still don't have merge/verify privileges 😞. Could one of you folks merge for me please?" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/834535 (owner: 10Xcollazo)
[15:12:02] btullis: I forgot to tell you - we have stopped loading "the old" cassandra cluster - it's ready for you to deprecate :)
[15:19:19] 10Data-Engineering-Kanban, 10Data Pipelines (Sprint 02): Projectviews by country Airflow job - https://phabricator.wikimedia.org/T303193 (10EChetty)
[15:43:07] (03CR) 10MNeisler: [C: 03+2] New schema: editattemptsblocked [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/820908 (https://phabricator.wikimedia.org/T310390) (owner: 10DLynch)
[15:43:47] (03Merged) 10jenkins-bot: New schema: editattemptsblocked [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/820908 (https://phabricator.wikimedia.org/T310390) (owner: 10DLynch)
[16:10:18] (03CR) 10Joal: [V: 03+2 C: 03+2] Reinstate changes from 7c5ffce unique_devices CREATE statements [analytics/refinery] - 10https://gerrit.wikimedia.org/r/834535 (owner: 10Xcollazo)
[16:10:30] xcollazo: merged! --^
[16:10:57] joal: ty!
[16:11:35] joal: also, happy to meet whenever to discuss iceberg.
[16:11:46] xcollazo: now?
[16:11:48] batcave?
[16:12:04] yes, but what are the batcave coordinates?
[16:12:42] xcollazo: https://meet.google.com/rxb-bjxn-nip
[16:13:56] !log rerunning failed webrequest-text-2022-09-26-15
[16:13:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:14:30] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:37:20] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:54:29] joal: I pushed your requested changes to the unique_devices DAGs. There's one pending comment still, please see my response in the MR https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/140
[17:39:01] joal: Seems like you can use INSERT OVERWRITE w/o specifying the partitions and Iceberg will figure it out. They call it 'dynamic'. Details at: https://iceberg.apache.org/docs/latest/spark-writes/#insert-overwrite
[17:39:36] And you could do it with SQL or with Spark syntax.
[17:52:35] mforns: sorry it took me a minute to get to the review - all good for me now
[17:52:51] mforns: Shall I merge?
[17:55:16] xcollazo: When reading "When Spark’s overwrite mode is dynamic, partitions that have rows produced by the SELECT query will be replaced" it makes me think the whole partition will be replaced - which is not what we wish!
[17:58:44] joal: thank you! I will wait to merge until I pair with Sandra, so that I'm sure that the HDFSArchiveOperator is used properly!
[18:13:50] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:30:34] By the way mforns, Sandra mentioned in the meeting that you guys had found the bug about template-expansion - I'm eager to know more, if you may
[18:44:57] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:45:29] joal: > it makes me think the whole partition will be replaced
[18:45:29] Oh, right, because the new partitioning schema in Iceberg is month(ts), correct?
[18:46:22] that's right xcollazo - the point of us moving to iceberg is to be able to lower partition time granularity
[18:48:10] joal: wanna meet about the template rendering thing?
[18:48:18] sure mforns - batcave?
[18:48:23] yep!
[19:26:42] joal: we can try to change the MERGE strategy. By default, Iceberg uses 'copy-on-write' (i.e. pay all penalties on write), but recent work allows MERGE to be done via 'merge-on-read' (i.e. create delta files, to be compacted later).
[19:27:16] interesting xcollazo!
[19:27:47] https://iceberg.apache.org/docs/latest/configuration/ Relevant TBLPROPERTIES are 'write.merge.mode', 'write.update.mode', and 'write.delete.mode'. They all default to copy-on-write.
[19:28:39] Definitely something to test
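A sketch of the setting change being discussed, using the property names from the Iceberg configuration page linked above. The table name is hypothetical, and this assumes a `spark3-sql` wrapper already configured with the Iceberg catalog:

    # Switch an Iceberg table's row-level operations from the default
    # copy-on-write to merge-on-read (delta files written on commit,
    # compacted later).
    spark3-sql -e "
      ALTER TABLE my_db.my_iceberg_table SET TBLPROPERTIES (
        'write.merge.mode'  = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.delete.mode' = 'merge-on-read'
      );
    "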
[20:07:40] !log Kill oozie geoeditors jobs for load, public monthly, and yearly after Airflow migration.
[20:07:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:40:18] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:56:14] (03PS6) 10Neil P. Quinn-WMF: Begin sanitizing Wikistories streams [analytics/refinery] - 10https://gerrit.wikimedia.org/r/832383 (https://phabricator.wikimedia.org/T312262)
[21:58:55] (03CR) 10Neil P. Quinn-WMF: "Looking forward to your +2, Marcel 😊" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/832383 (https://phabricator.wikimedia.org/T312262) (owner: 10Neil P. Quinn-WMF)
[22:57:12] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:58:13] RECOVERY - SSH on analytics1076.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
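For reference, the oozie-side cleanup logged at 20:07 above is typically done with the oozie CLI; a sketch (the coordinator ID is a placeholder, and the server URL is assumed to come from the local oozie client configuration):

    # Find the running geoeditors coordinators, then kill them one by one.
    oozie jobs -jobtype coordinator -filter status=RUNNING | grep -i geoeditors
    oozie job -kill 0012345-220801000000000-oozie-oozi-C   # placeholder coordinator id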