[05:22:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp1085 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1085%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[05:27:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp1085 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1085%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:36:35] !log clean up my old home dir on matomo1002, ran `apt-get clean` + some other cleanup steps on matomo1002 to free space on the root partition
[06:36:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:37:37] folks there is the possibility to trim more apache2 logs etc., but I'll wait for your feedback before proceeding
[06:37:48] now matomo1002 has less than 90% of space used, no hurry
[06:38:07] mysql is growing a lot, maybe we could add another vdisk and mount a partition on it?
[06:40:59] Hi elukey - thanks for the cleaning this morning <3
[06:41:08] <3
[06:48:22] RECOVERY - Disk space on matomo1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=matomo1002&var-datasource=eqiad+prometheus/ops
[08:15:26] Oh, thanks ever so much elukey. I'll probably add the second disk as you suggest. The growth is caused by the new sound logo mini-site, I believe.
[08:42:29] I have created this task and will expedite it to add a new virtual disk to matomo: https://phabricator.wikimedia.org/T318515
[09:14:51] joal: I'm looking at working with aqu today to deploy spark3 across production: https://gerrit.wikimedia.org/r/c/operations/puppet/+/834500
[09:15:50] Do you have any particular concerns or suggestions?
[09:21:10] Could the 2GB of disk consumption on each worker have consequences on HDFS? Or is it isolated?
[09:25:32] Good check, but it is all isolated and there is enough capacity on the root volume for an extra 2GB per worker.
[09:26:03] I ran `sudo cumin --no-progress A:hadoop-worker 'df -h /'` on a cumin host and it shows that each hadoop worker has about 34 GB free on the root volume.
[09:26:48] HDFS is composed of 12 dedicated 4 TB hard drives in each worker, so they don't get used up by any normal O/S files.
[09:29:22] Should we pause airflow *before* rolling out the new conda-analytics environment to any hosts? Or is it best to wait until the new package is installed on all hadoop workers before doing that, as you have written?
[09:33:43] We don't have to pause Airflow yet. It's the first step of the airflow-dags code upgrade.
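For reference, the pausing discussed above can be done from the Airflow 2.x CLI; a minimal sketch, to be run on the relevant Airflow host as the service user (the DAG id below is a placeholder, not a real analytics DAG):

    # List DAGs, then pause one around a deployment and unpause it afterwards.
    airflow dags list
    airflow dags pause example_dag_id
    # ... deploy the new code ...
    airflow dags unpause example_dag_id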
[09:35:17] Hi btullis - Nothing specific about spark3 - I assume we'll manually test after the deploy before going for the airflow change
[09:36:17] Normally adding spark3 will not change anything for the existing airflow jobs
[09:36:22] (03PS3) 10Phuedx: mediawiki/client/metrics_event: Add mediawiki.database property [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/777844 (https://phabricator.wikimedia.org/T304689)
[09:36:28] To be clearer, we have 2 big steps today: 1/ finish the deployment of conda-analytics+Spark3 conf on the whole cluster, 2/ deploy the airflow-dags code to use Spark 3 from deb.
[09:36:54] (03CR) 10CI reject: [V: 04-1] mediawiki/client/metrics_event: Add mediawiki.database property [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/777844 (https://phabricator.wikimedia.org/T304689) (owner: 10Phuedx)
[09:37:52] OK, thanks both. I'm happy to proceed to merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/834500 if you are.
[09:39:01] OK for me
[09:39:04] all good for me
[09:39:50] question for you aqu - I assume we have the archive file in place on HDFS?
[09:40:50] !log merged the spark3 patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/834500
[09:40:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:41:15] I just checked, we have it :)
[09:41:20] !log rebooted matomo1002 at the VM level to pick up new disk
[09:41:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:41:50] wow spark 3 everywhere?
[09:42:23] Yeah!
[09:42:32] \o/
[09:43:12] Keeping an eye on https://debmonitor.wikimedia.org/packages/conda-analytics - it's already being reported on three more an-worker hosts, so that's good.
[09:46:21] joal: yes the spark-assembly is already on hdfs.
[10:01:36] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:32] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:47] btullis: from my machine I can't run an apt update on http://mirrors.wikimedia.org/debian - I get a GPG error "InRelease: At least one invalid signature was encountered." (I am building the airflow deb locally on my machine with docker)
[10:06:36] aqu: o/ https://wikitech.wikimedia.org/wiki/APT_repository#External_Access has some info (related to old Debian versions but you should be able to fix the config easily)
[10:07:04] (see also the Security section below)
[10:08:12] Thx
[10:08:17] elukey: You're much faster than I am :-)
[10:09:18] btullis: hahahahah I had it bookmarked :D
[10:20:22] 10Analytics-Radar, 10Dumps-Generation: Sample HTML Dumps - Request for feedback - https://phabricator.wikimedia.org/T257480 (10Aklapper) a:05RBrounley_WMF→03None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on...
[10:21:00] aqu: The conda-analytics package is now installed on 105 hosts. Are you intending to merge and deploy the airflow-dags change yourself?
[10:21:47] btullis: let's first do some manual testing :)
[10:21:50] btullis: Yes, I will do it. But first some more tests :) Thx!
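A fleet-wide check along the lines of the earlier `df` run is one way to confirm a rollout like this from a cumin host; a minimal sketch (the dpkg output format will vary, and a host where the package is missing will simply report a failure):

    # Confirm the conda-analytics package is installed across the hadoop workers.
    sudo cumin --no-progress A:hadoop-worker 'dpkg -l conda-analytics | tail -n 1'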
[10:22:34] Oh yes, sure. I was only trying to understand the level to which you required my help. Is there anything I can do to help right now?
[10:23:15] btullis: I don't think so, we're gonna test, and then for airflow we can do without you
[10:27:48] 10Data-Engineering, 10Event-Platform Value Stream, 10Product-Analytics: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (10Aklapper) a:05Ottomata→03None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See...
[10:33:43] 10Quarry, 10Patch-Needs-Improvement: Add rate limiting on queries execution - https://phabricator.wikimedia.org/T225869 (10Aklapper) a:05Framawiki→03None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 2...
[10:34:51] 10Analytics-Radar, 10Privacy Engineering, 10SRE, 10Traffic: Publishing project anomaly data for censorship researchers. Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (10Aklapper) a:05ssingh→03None Removing task assignee due to inactivity as this open task has been assigned for mo...
[10:36:39] 10Analytics-Wikistats, 10Data-Engineering: Implement inequality metrics for WikiStats - https://phabricator.wikimedia.org/T248964 (10Aklapper) a:05Quasipodo→03None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee o...
[10:38:16] (03PS4) 10Phuedx: mediawiki/client/metrics_event: Add mediawiki.database property [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/777844 (https://phabricator.wikimedia.org/T304689)
[10:43:23] OK, the matomo issue is now resolved. I've added 80 GB of virtual disk for use with `/var/lib/mysql` and it's now in use.
[11:03:18] !log failing back hive to an-coord1001 using DNS https://gerrit.wikimedia.org/r/c/operations/dns/+/832294
[11:03:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:48:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[11:55:46] hi dcausse - Quick note: we have 2 mjolnir jobs running currently on the cluster - one for 20220908 and the other for 20220915 - They take almost no resources, so no issue, but I thought I'd rather tell you
[11:56:55] aqu: let me know if you'd like help when updating airflow - I have quickly tested a spark-shell and it worked :)
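A quick, non-destructive smoke test along those lines, assuming the conda-analytics package ships a `spark3-shell` wrapper on the analytics clients (the wrapper name is an assumption and may differ on a given host):

    # Print the installed Spark version, then run a trivial job on YARN.
    spark3-shell --version
    echo 'spark.range(10).count()' | spark3-shell --master yarn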
[11:59:16] 10Analytics, 10Infrastructure-Foundations, 10Mail: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (10MoritzMuehlenhoff) Not sure what best to do here since we have no real insight why Gmail flagged it as such. We could maybe send these with a dedicated @wikimedia...
[12:05:59] joal: hi! thanks for the heads up, taking a look
[12:11:35] 10Analytics, 10Infrastructure-Foundations, 10Mail: kerberos manage_principals.py emails go to spam - https://phabricator.wikimedia.org/T318155 (10BTullis) I have seen other reports of this from end-users. e.g. T317545#8240269 so I think it would be a nice one to address. I don't see any reference to DKIM...
[12:57:02] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 02): Create cassandra loading HQL files from their oozie definition - https://phabricator.wikimedia.org/T311507 (10EChetty)
[12:57:04] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 02), 10Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10EChetty)
[13:18:53] joal: I've checked my MR again https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/141
[13:18:53] And you may be interested in this commit https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/141/diffs?commit_id=809d2d1376e372c2824168bdedcb5d9e0e4452dd
[13:21:10] Indeed aqu :) Thanks for this
[13:22:18] aqu: There will probably be a need for some adjustments to the skein logs, I assume
[13:23:26] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:24:10] aqu: currently reviewing the whole patch
[13:25:35] Thx
[13:28:31] aqu: 1 change requested (actually not related to spark3 - for the rest, all good)
[13:46:16] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:47:51] ! Deploying airflow-dags on analytics & analytics_test
[13:48:07] !log Deploying airflow-dags on analytics & analytics_test
[13:48:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:59:33] aqu - I'd have liked to pause the dags before deploying
[13:59:45] aqu: just to be on the safer side
[13:59:54] Done now, so let's monitor :)
[14:04:39] I did pause 26/26 of them.
[14:04:55] Ah! And you restarted all of them already!
[14:05:38] I'd have restarted them in waves: first a single hourly job, wait for success, then all hourly jobs, then all the others
[14:05:45] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 01): Create conda-base-env with last pyspark - https://phabricator.wikimedia.org/T309227 (10Antoine_Quhen)
[14:05:54] No worries though aqu, let's monitor :)
[14:22:53] mforns: I finally reviewed your patch! sorry for the delay :S
[14:23:13] no problemo joal!
will look :]
[14:28:19] 10Data-Engineering, 10API Platform, 10Platform Engineering Roadmap, 10User-Eevans: AQS 2.0: Editors service - https://phabricator.wikimedia.org/T288305 (10FGoodwin)
[14:34:45] aqu: mediarequest hourly ran successfully \o/
[14:36:53] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 02): Create cassandra loading HQL files from their oozie definition - https://phabricator.wikimedia.org/T311507 (10EChetty)
[14:37:04] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 02): Create cassandra loading HQL files from their oozie definition - https://phabricator.wikimedia.org/T311507 (10EChetty) 05Open→03Resolved
[14:45:25] aqu: generate_and_send_apis_metrics_to_graphite failed with an interesting error
[14:45:57] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:49:48] I think I understand the problem aqu
[14:49:53] hm
[14:53:54] ?
[14:54:16] parameter passing in skein doesn't work exactly as in airflow
[14:55:45] And, as one parameter is empty, it isn't passed as expected in skein
[14:59:00] ohhhh
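A hypothetical shell illustration of that class of bug (not the actual airflow-dags/skein code): an empty value expanded unquoted into a command line disappears entirely, so the flag before it ends up consuming the next token as its value.

    # "some_job" and its flags are placeholders for illustration only.
    FILTER=""
    some_job --since 2022-09-26 --filter $FILTER --out /tmp/x    # empty $FILTER vanishes: --filter swallows "--out"
    some_job --since 2022-09-26 --filter "$FILTER" --out /tmp/x  # quoted: an explicit empty argument is passed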
[15:06:54] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 02), 10Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10EChetty) 05Open→03Resolved
[15:08:45] (03CR) 10Xcollazo: "Still don't have merge/verify privileges 😞. Could one of you folks merge for me please?" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/834535 (owner: 10Xcollazo)
[15:12:02] btullis: I forgot to tell you - we have stopped loading "the old" cassandra cluster - it's ready for you to deprecate :)
[15:19:19] 10Data-Engineering-Kanban, 10Data Pipelines (Sprint 02): Projectviews by country Airflow job - https://phabricator.wikimedia.org/T303193 (10EChetty)
[15:43:07] (03CR) 10MNeisler: [C: 03+2] New schema: editattemptsblocked [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/820908 (https://phabricator.wikimedia.org/T310390) (owner: 10DLynch)
[15:43:47] (03Merged) 10jenkins-bot: New schema: editattemptsblocked [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/820908 (https://phabricator.wikimedia.org/T310390) (owner: 10DLynch)
[16:10:18] (03CR) 10Joal: [V: 03+2 C: 03+2] Reinstate changes from 7c5ffce unique_devices CREATE statements [analytics/refinery] - 10https://gerrit.wikimedia.org/r/834535 (owner: 10Xcollazo)
[16:10:30] xcollazo: merged! --^
[16:10:57] joal: ty!
[16:11:35] joal: also, happy to meet whenever to discuss iceberg.
[16:11:46] xcollazo: now?
[16:11:48] batcave?
[16:12:04] yes, but what are the batcave coordinates?
[16:12:42] xcollazo: https://meet.google.com/rxb-bjxn-nip
[16:13:56] !log rerunning failed webrequest-text-2022-09-26-15
[16:13:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:14:30] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:37:20] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:54:29] joal: I pushed your requested changes to the unique_devices DAGs. There's one pending comment still, please see my response in the MR https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/140
[17:39:01] joal: Seems like you can use INSERT OVERWRITE w/o specifying the partitions and Iceberg will figure it out. They call it 'dynamic'. Details at: https://iceberg.apache.org/docs/latest/spark-writes/#insert-overwrite
[17:39:36] And you could do it with SQL or with Spark syntax.
[17:52:35] mforns: sorry it took me a minute to get to the review - all good for me now
[17:52:51] mforns: Shall I merge?
[17:55:16] xcollazo: When reading "When Spark’s overwrite mode is dynamic, partitions that have rows produced by the SELECT query will be replaced" it makes me think the whole partition will be replaced - which is not what we wish!
[17:58:44] joal: thank you! I will wait to merge until I pair with Sandra, so that I'm sure that the HDFSArchiveOperator is used properly!
[18:13:50] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:30:34] By the way mforns, Sandra mentioned in the meeting that you guys had found the bug about template-expansion - I'm eager to know more, if you may
[18:44:57] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:45:29] joal: > it makes me think the whole partition will be replaced
[18:45:29] Oh, right, because the new partitioning schema in Iceberg is month(ts), correct?
[18:46:22] that's right xcollazo - the point of us moving to iceberg is to be able to lower partition time granularity
[18:48:10] joal: wanna meet about the template rendering thing?
[18:48:18] sure mforns - batcave?
[18:48:23] yep!
[19:26:42] joal: we can try to change the MERGE strategy. By default, Iceberg uses 'copy-on-write' (i.e. pay all penalties on write), but recent work allows MERGE to be done via 'merge-on-read' (i.e. create delta files, to be compacted later).
[19:27:16] interesting xcollazo!
[19:27:47] https://iceberg.apache.org/docs/latest/configuration/ Relevant TBLPROPERTIES are 'write.merge.mode', 'write.update.mode', and 'write.delete.mode'. They all default to copy-on-write.
[19:28:39] Definitely something to test
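A sketch of the setting change being discussed, using the property names from the Iceberg configuration page linked above. The table name is hypothetical, and this assumes a `spark3-sql` wrapper already configured with the Iceberg catalog:

    # Switch an Iceberg table's row-level operations from the default
    # copy-on-write to merge-on-read (delta files written on commit,
    # compacted later).
    spark3-sql -e "
      ALTER TABLE my_db.my_iceberg_table SET TBLPROPERTIES (
        'write.merge.mode'  = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.delete.mode' = 'merge-on-read'
      );
    "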
[20:07:40] !log Kill oozie geoeditors jobs for load, public monthly, and yearly after Airflow migration.
[20:07:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:40:18] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:56:14] (03PS6) 10Neil P. Quinn-WMF: Begin sanitizing Wikistories streams [analytics/refinery] - 10https://gerrit.wikimedia.org/r/832383 (https://phabricator.wikimedia.org/T312262)
[21:58:55] (03CR) 10Neil P. Quinn-WMF: "Looking forward to your +2, Marcel 😊" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/832383 (https://phabricator.wikimedia.org/T312262) (owner: 10Neil P. Quinn-WMF)
[22:57:12] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:58:13] RECOVERY - SSH on analytics1076.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
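For reference, the oozie-side cleanup logged at 20:07 above is typically done with the oozie CLI; a sketch (the coordinator ID is a placeholder, and the server URL is assumed to come from the local oozie client configuration):

    # Find the running geoeditors coordinators, then kill them one by one.
    oozie jobs -jobtype coordinator -filter status=RUNNING | grep -i geoeditors
    oozie job -kill 0012345-220801000000000-oozie-oozi-C   # placeholder coordinator id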