[00:32:46] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:46] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:46] (SystemdUnitFailed) firing: (2) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:47:46] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:29] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10JAllemandou) Hi @BTullism, We've talked with the team and decided that we'd postpone wor...
[08:31:50] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Antoine_Quhen)
[08:48:51] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10BTullis) >>! In T266641#9275063, @JAllemandou wrote: > Hi @BTullism, > We've talked with...
[08:52:46] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:56:49] btullis: good morning!
[08:56:58] btullis: I have questions regarding datahub, if you have a moment
[08:57:32] Good morning joal. Ask away. Would you like to batcave?
[08:57:45] If you don't mind, it'll be easier :)
[08:57:51] We have an SRE sync in 3 minutes. Would you like to join that?
[08:58:04] meet.google.com/ort-mznr-eeu
[08:58:07] btullis: I'll join and participate if there is free time :)
[08:58:10] Thanks :)
[08:59:27] Reminder for anyone: in about 30 minutes, I will be rebuilding s2-analytics-replica.eqiad.wmnet and it will be down for a while. T343109
[08:59:27] T343109: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109
[09:39:18] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) * Configured downtime for the s2 services and systemd icinga check on dbstore1007.
* Created a backup of the permissions: ` root@dbstore1007:~...
[09:46:14] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) * Checked https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance and confirmed that there is no maintenance on s2 (nor s3, s4) current...
[09:52:46] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:46] (SystemdUnitFailed) resolved: (2) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:02:18] !log stopping and deleting s2 on dbstore1007.
[10:02:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:04:00] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) * disabled puppet on dbstore1007 * stopped the slave threads * stopped the service * deleted /srv/sqldata.s2 ` root@dbstore1007:~# systemctl...
[10:06:45] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) Running the following as root in a screen session on cumin1001. ` root@cumin1001:~# transfer.py --type=decompress dbprov1004.eqiad.wmnet:/srv/b...
[10:07:13] (DiskSpace) firing: Disk space an-worker1128:9100:/var/lib/hadoop/data/c 5.927% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1128 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:07:33] !log transferring snapshot s2.2023-10-23--01-34-18 from dbprov1004 to dbstore1007:/srv/sqldata.s2
[10:07:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:37:13] (DiskSpace) firing: (2) Disk space an-worker1128:9100:/var/lib/hadoop/data/c 5.577% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:45:13] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) `an-test-client1002` is running on ganeti1010. ` btullis@ganeti1027:~$ sudo gnt-instance list an-test-client1002.eqiad.wmnet...
[10:47:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:03] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) > What's it with the deprecation of a central statsd server. I think we need statsd for Prometheus logging, am I right? You're right in a w...
[11:24:15] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) The transfer completed successfully. ` 2023-10-24 10:06:09 INFO: About to transfer /srv/backups/snapshots/latest/snapshot.s2.2023-10-23--01-34-1...
[11:27:13] (03CR) 10Peter Fischer: [C: 03+1] cirrussearch/update_pipeline/update remove required fields [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/967478 (owner: 10Ebernhardson)
[11:39:15] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) Executed: ` chown mysql.mysql /srv/sqldata.s2 systemctl start mariadb:s2 ` Obtained the following GTID position. ` root@dbstore1007:/srv/sqldat...
[11:46:19] 10Quarry: Quarry not restarting off main branch - https://phabricator.wikimedia.org/T349603 (10rook)
[11:48:48] 10Quarry: Quarry not restarting off main branch - https://phabricator.wikimedia.org/T349603 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/30
[11:49:37] 10Quarry: Remove quarry.wsgi on move to k8s - https://phabricator.wikimedia.org/T349605 (10rook)
[11:50:06] 10Quarry: Move quarry to magnum - https://phabricator.wikimedia.org/T349029 (10rook)
[11:50:09] 10Quarry: Remove quarry.wsgi on move to k8s - https://phabricator.wikimedia.org/T349605 (10rook)
[11:50:36] 10Quarry: Quarry not restarting off main branch - https://phabricator.wikimedia.org/T349603 (10rook) T349605 created and linked to T349029
[11:52:56] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) Restoring the user backup didn't go as well as I had hoped. ` root@dbstore1007:/srv/sqldata.s2# pt-show-grants -S /run/mysqld/mysqld.s2.sock --...
[11:55:17] 10Quarry: Quarry not restarting off main branch - https://phabricator.wikimedia.org/T349603 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/30
[11:59:54] 10Quarry: Quarry not restarting off main branch - https://phabricator.wikimedia.org/T349603 (10rook) 05Open→03Resolved
[12:03:58] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10Marostegui) You probably need to create that user, pt-show-grants output doesn't give you the create user, so you'll need to do that. ` create user `maria...
[12:04:18] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) I have manually created the `mariadb.sys@localhost` user with: ` CREATE USER `mariadb.sys`@`localhost` ACCOUNT LOCK PASSWORD EXPIRE; FLUSH PRIV...
[12:09:12] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) >>! In T343109#9275694, @Marostegui wrote: > You probably need to create that user, pt-show-grants output doesn't give you the create user, so yo...
[12:11:18] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Data Products (Sprint 03)): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10WDoranWMF)
[12:17:38] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) Great! I can now build a conda environment for testing this. e.g. ` (base) btullis@an-test-client1002:~$ conda create -n spa...
[12:17:46] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10WDoranWMF)
[12:18:50] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10WDoranWMF)
[12:19:22] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10hnowlan) This change has been deployed, and 404 errors have greatly dropped off. Please update if you see any persi...
[12:28:57] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) Install wmfdata-python into my new environment. ` pip install git+https://github.com/wikimedia/wmfdata-python.git@v2.0.1 ` N...
[12:34:20] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10JAllemandou) I think it's worth adding a query-logger :) I think it's worth spending the...
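[Editor's note, not part of the log] T344910 above is about running version-specific YARN shuffle services side by side. As a minimal sketch of how a client session would opt into one of them: `spark.shuffle.service.name` and `spark.shuffle.service.port` are standard Spark options (the name override requires Spark >= 3.2), but the service name and port below are hypothetical placeholders for whatever the NodeManagers actually register.
```python
# Hypothetical PySpark session pinned to a version-specific external shuffle
# service. Only the spark.shuffle.service.{name,port} semantics are standard
# Spark; the values "spark_shuffle_3_3" and 7338 are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffler-smoke-test")
    .master("yarn")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    # Spark <= 3.1 can only use the default service name "spark_shuffle",
    # which is why parallel Spark versions need separately named shufflers.
    .config("spark.shuffle.service.name", "spark_shuffle_3_3")
    .config("spark.shuffle.service.port", "7338")
    .getOrCreate()
)

# Force a shuffle so the external shuffle service actually serves blocks.
df = spark.range(10_000_000)
df.groupBy((df["id"] % 100).alias("bucket")).count().show(5)
```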
[12:41:25] !log Drop wmf.referrer_daily hive table and data
[12:41:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:48:45] (03CR) 10Gmodena: [C: 03+2] cirrussearch/update_pipeline/update remove required fields [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/967478 (owner: 10Ebernhardson)
[12:50:57] 10Quarry, 10Patch-For-Review: Create minikube deploy for quarry - https://phabricator.wikimedia.org/T301469 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/28
[12:51:05] (03Merged) 10jenkins-bot: cirrussearch/update_pipeline/update remove required fields [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/967478 (owner: 10Ebernhardson)
[12:51:41] 10Quarry: Update helm for quarry on pr - https://phabricator.wikimedia.org/T349031 (10rook)
[12:51:43] 10Quarry, 10Patch-For-Review: Create minikube deploy for quarry - https://phabricator.wikimedia.org/T301469 (10rook) 05Open→03Resolved
[13:48:26] hi kevinbazira, quick note about https://wikitech.wikimedia.org/wiki/Envoy#Example_(calling_mw-api), we're currently trying to move requests from restbase to rest-gateway
[13:49:00] sorry for the change, I'm still getting up to speed with the details, but just fyi, this is going to change slightly in the next few days
[13:57:16] hi milimetric o/
[13:57:18] no problem. please let us know when the change has been made so that we can update the settings on our end. thank you.
[14:02:27] kevinbazira: I've updated the docs to reflect the newer rest-gateway way - just a different listener/port, URLs stay the same https://wikitech.wikimedia.org/wiki/Envoy#Example_(calling_mw-api)
[14:03:20] thanks hnowlan!
[14:03:54] lemme know if you have any issues
[14:04:21] sure sure
[14:12:26] milimetric: o/ we hit the AQS API directly (via the AQS LVS VIP), bypassing restbase. For internal calls, do you prefer clients to use rest-gateway? If not we are set :)
[14:12:31] cc: hnowlan: --^
[14:16:33] as I understand it, we're trying to get any internal stuff off of restbase and onto rest-gateway, as services are available via that new route. What's the AQS LVS VIP? Like the public endpoint? That's fine, it'll be routed properly to rest-gateway behind the scenes. The idea is that we're looking to sunset AQS 1.0 as soon as possible
[14:16:35] elukey: which endpoints do you use it for? everything but edits, bytes_difference and edited_pages are migrated to the rest-gateway
[14:17:04] (and those will be migrated imminently)
[14:17:06] it'll respond internally for all requests though
[14:17:07] hnowlan: aqs.discovery.wmnet basically
[14:17:18] via the aqs mesh endpoint (k8s)
[14:17:45] if you add the listener for rest-gateway and change the mesh port to rest-gateway you should get identical behaviours
[14:17:47] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:17:51] (although it's backed by different services)
[14:17:54] no rush right now though
[14:18:32] hnowlan: the apis are different, but we can surely adjust. I thought that AQS discovery was preferred vs using rest-gateway (more for external users)
[14:19:51] elukey: which apis are different? rest-gateway *should* handle identical paths
[14:19:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:19:59] elukey: aiui ultimately all requests will go to rest-gateway and we'll deprecate aqs1 entirely - we don't have a timeline for that though
[14:20:14] The `aqs` envoy listener sends it via the LVS direct to the nodejs service that is co-located with the cassandra cluster in eqiad. This nodejs service is what is being deprecated.
[14:20:44] hnowlan: do you mean the same as restbase? Because the AQS nodejs has a different API, this is what I meant.. but if it will be deprecated it makes sense
[14:20:45] That's why the advice I gave you last week (to use the `aqs` listener) was outdated.
[14:21:00] elukey: ahhh
[14:21:11] elukey: yeah I meant restbase
[14:21:48] we can definitely move to rest-gateway, are there any examples of how to query the pageview API?
[14:21:54] to be clear - a proper plan and timeline around deprecation etc is yet to be established, it'd just be good to avoid *new* stuff being built
[14:21:56] ah sorry same as restbase, my bad
[14:21:57] okok
[14:22:13] (DiskSpace) firing: (3) Disk space an-worker1128:9100:/var/lib/hadoop/data/c 2.866% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:22:16] we can move to rest-gateway, will work with kevinbazira on this
[14:23:01] lemme know if you need any help or we can make any improvements!
[14:23:51] sure!
[14:26:17] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) I created the role manually: ` CREATE ROLE 'research_role'; ` Then the other two commands worked. ` GRANT `research_role` TO `research`@`10.%`;...
[14:28:55] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) 05Open→03Resolved
[14:31:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:47] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:11] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10jcrespo) As a followup, consider in the future documenting the special grants in puppet. We don't have a good solution to monitor and assign them, but a...
[14:54:53] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) >>! In T343109#9276578, @jcrespo wrote: > As a followup, consider in the future documenting the special grants in puppet. We don't have a good...
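[Editor's note, not part of the log] Since elukey asks above for an example of querying the pageview API: a minimal sketch against the public REST endpoint. The `/metrics/pageviews/per-article/...` route is the documented one and stays the same behind rest-gateway; for internal mesh calls only the base host/port (the local envoy listener) would differ, and that listener config is not shown here.
```python
# Minimal pageview API query (editor's sketch). The per-article route below
# is the documented public one; internally, only BASE would point at the
# local rest-gateway envoy listener instead.
import requests

BASE = "https://wikimedia.org/api/rest_v1"

def daily_pageviews(project: str, article: str, start: str, end: str) -> dict:
    """Return {timestamp: views} for an article; start/end as YYYYMMDD."""
    url = (
        f"{BASE}/metrics/pageviews/per-article/{project}"
        f"/all-access/all-agents/{article}/daily/{start}/{end}"
    )
    resp = requests.get(url, headers={"User-Agent": "pageview-example/0.1"})
    resp.raise_for_status()
    return {item["timestamp"]: item["views"] for item in resp.json()["items"]}

print(daily_pageviews("en.wikipedia", "Earth", "20231001", "20231007"))
```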
[15:00:55] (03CR) 10Sbisson: [C: 03+1] T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1)
[15:00:59] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) I have now added those missing role grants: ` GRANT USAGE ON *.* TO research_role; GRANT SELECT ON `wikishared`.* TO research_role; GRANT SELEC...
[15:02:47] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:02:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:47] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:16] 10Data-Engineering, 10serviceops, 10Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) I tried to profile nodejs' code via `--prof` and `--prof-process` to have a better view of the CPU usage. I tried first with `perf` but I didn't obtain useful info...
[15:29:49] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [Event Platform] eventutilities-python should convert pyflink Instants to python DateTimes - https://phabricator.wikimedia.org/T349640 (10Ottomata)
[15:52:14] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Antoine_Quhen)
[15:54:26] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Antoine_Quhen)
[16:46:11] !log Deploying latest DAGs to analytics Airflow instance
[16:46:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:16:01] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) a:03rook
[17:18:18] (03CR) 10Nettrom: [C: 04-1] "Things largely look good, but there are a couple of outstanding issues I'd like to see resolved before we deploy this." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime)
[17:18:36] 10Data-Engineering-Planning, 10Event-Platform (Sprint 14 B), 10Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (10CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/...
[17:19:30] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10CodeReviewBot) otto opened https://gitlab.wikimedia...
[17:31:52] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata)
[17:32:47] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:32:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:43] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) I've upgraded eventgate instances in deployment-prep / beta. Things look...
[17:47:47] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:51:57] PROBLEM - Disk space on Hadoop worker on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:08:13] PROBLEM - Disk space on Hadoop worker on an-worker1146 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:22:13] (DiskSpace) firing: (3) Disk space an-worker1128:9100:/var/lib/hadoop/data/c 0.1228% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:28:21] RECOVERY - Disk space on Hadoop worker on an-worker1146 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:32:13] (DiskSpace) firing: (3) Disk space an-worker1128:9100:/var/lib/hadoop/data/c 0.01951% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:35:11] RECOVERY - Disk space on Hadoop worker on an-worker1128 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:37:13] (DiskSpace) firing: (3) Disk space an-worker1128:9100:/var/lib/hadoop/data/c 0.0002979% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:42:55] xcollazo: Heya - I think it would have been a good idea not to let airflow run the missed metadata_ingest_daily
[18:42:58] Schedule: @daily info Next Run: 2023-10-23, 00:00:00
[18:42:59] runs
[18:43:37] it ran a few instances of each task consecutively, while it could have been only one :)
[18:43:50] no big deal though, just a thought
[18:44:20] got it, makes sense. next time!
[18:44:27] xcollazo: cheers :)
[18:44:40] xcollazo: another question - are you rebuilding a new dumps table?
[18:45:04] I assume so, based on the latest PRs I have seen - probably with month-partitioning (yeah!)
[18:45:40] If so, I wonder if the issues we're seeing with disk-space usage could be related to a backfilling job
[18:45:40] that `run_hive_event_sanitized_ingestion` task from metadata_ingest_daily runs for almost two hours...
[18:46:08] xcollazo: millions of partitions :)
[18:46:13] > are you rebuilding a new dumps table?
[18:46:14] yes
[18:47:24] > If so, I wonder if the issues we're seeing with disk-space usage could be related to a backfilling job
[18:47:24] likely, but this one will be month-partitioned as you mention, so as soon as it clears, I will nuke the old wikitext_raw_rc1, which will give us back ~7M inodes.
[18:47:40] Yes, I know that :)
[18:48:02] wait...
[18:48:05] I'm trying to build an understanding of why we get so much data in such a small number of partitions
[18:48:06] You said space
[18:48:11] xcollazo: I think it's temporary data
[18:48:19] xcollazo: I said space :)
[18:48:21] indeed!
[18:48:30] For once, not file-count related :)
[18:49:10] ok yes this tracks
[18:49:17] I am running a giant shuffle right now
[18:49:19] https://yarn.wikimedia.org/proxy/application_1695896957545_164003/SQL/execution/?id=4
[18:50:25] ~36.2 TiB shuffle write
[18:50:30] easy stuff
[18:50:33] :D
[18:52:29] I wonder though, I've run this pipeline a couple of times before and don't recall space issues.
[18:52:54] xcollazo: I think the monthly partitioning could change the game
[18:53:24] by condensing some skewed data onto some workers
[18:54:44] example: for the stage that had 6 failed tasks, they were all on an-worker1146 - which has been showing space issues
[18:55:24] 3 of the failed tasks are due to java.io.IOException: No space left on device
[18:55:36] Nah, actually all of them are
[18:59:27] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (10Ottomata)
[19:01:06] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (10CodeReviewBot) ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/merge_requests/7 Update python to 3.10
[19:05:48] joal: Ok, things look better now. The data sort phase is now ongoing, and I expect it to take ~15 hours. Will monitor it.
[19:06:36] The graphs are fun xcollazo - they show shuffled data being handled as temporary files through capacity remaining: no change in hdfs-used-space, but a capacity drop. Now we'll see an HDFS usage bump as well as a capacity drop, and when the job finishes, the capacity will come back
[19:07:01] xcollazo: you remember: same hardware storage space for temporary data and HDFS data!
[19:11:13] joal: right right. Maybe if I keep breaking the cluster I'll eventually get discrete hard drive pools? :D
[19:11:27] xcollazo: Thanks for keeping an eye on it - I'll check tomorrow morning as well
[19:11:40] xcollazo: ;D
[19:11:51] Gone for tonight folks!
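[Editor's note, not part of the log] joal's point above is Airflow's catch-up behavior: with the default catchup=True, a daily DAG that was blocked re-runs every missed interval back to back once unblocked. A minimal sketch of the two usual mitigations follows; only the DAG id and task id echo names from the log, everything else is illustrative (assumed Airflow 2.x).
```python
# Sketch of avoiding consecutive catch-up runs in Airflow (assumed 2.x APIs).
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="metadata_ingest_daily",
    schedule_interval="@daily",
    start_date=datetime(2023, 10, 1),
    # catchup=False skips the backlog of missed intervals and runs only the
    # most recent one (the default, catchup=True, replays them all).
    catchup=False,
    # If the backlog is wanted, max_active_runs=1 at least serializes it.
    max_active_runs=1,
) as dag:
    EmptyOperator(task_id="run_hive_event_sanitized_ingestion")
```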
[19:14:14] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata)
[19:14:36] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) a:03Ottomata
[19:38:25] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata)
[19:39:44] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata)
[19:47:57] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata)
[19:48:55] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) We'd like to do this soon. I'm aiming for November 6th....
[19:50:44] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (10CodeReviewBot) ebernhardson opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/527 search: Update mj...
[19:55:46] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (10CodeReviewBot) ebernhardson merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/527 search: Update mj...
[19:57:37] Starting build #28 for job wikimedia-event-utilities-maven-release-docker
[20:00:46] Project wikimedia-event-utilities-maven-release-docker build #28: 09SUCCESS in 3 min 9 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/28/
[20:11:25] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) TL;DR: I was able to verify that Spark 3.1.2 and Spark 3.3.2 work as expected on the test cluster 🎉 . I ran out of time, bu...
[20:14:16] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10xcollazo) > cc @xcollazo for Data Products Sounds good to me.
[20:19:29] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) Deployed the above chan...
[20:21:53] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) BTW, just fixed permissions on my Spark 3.3.2 assembly file to be readable by all, in case other folks want to repro the above n...
[20:23:44] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) Draft announcement: https://docs.google.com/document/d/1Lw...
[20:31:28] 10Data-Platform-SRE, 10serviceops-radar, 10Epic: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking)
[20:32:05] 10Data-Platform-SRE, 10serviceops-radar, 10Epic: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking)
[20:32:57] 10Data-Platform-SRE, 10serviceops-radar, 10Epic: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking)
[20:43:55] 10Data-Platform-SRE, 10serviceops-radar, 10Epic: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking)
[20:47:12] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar, 10Epic: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking)
[22:15:41] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) 05Open→03Resolved p:05Triage→03Low >>! In T347605#9249978, @dr0ptp4kt wrote: > @bking just wanted to express my gratitude for the...
[22:37:28] (DiskSpace) firing: Disk space analytics1075:9100:/var/lib/hadoop/data/l 0.6665% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=analytics1075 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:54:27] PROBLEM - Disk space on Hadoop worker on analytics1075 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[23:40:21] RECOVERY - Disk space on Hadoop worker on analytics1075 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[23:42:13] (DiskSpace) resolved: Disk space analytics1075:9100:/var/lib/hadoop/data/l 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=analytics1075 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace