[00:32:46] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:15:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:46] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:46] (SystemdUnitFailed) firing: (2) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:47:46] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:29] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10JAllemandou) Hi @BTullism, We've talked with the team and decided that we'd postpone wor...
[08:31:50] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Antoine_Quhen)
[08:48:51] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10BTullis) >>! In T266641#9275063, @JAllemandou wrote: > Hi @BTullism, > We've talked with...
[08:52:46] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:56:49] btullis: good morning!
[08:56:58] btullis: I have questions regarding datahub, if you have a moment
[08:57:32] Good morning joal. Ask away. Would you like to batcave?
[08:57:45] If you don't mind, it'll be easier :)
[08:57:51] We have an SRE sync in 3 minutes. Would you like to join that?
[08:58:04] meet.google.com/ort-mznr-eeu
[08:58:07] btullis: I'll join and participate if there is free time :)
[08:58:10] Thanks :)
[08:59:27] Reminder for anyone: in about 30 minutes, I will be rebuilding s2-analytics-replica.eqiad.wmnet and it will be down for a while. T343109
[08:59:27] T343109: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109
[09:39:18] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) * Configured downtime for the s2 services and systemd icinga check on dbstore1007.
* Created a backup of the permissions: ` root@dbstore1007:~...
[09:46:14] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) * Checked https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance and confirmed that there is no maintenance on s2 (nor s3, s4) current...
[09:52:46] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:46] (SystemdUnitFailed) resolved: (2) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:02:18] !log stopping and deleting s2 on dbstore1007.
[10:02:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:04:00] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) * disabled puppet on dbstore1007 * stopped the slave threads * stopped the service * deleted /srv/sqldata.s2 ` root@dbstore1007:~# systemctl...
[10:06:45] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) Running the following as root in a screen session on cumin1001. ` root@cumin1001:~# transfer.py --type=decompress dbprov1004.eqiad.wmnet:/srv/b...
[10:07:13] (DiskSpace) firing: Disk space an-worker1128:9100:/var/lib/hadoop/data/c 5.927% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-worker1128 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:07:33] !log transferring snapshot s2.2023-10-23--01-34-18 from dbprov1004 to dbstore1007:/srv/sqldata.s2
[10:07:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:37:13] (DiskSpace) firing: (2) Disk space an-worker1128:9100:/var/lib/hadoop/data/c 5.577% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:45:13] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) `an-test-client1002` is running on ganeti1010. ` btullis@ganeti1027:~$ sudo gnt-instance list an-test-client1002.eqiad.wmnet...
[10:47:46] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:46] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:03] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) > What's it with the deprecation of a central statsd server. I think we need statsd for Prometheus logging, am I right? You're right in a w...
[11:24:15] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) The transfer completed successfully. ` 2023-10-24 10:06:09 INFO: About to transfer /srv/backups/snapshots/latest/snapshot.s2.2023-10-23--01-34-1...
[11:27:13] (03CR) 10Peter Fischer: [C: 03+1] cirrussearch/update_pipeline/update remove required fields [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/967478 (owner: 10Ebernhardson)
[11:39:15] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) Executed: ` chown mysql.mysql /srv/sqldata.s2 systemctl start mariadb:s2 ` Obtained the following GTID position. ` root@dbstore1007:/srv/sqldat...
[11:46:19] 10Quarry: Quarry not restarting off main branch - https://phabricator.wikimedia.org/T349603 (10rook)
[11:48:48] 10Quarry: Quarry not restarting off main branch - https://phabricator.wikimedia.org/T349603 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/30
[11:49:37] 10Quarry: Remove quarry.wsgi on move to k8s - https://phabricator.wikimedia.org/T349605 (10rook)
[11:50:06] 10Quarry: Move quarry to magnum - https://phabricator.wikimedia.org/T349029 (10rook)
[11:50:09] 10Quarry: Remove quarry.wsgi on move to k8s - https://phabricator.wikimedia.org/T349605 (10rook)
[11:50:36] 10Quarry: Quarry not restarting off main branch - https://phabricator.wikimedia.org/T349603 (10rook) T349605 created and linked to T349029
[11:52:56] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) Restoring the user backup didn't go as well as I had hoped. ` root@dbstore1007:/srv/sqldata.s2# pt-show-grants -S /run/mysqld/mysqld.s2.sock --...
[11:55:17] 10Quarry: Quarry not restarting off main branch - https://phabricator.wikimedia.org/T349603 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/30
[11:59:54] 10Quarry: Quarry not restarting off main branch - https://phabricator.wikimedia.org/T349603 (10rook) 05Open→03Resolved
[12:03:58] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10Marostegui) You probably need to create that user, pt-show-grants output doesn't give you the create user, so you'll need to do that. ` create user `maria...
[12:04:18] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) I have manually created the `mariadb.sys@localhost` user with: ` CREATE USER `mariadb.sys`@`localhost` ACCOUNT LOCK PASSWORD EXPIRE; FLUSH PRIV...
[12:09:12] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) >>! In T343109#9275694, @Marostegui wrote: > You probably need to create that user, pt-show-grants output doesn't give you the create user, so yo...
[12:11:18] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Data Products (Sprint 03)): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10WDoranWMF)
[12:17:38] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) Great! I can now build a conda environment for testing this. e.g. ` (base) btullis@an-test-client1002:~$ conda create -n spa...
[12:17:46] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10WDoranWMF)
[12:18:50] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10WDoranWMF)
[12:19:22] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10hnowlan) This change has been deployed, and 404 errors have greatly dropped off. Please update if you see any persi...
[12:28:57] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) Install wmfdata-python into my new environment. ` pip install git+https://github.com/wikimedia/wmfdata-python.git@v2.0.1 ` N...
[12:34:20] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10JAllemandou) I think it's worth adding a query-logger :) I think it's worth spending the...
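[Editor's note, not part of the log] T344910 above is about running version-specific YARN shuffle services side by side. As a minimal sketch of how a client session would opt into one of them: `spark.shuffle.service.name` and `spark.shuffle.service.port` are standard Spark options (the name override requires Spark >= 3.2), but the service name and port below are hypothetical placeholders for whatever the NodeManagers actually register.
```python
# Hypothetical PySpark session pinned to a version-specific external shuffle
# service. Only the spark.shuffle.service.{name,port} semantics are standard
# Spark; the values "spark_shuffle_3_3" and 7338 are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffler-smoke-test")
    .master("yarn")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    # Spark <= 3.1 can only use the default service name "spark_shuffle",
    # which is why parallel Spark versions need separately named shufflers.
    .config("spark.shuffle.service.name", "spark_shuffle_3_3")
    .config("spark.shuffle.service.port", "7338")
    .getOrCreate()
)

# Force a shuffle so the external shuffle service actually serves blocks.
df = spark.range(10_000_000)
df.groupBy((df["id"] % 100).alias("bucket")).count().show(5)
```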
[12:41:25] !log Drop wmf.referrer_daily hive table and data
[12:41:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:48:45] (03CR) 10Gmodena: [C: 03+2] cirrussearch/update_pipeline/update remove required fields [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/967478 (owner: 10Ebernhardson)
[12:50:57] 10Quarry, 10Patch-For-Review: Create minikube deploy for quarry - https://phabricator.wikimedia.org/T301469 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/28
[12:51:05] (03Merged) 10jenkins-bot: cirrussearch/update_pipeline/update remove required fields [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/967478 (owner: 10Ebernhardson)
[12:51:41] 10Quarry: Update helm for quarry on pr - https://phabricator.wikimedia.org/T349031 (10rook)
[12:51:43] 10Quarry, 10Patch-For-Review: Create minikube deploy for quarry - https://phabricator.wikimedia.org/T301469 (10rook) 05Open→03Resolved
[13:48:26] hi kevinbazira, quick note about https://wikitech.wikimedia.org/wiki/Envoy#Example_(calling_mw-api), we're currently trying to move requests from restbase to rest-gateway
[13:49:00] sorry for the change, I'm still getting up to speed with the details, but just fyi, this is going to change slightly in the next few days
[13:57:16] hi milimetric o/
[13:57:18] no problem. please let us know when the change has been made so that we can update the settings on our end. thank you.
[14:02:27] kevinbazira: I've updated the docs to reflect the newer rest-gateway way - just a different listener/port, URLs stay the same https://wikitech.wikimedia.org/wiki/Envoy#Example_(calling_mw-api)
[14:03:20] thanks hnowlan!
[14:03:54] lemme know if you have any issues
[14:04:21] sure sure
[14:12:26] milimetric: o/ we hit the AQS API directly (via the AQS LVS VIP), bypassing restbase. For internal calls, do you prefer clients to use rest-gateway? If not we are set :)
[14:12:31] cc: hnowlan: --^
[14:16:33] as I understand it, we're trying to get any internal stuff off of restbase and onto rest-gateway, as services are available via that new route. What's the AQS LVS VIP? Like the public endpoint? That's fine, it'll be routed properly to rest-gateway behind the scenes. The idea is that we're looking to sunset AQS 1.0 as soon as possible
[14:16:35] elukey: which endpoints do you use it for? everything but edits, bytes_difference and edited_pages are migrated to the rest-gateway
[14:17:04] (and those will be migrated imminently)
[14:17:06] it'll respond internally for all requests though
[14:17:07] hnowlan: aqs.discovery.wmnet basically
[14:17:18] via the aqs mesh endpoint (k8s)
[14:17:45] if you add the listener for rest-gateway and change the mesh port to rest-gateway you should get identical behaviours
[14:17:47] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:17:51] (although it's backed by different services)
[14:17:54] no rush right now though
[14:18:32] hnowlan: the apis are different, but we can surely adjust. I thought that AQS discovery was preferred vs using rest-gateway (more for external users)
[14:19:51] elukey: which apis are different? rest-gateway *should* handle identical paths
[14:19:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:19:59] elukey: aiui ultimately all requests will go to rest-gateway and we'll deprecate aqs1 entirely - we don't have a timeline for that though
[14:20:14] The `aqs` envoy listener sends it via the LVS direct to the nodejs service that is co-located with the cassandra cluster in eqiad. This nodejs service is what is being deprecated.
[14:20:44] hnowlan: do you mean the same as restbase? Because the AQS nodejs has a different API, this is what I meant.. but if it will be deprecated it makes sense
[14:20:45] That's why the advice I gave you last week (to use the `aqs` listener) was outdated.
[14:21:00] elukey: ahhh
[14:21:11] elukey: yeah I meant restbase
[14:21:48] we can definitely move to rest-gateway, are there any examples of how to query the pageview API?
[14:21:54] to be clear - a proper plan and timeline around deprecation etc is yet to be established, it'd just be good to avoid *new* stuff being built
[14:21:56] ah sorry same as restbase, my bad
[14:21:57] okok
[14:22:13] (DiskSpace) firing: (3) Disk space an-worker1128:9100:/var/lib/hadoop/data/c 2.866% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:22:16] we can move to rest-gateway, will work with kevinbazira on this
[14:23:01] lemme know if you need any help or we can make any improvements!
[14:23:51] sure!
[14:26:17] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) I created the role manually: ` CREATE ROLE 'research_role'; ` Then the other two commands worked. ` GRANT `research_role` TO `research`@`10.%`;...
[14:28:55] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) 05Open→03Resolved
[14:31:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:47] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:11] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10jcrespo) As a followup, consider in the future documenting the special grants in puppet. We don't have a good solution to monitor and assign them, but a...
[14:54:53] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) >>! In T343109#9276578, @jcrespo wrote: > As a followup, consider in the future documenting the special grants in puppet. We don't have a good...
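[Editor's note, not part of the log] Since elukey asks above for an example of querying the pageview API: a minimal sketch against the public REST endpoint. The `/metrics/pageviews/per-article/...` route is the documented one and stays the same behind rest-gateway; for internal mesh calls only the base host/port (the local envoy listener) would differ, and that listener config is not shown here.
```python
# Minimal pageview API query (editor's sketch). The per-article route below
# is the documented public one; internally, only BASE would point at the
# local rest-gateway envoy listener instead.
import requests

BASE = "https://wikimedia.org/api/rest_v1"

def daily_pageviews(project: str, article: str, start: str, end: str) -> dict:
    """Return {timestamp: views} for an article; start/end as YYYYMMDD."""
    url = (
        f"{BASE}/metrics/pageviews/per-article/{project}"
        f"/all-access/all-agents/{article}/daily/{start}/{end}"
    )
    resp = requests.get(url, headers={"User-Agent": "pageview-example/0.1"})
    resp.raise_for_status()
    return {item["timestamp"]: item["views"] for item in resp.json()["items"]}

print(daily_pageviews("en.wikipedia", "Earth", "20231001", "20231007"))
```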
[15:00:55] (03CR) 10Sbisson: [C: 03+1] T343183 add "story share" event; add "user_is_anonymous" field and bump to version 1.1.0 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965846 (owner: 10Conniecc1)
[15:00:59] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) I have now added those missing role grants: ` GRANT USAGE ON *.* TO research_role; GRANT SELECT ON `wikishared`.* TO research_role; GRANT SELEC...
[15:02:47] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:02:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:47] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:16] 10Data-Engineering, 10serviceops, 10Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) I tried to profile nodejs' code via `--prof` and `--prof-process` to have a better view of the CPU usage. I tried first with `perf` but I didn't obtain useful info...
[15:29:49] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: [Event Platform] eventutilities-python should convert pyflink Instants to python DateTimes - https://phabricator.wikimedia.org/T349640 (10Ottomata)
[15:52:14] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics, 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Antoine_Quhen)
[15:54:26] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Metrics, 10Data Engineering and Event Platform Team (Sprint 4), 10Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10Antoine_Quhen)
[16:46:11] !log Deploying latest DAGs to analytics Airflow instance
[16:46:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:16:01] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) a:03rook
[17:18:18] (03CR) 10Nettrom: [C: 04-1] "Things largely look good, but there are a couple of outstanding issues I'd like to see resolved before we deploy this." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime)
[17:18:36] 10Data-Engineering-Planning, 10Event-Platform (Sprint 14 B), 10Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (10CodeReviewBot) otto opened https://gitlab.wikimedia.org/repos/...
[17:19:30] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10CodeReviewBot) otto opened https://gitlab.wikimedia...
[17:31:52] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] mw-page-content-change-enrich should not retry on badrevids if no replica lag - https://phabricator.wikimedia.org/T347884 (10Ottomata)
[17:32:47] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:32:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:43] 10Data-Engineering, 10EventStreams, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10Ottomata) I've upgraded eventgate instances in deployment-prep / beta. Things look...
[17:47:47] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:51:57] PROBLEM - Disk space on Hadoop worker on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:08:13] PROBLEM - Disk space on Hadoop worker on an-worker1146 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:22:13] (DiskSpace) firing: (3) Disk space an-worker1128:9100:/var/lib/hadoop/data/c 0.1228% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:28:21] RECOVERY - Disk space on Hadoop worker on an-worker1146 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:32:13] (DiskSpace) firing: (3) Disk space an-worker1128:9100:/var/lib/hadoop/data/c 0.01951% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:35:11] RECOVERY - Disk space on Hadoop worker on an-worker1128 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:37:13] (DiskSpace) firing: (3) Disk space an-worker1128:9100:/var/lib/hadoop/data/c 0.0002979% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:42:55] xcollazo: Heya - I think it would have been a good idea not to let airflow run the missed metadata_ingest_daily
[18:42:58] Schedule: @daily info Next Run: 2023-10-23, 00:00:00
[18:42:59] runs
[18:43:37] it ran a few instances of each task consecutively, while it could have been only one :)
[18:43:50] no big deal though, just a thought
[18:44:20] got it, makes sense. next time!
[18:44:27] xcollazo: cheers :)
[18:44:40] xcollazo: another question - are you rebuilding a new dumps table?
[18:45:04] I assume so, based on the latest PRs I have seen - probably with month-partitioning (yeah!)
[18:45:40] If so, I wonder if the issues we're seeing with disk-space usage could be related to a backfilling job
[18:45:40] that `run_hive_event_sanitized_ingestion` task from metadata_ingest_daily runs for almost two hours...
[18:46:08] xcollazo: millions of partitions :)
[18:46:13] > are you rebuilding a new dumps table?
[18:46:14] yes
[18:47:24] > If so, I wonder if the issues we're seeing with disk-space usage could be related to a backfilling job
[18:47:24] likely, but this one will be month-partitioned as you mention, so as soon as it clears, I will nuke the old wikitext_raw_rc1, which will give us back ~7M inodes.
[18:47:40] Yes, I know that :)
[18:48:02] wait...
[18:48:05] I'm trying to build an understanding of why we get so much data in such a small number of partitions
[18:48:06] You said space
[18:48:11] xcollazo: I think it's temporary data
[18:48:19] xcollazo: I said space :)
[18:48:21] indeed!
[18:48:30] For once, not file-count related :)
[18:49:10] ok yes this tracks
[18:49:17] I am running a giant shuffle right now
[18:49:19] https://yarn.wikimedia.org/proxy/application_1695896957545_164003/SQL/execution/?id=4
[18:50:25] ~36.2 TiB shuffle write
[18:50:30] easy stuff
[18:50:33] :D
[18:52:29] I wonder though, I've run this pipeline a couple of times before and don't recall space issues.
[18:52:54] xcollazo: I think the monthly partitioning could change the game
[18:53:24] by condensing some skewed data onto some workers
[18:54:44] example: for the stage that had 6 failed tasks, they were all on an-worker1146 - which has been showing space issues
[18:55:24] 3 of the failed tasks are due to java.io.IOException: No space left on device
[18:55:36] Nah, actually all of them are
[18:59:27] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (10Ottomata)
[19:01:06] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (10CodeReviewBot) ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/merge_requests/7 Update python to 3.10
[19:05:48] joal: Ok, things look better now. The data sort phase is now ongoing, and I expect it to take ~15 hours. Will monitor it.
[19:06:36] The graphs are fun xcollazo - they show shuffled data being handled as temporary files through capacity remaining: no change in hdfs-used-space, but a capacity drop. Now we'll see an HDFS usage bump as well as a capacity drop, and when the job finishes, the capacity will come back
[19:07:01] xcollazo: you remember: same hardware storage space for temporary data and HDFS data!
[19:11:13] joal: right right. Maybe if I keep breaking the cluster I'll eventually get discrete hard drive pools? :D
[19:11:27] xcollazo: Thanks for keeping an eye on it - I'll check tomorrow morning as well
[19:11:40] xcollazo: ;D
[19:11:51] Gone for tonight folks!
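[Editor's note, not part of the log] joal's point above is Airflow's catch-up behavior: with the default catchup=True, a daily DAG that was blocked re-runs every missed interval back to back once unblocked. A minimal sketch of the two usual mitigations follows; only the DAG id and task id echo names from the log, everything else is illustrative (assumed Airflow 2.x).
```python
# Sketch of avoiding consecutive catch-up runs in Airflow (assumed 2.x APIs).
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="metadata_ingest_daily",
    schedule_interval="@daily",
    start_date=datetime(2023, 10, 1),
    # catchup=False skips the backlog of missed intervals and runs only the
    # most recent one (the default, catchup=True, replays them all).
    catchup=False,
    # If the backlog is wanted, max_active_runs=1 at least serializes it.
    max_active_runs=1,
) as dag:
    EmptyOperator(task_id="run_hive_event_sanitized_ingestion")
```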
[19:14:14] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata)
[19:14:36] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) a:03Ottomata
[19:38:25] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata)
[19:39:44] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata)
[19:47:57] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata)
[19:48:55] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) We'd like to do this soon. I'm aiming for November 6th....
[19:50:44] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (10CodeReviewBot) ebernhardson opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/527 search: Update mj...
[19:55:46] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (10CodeReviewBot) ebernhardson merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/527 search: Update mj...
[19:57:37] Starting build #28 for job wikimedia-event-utilities-maven-release-docker
[20:00:46] Project wikimedia-event-utilities-maven-release-docker build #28: 09SUCCESS in 3 min 9 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/28/
[20:11:25] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) TL;DR: I was able to verify that Spark 3.1.2 and Spark 3.3.2 work as expected on the test cluster 🎉 . I ran out of time, bu...
[20:14:16] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10xcollazo) > cc @xcollazo for Data Products Sounds good to me.
[20:19:29] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, and 2 others: [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) Deployed the above chan...
[20:21:53] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10xcollazo) BTW, just fixed permissions on my Spark 3.3.2 assembly file to be readable by all, in case other folks want to repro the above n...
[20:23:44] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 4), 10Event-Platform, 10Patch-For-Review: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (10Ottomata) Draft announcement: https://docs.google.com/document/d/1Lw...
[20:31:28] 10Data-Platform-SRE, 10serviceops-radar, 10Epic: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking)
[20:32:05] 10Data-Platform-SRE, 10serviceops-radar, 10Epic: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking)
[20:32:57] 10Data-Platform-SRE, 10serviceops-radar, 10Epic: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking)
[20:43:55] 10Data-Platform-SRE, 10serviceops-radar, 10Epic: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking)
[20:47:12] 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar, 10Epic: [EPIC] Improve helm chart development experience - https://phabricator.wikimedia.org/T349666 (10bking)
[22:15:41] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Document process for getting JNL files/consider automation - https://phabricator.wikimedia.org/T347605 (10bking) 05Open→03Resolved p:05Triage→03Low >>! In T347605#9249978, @dr0ptp4kt wrote: > @bking just wanted to express my gratitude for the...
[22:37:28] (DiskSpace) firing: Disk space analytics1075:9100:/var/lib/hadoop/data/l 0.6665% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=analytics1075 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[22:54:27] PROBLEM - Disk space on Hadoop worker on analytics1075 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[23:40:21] RECOVERY - Disk space on Hadoop worker on analytics1075 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[23:42:13] (DiskSpace) resolved: Disk space analytics1075:9100:/var/lib/hadoop/data/l 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=analytics1075 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace