[00:01:10] Quarry, Documentation, User-EpicPupper: Improve Superset documentation - https://phabricator.wikimedia.org/T337342 (EpicPupper) a:EpicPupper
[02:50:46] (CR) Milimetric: [C: +2] Add iceberg version of referrer_daily table. [analytics/refinery] - https://gerrit.wikimedia.org/r/917404 (https://phabricator.wikimedia.org/T335305) (owner: Xcollazo)
[02:54:17] (CR) Milimetric: [V: +2 C: +2] "sorry I was so late in getting back to this! We should probably have some kind of alert if anyone is waiting on code review for more than" [analytics/refinery] - https://gerrit.wikimedia.org/r/895737 (owner: Nmaphophe)
[02:54:50] (CR) Milimetric: [V: +2 C: +2] GDI Equity Landscape Tables (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/895737 (owner: Nmaphophe)
[03:21:43] (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:36:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:51:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:57] (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:56:43] (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:26] Quarry, cloud-services-team (Kanban): quarry-nfs-1 went down; quarry is offline - https://phabricator.wikimedia.org/T302154 (Framawiki) >>! In T302154#7724373, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-cloud), href=https://sal.toolforge.org/log/m9muGH8B1jz_IcWu6pro} [2022-02-20...
[15:10:16] Quarry: Unable to login Quarry - https://phabricator.wikimedia.org/T337588 (Aklapper) Open→Invalid > cannot fully answer these questions at the official site If you use flaky internet access and third-party websites, then I am afraid we cannot help with third-party websites. In the future please AL...
[16:03:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[17:32:37] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Tacsipacsi) https://replag.toolforge.org/ shows similarly high replication lag on s3 as well, even though the task description states that it should be working as norma...
[17:56:43] (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:09:26] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) Correct, looks like to broke too yesterday. Thanks for the heads up
[18:10:18] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Pppery)
[18:41:07] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui)
[19:44:37] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) I am puzzled - these all broke again, and I don't understand why, they were just recloned and there were no crashed in between. db1154:s1 db1154:s5 db1155:...
[19:45:59] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui)
[19:58:48] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) I just did a data check between s1 master (running SBR) and db1196 (sanitarium master) for `user_properties` and the data is the same. So there's definitely...
[19:59:17] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Ladsgroup) >>! In T337446#8885162, @Marostegui wrote: > > @Ladsgroup is there any process going on for user_properties or something? I have seen most of the crashes re...
[20:01:33] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Ladsgroup) For shutting down sanitarium masters, or any replicas, the script first stops replication so it shouldn't cause data corruption in case of uncommitted transa...
[20:05:02] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Ladsgroup) We can take a look at the binlogs of other production replicas to see if there terrible things happening that SBR safely ignores. We have the transactions th...
[20:10:30] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) If there would be any unsafe statement they'd have shown up in the data check I just did (I guess)
[20:17:02] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Ladsgroup) hmm, what I'm seeing, all of these are happening on different tables, like the s3 one is one templatelinks in urwiki but they have one thing in common: They...
[21:53:13] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Ladsgroup) As more of a test, I think it'll break again soon but to understand. The transaction that broke s3 is this: ` #230525 6:07:40 server id *redacted* end_log_...
[21:56:43] (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:05:22] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 6 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Zabe)
[22:06:01] Analytics-Radar, Data-Engineering-Icebox, Data-Persistence (work done), Platform Engineering Roadmap Decision Making, and 6 others: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Zabe)
[22:19:57] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Ladsgroup) I think something is replaying transactions twice sometimes (and probably in 10.4.29). e.g. for the s1 broken replication, in db1154 it says it can't delete...