[03:24:46] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:28:06] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:43:55] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:26:03] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:08:17] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:38:02] Data-Platform-SRE: Disk failure on an-worker1110 - https://phabricator.wikimedia.org/T336929 (ops-monitoring-bot) Host rebooted by btullis@cumin1001 with reason: Rebooting to troubleshoot errors with hard drive
[08:52:00] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (jnuche) @Ottomata @gmodena the doc publishing for this project is [[
https://...
[08:57:59] PROBLEM - SSH on an-worker1110 is CRITICAL: connect to address 10.64.36.142 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:59:14] ^ This failure of an-worker1110 to boot is me. I was trying to reboot it to get a disk back, before putting in a ticket to replace it. Looks like the boot is held up by the failure. I will ack the alert and check it out.
[09:11:29] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:35:53] PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:25] RECOVERY - SSH on an-worker1110 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:37:27] RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:43:05] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:53:37] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:16:22] !log restart turnilo to pick up config changes - T340097
[10:16:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:16:25] T340097: Webrequests live data shows traffic without TLS on varnish for upload.w.o -
https://phabricator.wikimedia.org/T340097
[10:19:55] Data-Platform-SRE: Disk failure on an-worker1110 - https://phabricator.wikimedia.org/T336929 (BTullis) I have fixed the errors with the other drive on this host. It didn't boot, so I had to comment out `/var/lib/hadoop/data/d` from `/etc/fstab` and reboot again. I then checked the state of the physical disks...
[10:20:19] Data-Platform-SRE: Disk failure on an-worker1110 - https://phabricator.wikimedia.org/T336929 (ops-monitoring-bot) Host rebooted by btullis@cumin1001 with reason: Rebooting to troubleshoot errors with hard drive
[10:20:42] !log reboot an-worker1110 after initializing a second replacement drive for T336929
[10:20:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:20:46] T336929: Disk failure on an-worker1110 - https://phabricator.wikimedia.org/T336929
[10:22:14] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/merge_requests/...
[10:26:19] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (BTullis) I believe that this is now complete. I've merged the build script into our branch and created som...
[10:30:26] Data-Engineering, Data-Platform-SRE: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (Stevemunene) stat1009 is experiencing a jupyterhub error whenever a single user tries to create a server. Created an SSH tunnel with: `ssh -N stat1009.eqiad.wmnet -L 8880:127.0.0.1:8880` And logg...
[11:10:40] (CR) Joal: "I have not checked the code in detail, it looks good overall to me.
I added some comments about the endpoint definition, and we miss the new" [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[11:28:23] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:49:09] (CR) Milimetric: [V: +2 C: +2] Create new knowledge-gaps endpoint (2 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/915678 (https://phabricator.wikimedia.org/T337059) (owner: Nick Ifeajika)
[11:49:27] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:11:04] (PS4) Aqu: Use canonical_data countries maintained by analytics-product [analytics/refinery] - https://gerrit.wikimedia.org/r/929723 (https://phabricator.wikimedia.org/T338033)
[12:29:32] (CR) Aqu: "Thanks, Joal. I've switched to the anti-joins syntax with a broadcast hint."
[analytics/refinery] - https://gerrit.wikimedia.org/r/929723 (https://phabricator.wikimedia.org/T338033) (owner: Aqu)
[12:31:37] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:40:38] !log move varnishkafka drmrs instances to pki - T337825
[12:40:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:40:40] T337825: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825
[12:52:39] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:56:07] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (CodeReviewBot) gmodena opened https://gitlab.wikimedia.org/repos/data-enginee...
[12:56:43] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (gmodena) thanks for heads up @jnuche .
[13:07:21] (Abandoned) Aqu: Debug missing dependencies problem toImmutableList [analytics/refinery/source] - https://gerrit.wikimedia.org/r/922063 (owner: Aqu)
[13:13:08] Awesome nuria! Way to stick with it, I love the post.
[13:24:15] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:33:32] varnishkafkas in Marseille migrated
[13:33:36] only esams left
[14:05:52] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:37:06] PROBLEM - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:41:29] Data-Platform-SRE, ops-eqiad: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (BTullis)
[14:42:41] Data-Platform-SRE, ops-eqiad: Replace RAID controller battery in an-worker1092 - https://phabricator.wikimedia.org/T340204 (BTullis)
[14:44:12] ACKNOWLEDGEMENT - MegaRAID on an-worker1092 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Requested a battery replacement in T340204 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:08:16] RECOVERY - MegaRAID on an-worker1092 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:27:04] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Decommission
analytics10[58-69] - https://phabricator.wikimedia.org/T317861 (elukey) @Stevemunene I still see the following from the hdfs topology: ` Rack: /eqiad/default/rack 10.64.21.113:50010 (analytics1061.eqiad.wmnet) 10.64.2...
[15:46:39] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (Ottomata) TY! Merged.
[15:46:43] Data-Engineering, Release-Engineering-Team, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineerin...
[16:39:49] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), and 2 others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (xcollazo) > I'm still not convin...
[16:40:22] btullis: Heya - are you nearby
[16:40:24] ?
[16:40:45] Yes, right here.
[16:41:06] btullis: do you wish to spend a minute on the airflow test-cluster issue you had?
[16:41:23] Can wait till Monday without problem
[16:41:26] :)
[16:42:31] Let's look at it now.
[16:42:38] ack! to the batcave
[17:09:13] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (BTullis) I've made more progress with this, but it's still not quite there. I had to make a couple of tweaks to the networkpolicy and populate some secrets. Now when the system update job runs it...
[17:10:53] Data-Engineering: Determine which team should own airflow1005/update contact info - https://phabricator.wikimedia.org/T334522 (bking) a:bking This appears to be resolved...closing.
[17:11:20] Data-Engineering: Determine which team should own airflow1005/update contact info - https://phabricator.wikimedia.org/T334522 (bking) Open→Resolved
[17:11:21] Just as an FYI, I just merged a minor puppet change ^^
[17:11:44] not sure it matters since we're merging SRE teams ;)
[17:12:53] inflatador: Good stuff. That would have been my suggestion too :-)
[17:13:29] Yeh...just cleaning out my sadly neglected gerrit queue
[18:39:30] (PS1) Neil Shah-Quinn (WMF): wikipediapreview_stats: Remove entirely [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/932458 (https://phabricator.wikimedia.org/T333218)
[18:39:34] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/932458 (https://phabricator.wikimedia.org/T333218) (owner: Neil Shah-Quinn (WMF))
[20:04:24] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), and 2 others: Adopt conventions for server receive and client/event timestamps in non analytics event schemas - https://phabricator.wikimedia.org/T267648 (JAllemandou) > I'm still not con...