[00:02:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[01:20:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:21:44] <wikibugs>	 Data-Engineering-Planning: analytics-platform-eng admins should be able to restart airflow platform-eng systemctl services - https://phabricator.wikimedia.org/T313727 (Ottomata) Open→Resolved
[01:23:01] <wikibugs>	 Data-Engineering, Data Pipelines: Event Platform canary events job occasionally fails to retrieve stream config settings - https://phabricator.wikimedia.org/T326002 (Ottomata) Should be related to {T309717}.  We merged a logging change to help debug, but were delayed on deploying it because of the Spark...
[01:24:13] <wikibugs>	 Data-Engineering, Data Pipelines: Event Platform canary events job occasionally fails to retrieve stream config settings - https://phabricator.wikimedia.org/T326002 (Ottomata) But yeah, we should move this to airflow for sure.  It's similar to but simpler than Refine -> Airflow.
[02:45:30] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:19:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:03:06] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:34:50] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:04:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[06:06:34] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:59:26] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:31:06] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:45:10] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:16:56] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1080 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:27:30] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1080 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:30:13] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp5032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5032%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[09:35:13] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: varnishkafka on cp5032 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp5032%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:06:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:20:17] <btullis>	 !log restart hive-server2 and hive-metastore services on an-coord1002 prior to failover
[10:20:18] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:33:20] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:39:25] <btullis>	 !log fail over hive services to an-coord1002 with change to the DNS CNAME for analytics-hive.eqiad.wmnet
[10:39:26] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:45:58] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[11:01:43] <wikibugs>	 Data-Engineering-Radar, MW-on-K8s, serviceops, Patch-For-Review: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (Clement_Goubert) GeoIP data copied to all mw-on-k8s kubernetes hosts.
[11:05:19] <wikibugs>	 Data-Engineering, SRE, ops-eqiad: Check BBU on an-worker1084 - https://phabricator.wikimedia.org/T325984 (BTullis) I've downtimed the host, shut it down, and created a hardware ticket for @Cmjohnson to replace the RAID controller battery: T326127
[11:06:13] <wikibugs>	 Data-Engineering, SRE, ops-eqiad: Check BBU on an-worker1084 - https://phabricator.wikimedia.org/T325984 (BTullis) a:BTullis
[11:06:52] <wikibugs>	 Data-Engineering, SRE, ops-eqiad: Check BBU on an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T325984 (BTullis)
[11:08:14] <btullis>	 !log restarted hive-server2 and hive-metastore services on an-coord1001 after failover to standby server
[11:08:15] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:09:35] <wikibugs>	 Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) Given that the physical disk IDs are all sequential and well ordered according to the iDRAC card, the results from `lsscsi` in the Deb...
[12:36:01] <wikibugs>	 (PS2) Thiemo Kreuz (WMDE): Limit HTTP status code to 100…599 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736
[12:37:49] <wikibugs>	 (PS2) Thiemo Kreuz (WMDE): Limit HTTP status code to 100…599 [schemas/event/primary] - https://gerrit.wikimedia.org/r/836735
[12:39:11] <wikibugs>	 (PS1) Gerrit maintenance bot: Add gor.wiktionary to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/874831 (https://phabricator.wikimedia.org/T326139)
[12:49:47] <wikibugs>	 (CR) CI reject: [V: -1] Limit HTTP status code to 100…599 [schemas/event/primary] - https://gerrit.wikimedia.org/r/836735 (owner: Thiemo Kreuz (WMDE))
[12:49:55] <wikibugs>	 (CR) CI reject: [V: -1] Limit HTTP status code to 100…599 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736 (owner: Thiemo Kreuz (WMDE))
[13:09:39] <wikibugs>	 Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) Here is the dmesg output from the first two devices detected: ` [   69.924961] mpt3sas_cm0: port enable: SUCCESS [   69.925911] scsi 0...
[13:54:32] <wikibugs>	 Analytics-Kanban, Data-Engineering, Product-Analytics, Wmfdata-Python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (LSobanski) I don't believe there are any specific actions for #SRE here, untagging.
[14:34:39] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:37] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:07] <wikibugs>	 Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet wi...
[15:36:44] <wikibugs>	 Data-Engineering: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (mpopov)
[15:37:25] <wikibugs>	 Data-Engineering: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (mpopov)
[15:37:35] <wikibugs>	 Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye
[15:44:39] <wikibugs>	 Data-Engineering, CheckUser, MW-1.38-notes (1.38.0-wmf.26; 2022-03-14), MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), and 4 others: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 (Milimetric) >>! In T233004#8491382, @Zabe wrote: >>>! In T233004#7796847, @Mil...
[15:58:08] <wikibugs>	 Data-Engineering: Check home/HDFS leftovers of akhatun - https://phabricator.wikimedia.org/T326157 (MoritzMuehlenhoff)
[16:03:17] <bearloga>	 Is anyone available to quickly hook up a new person on my team with a Kerberos identity? T325857
[16:03:18] <stashbot>	 T325857: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857
[16:04:22] <wikibugs>	 Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (BTullis) It turns out that SR-IOV was disabled in BIOS. I tried enabling it, but it didn't make any difference, so I reverted to having it disa...
[16:15:36] <btullis>	 bearloga: Yes, I can have a look now.
[16:16:10] <bearloga>	 btullis: Thank you, Ben!
[16:16:20] <wikibugs>	 Data-Engineering: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (BTullis) a:BTullis
[16:22:48] <wikibugs>	 Data-Engineering, AQS 2.0 Roadmap, API Platform (API Platform Roadmap), Epic, and 2 others: AQS 2.0:Wikistats 2 service - https://phabricator.wikimedia.org/T288301 (Milimetric) +1 on T288301#8487410, @BPirkle
[16:26:40] <wikibugs>	 Data-Engineering: Requesting Kerberos identity for Hxi-ctr - https://phabricator.wikimedia.org/T325857 (BTullis) I believe that the approvals given in the linked ticket {T325004} are sufficient to permit me to create the Kerberos principal, so I'll go ahead and do it now.  ` btullis@krb1001:~$ sudo manage_pr...
[16:29:11] <wikibugs>	 (PS1) Ottomata: [POC] less-nested mediawiki/page/change schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/874900 (https://phabricator.wikimedia.org/T308017)
[16:29:50] <wikibugs>	 (CR) CI reject: [V: -1] [POC] less-nested mediawiki/page/change schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/874900 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[16:30:28] <wikibugs>	 Data-Engineering, Event-Platform Value Stream, Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) Did some work today seeing what the schema would look like flattened as discussed in [[ https://phabricat...
[16:56:57] <milimetric>	 mforns: come back to talk mw history?
[16:57:07] <mforns>	 yes!
[19:13:47] <wikibugs>	 Data-Engineering, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Create partman recipe for cephosd servers - https://phabricator.wikimedia.org/T324670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host cephosd1001.eqiad.wmnet with OS bullseye execute...
[21:54:58] <wikibugs>	 (CR) Mforns: "Left a comment, I think this might not work in some cases.." [analytics/refinery] - https://gerrit.wikimedia.org/r/870971 (https://phabricator.wikimedia.org/T323614) (owner: Xcollazo)