[00:02:11] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:02:23] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:03:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:11:15] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 657.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:43] (Abandoned) Andrew Bogott: puppetserver: move duplicate directory definition [puppet] - https://gerrit.wikimedia.org/r/1121702 (owner: Andrew Bogott)
[00:25:56] (PS10) BCornwall: varnish: Upgrade VCL for Varnish 7.0+/modules 0.20 [puppet] - https://gerrit.wikimedia.org/r/1113592 (https://phabricator.wikimedia.org/T378737)
[00:32:17] SRE, LDAP, Patch-For-Review: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472#10572508 (Reedy) {ce50baf54b2a3eb40517381c6d199267a68a4d20} originally added the group... With no linked task. Unfortunately my memory of these thing...
[00:39:04] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1121717
[00:39:04] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1121717 (owner: TrainBranchBot)
[00:43:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:50:54] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1121717 (owner: TrainBranchBot)
[01:09:07] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1121718
[01:09:07] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1121718 (owner: TrainBranchBot)
[01:30:50] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1121718 (owner: TrainBranchBot)
[01:46:29] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/c131b7f9bc3319d653e893375d2082ab86c8d746bb0f664bb54ef1c1fd9ea209/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:06:29] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:10:17] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:17:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:24:37] (PS1) Ryan Kemper: wdqs / scholia: add routing for scholia host [puppet] - https://gerrit.wikimedia.org/r/1121726
[04:34:15] (PS5) SD hehua: Set Transwiki namespace on zhwikivoyage and zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055)
[04:44:09] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:44:21] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:44:59] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:45:11] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:11:45] (CR) Stang: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055) (owner: SD hehua)
[05:12:28] (CR) CI reject: [V:-1] Set Transwiki namespace on zhwikivoyage and zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055) (owner: SD hehua)
[05:14:49] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10572671 (phaultfinder)
[05:27:21] RESOLVED: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:32:21] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:34:34] (PS6) SD hehua: Set Transwiki namespace on zhwikivoyage and zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055)
[06:42:22] RESOLVED: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:43:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:47:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:49:33] RESOLVED: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:33] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:13:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:20:11] PROBLEM - snapshot of s7 in codfw on backupmon1001 is CRITICAL: Last snapshot for s7 at codfw (db2198) taken on 2025-02-22 06:20:54 is 832 GiB, but the previous one was 1159 GiB, a change of -28.2 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[07:25:28] (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055) (owner: SD hehua)
[08:07:22] RESOLVED: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:22:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:45:17] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:45:29] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:03:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:10:53] PROBLEM - SSH on ml-lab1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:13:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:21:49] RECOVERY - SSH on ml-lab1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:27:35] PROBLEM - Host ml-lab1002 is DOWN: PING CRITICAL - Packet loss = 100%
[10:40:05] RECOVERY - Host ml-lab1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[10:43:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:13:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:43:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:22:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:23:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:53:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:54:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3276 MB (3% inode=98%): /tmp 3276 MB (3% inode=98%): /var/tmp 3276 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[13:44:24] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10572967 (cmooney) DC-Ops folks Nokia reccomend trying to interrupt the grub bootloader for lsw5-a8-codfw to see if we can boo...
[14:01:17] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:23:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:29:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10573017 (phaultfinder)
[15:53:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:02:40] SRE, LDAP, Patch-For-Review: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472#10573065 (Urbanecm) Noting @jrbs was added to the group in T220860, in order to be able to run changePassword.php on Wikitech (which back then was lin...
[16:22:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10573095 (phaultfinder)
[17:03:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:33:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:18:27] (PS1) Aklapper: Disable weekend rule [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121743
[18:19:06] (CR) Aklapper: [V:+2 C:+2] Disable weekend rule [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121743 (owner: Aklapper)
[18:24:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10573120 (phaultfinder)
[18:33:01] (PS1) Aklapper: Add Logstash / phlog debug output when users get close to threshold [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121744
[18:33:57] (CR) Aklapper: [V:+2 C:+2] Add Logstash / phlog debug output when users get close to threshold [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121744 (owner: Aklapper)
[18:35:09] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:35:21] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:36:17] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:43:59] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:44:07] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:44:11] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:22:15] (PS1) Aklapper: Make antivandalism.edit-period-hours docs mention result limitations [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121745
[19:22:54] (CR) Aklapper: [V:+2 C:+2] Make antivandalism.edit-period-hours docs mention result limitations [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121745 (owner: Aklapper)
[19:45:00] (PS1) Aklapper: Replace DB transaction query with FeedQuery to estimate recent edits [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121746 (https://phabricator.wikimedia.org/T386704)
[19:48:23] (CR) Aklapper: "I'd not want to merge this for the next deployment but one after, as there have been many changes in this codebase, and this would be the " [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121746 (https://phabricator.wikimedia.org/T386704) (owner: Aklapper)
[19:53:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:22:22] FIRING: [6x] SystemdUnitFailed: user@0.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:23:19] SRE, Traffic, Wikidata, Wikidata Dev Team, Performance Issue: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10573215 (Bugreporter)
[20:23:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:34:49] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10573274 (phaultfinder)
[22:06:51] SRE, Prod-Kubernetes, Traffic, Wikidata, and 3 others: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10573276 (Addshore) Interesting, coming here from the merged in ticket, there are some logs that might be relevant, relating...
[22:37:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 802.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:42:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 806.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:43:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:54:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10573302 (phaultfinder)