[00:14:37] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:18:25] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:55] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:03:37] PROBLEM - Host cp5001 is DOWN: PING CRITICAL - Packet loss = 100% [01:07:13] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:12:01] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:49:00] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [03:01:35] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:13:01] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:47:21] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:55:53] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:24:33] (03PS1) 10Stang: viwikibooks: Change wgArticleCountMethod to 'any' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818599 (https://phabricator.wikimedia.org/T314239) [06:34:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:39:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220731T0700) [07:01:27] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:47:11] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:21:33] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:32:59] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:57:53] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:58:33] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP 
CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: OpenSent - Anycast, AS64605/IPv4: OpenSent - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:07:19] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:07:57] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:07:58] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:08:56] here [09:10:35] high error rate in india at NEL [09:11:16] seems to be related to upload [09:12:27] there was a spike in requests: https://grafana.wikimedia.org/goto/pE5PAck4z?orgId=1 [09:12:53] so not considering depooling it [09:12:58] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:12:58] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:14:55] I am monitoring librenms, some of the transports look a bit saturated 
[09:17:00] here as well [09:17:21] nothing ongoing for the moment right? [09:17:59] yeah, I am just monitoring [09:18:06] it seems back to normal slowly [09:19:53] it hit only one router, so I wonder if it was a specific peer only? [09:22:22] no, I was wrong, it was multiple transports/peers [09:25:21] I am checking netflow to see if there is any info [09:26:32] I have narrowed it down quite a lot, sending link and I think it will be clear from it :-D [09:30:11] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:04:29] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:15:55] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:50:15] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:01:41] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:18:36] 10SRE: postorius list overview should be sorted - https://phabricator.wikimedia.org/T314246 (10Krd) [11:27:13] 10SRE: MM3/postorius: takes too long to load - https://phabricator.wikimedia.org/T314247 (10Krd) [11:32:56] 10SRE: MM§/postorius: suppress owner notification while 
subscribing or unsubscribing users - https://phabricator.wikimedia.org/T314248 (10Krd) [11:33:04] 10SRE, 10Wikimedia-Mailing-lists: postorius list overview should be sorted - https://phabricator.wikimedia.org/T314246 (10Peachey88) [11:33:12] 10SRE: MM3/postorius: suppress owner notification while subscribing or unsubscribing users - https://phabricator.wikimedia.org/T314248 (10Krd) [11:33:17] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: takes too long to load - https://phabricator.wikimedia.org/T314247 (10Peachey88) [11:34:12] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: suppress owner notification while subscribing or unsubscribing users - https://phabricator.wikimedia.org/T314248 (10Peachey88) [11:35:59] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:37:21] 10SRE: MM3/postorius interface language - https://phabricator.wikimedia.org/T314249 (10Krd) [11:39:31] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius interface language - https://phabricator.wikimedia.org/T314249 (10Peachey88) [11:41:46] 10SRE: MM3/postorius: unclarity of the remove-all button - https://phabricator.wikimedia.org/T314250 (10Krd) [11:42:22] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: unclarity of the remove-all button - https://phabricator.wikimedia.org/T314250 (10Peachey88) [11:55:07] 10SRE: MM3/postorius: cannot use multiple accounts - https://phabricator.wikimedia.org/T314251 (10Krd) [11:58:10] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: cannot use multiple accounts - https://phabricator.wikimedia.org/T314251 (10Peachey88) [12:11:57] 10SRE: MM3/postorius: incomprehensible/overcomplicated unsubscription for end users - https://phabricator.wikimedia.org/T314252 
(10Krd) [12:12:41] 10SRE: MM3/postorius: incomprehensible/overcomplicated unsubscription for end users - https://phabricator.wikimedia.org/T314252 (10Krd) It should be possible for the user to unsubscribe from the list directly out of the received list e-mail. [12:45:23] 10SRE: MM3/postorius: incomprehensible/overcomplicated unsubscription for end users - https://phabricator.wikimedia.org/T314252 (10Peachey88) The user should be able to email the -leave address for the list. This used to be documented in the MM2 email footers. This may need to be manually re-added in the MM3 footers.... [12:46:28] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: incomprehensible/overcomplicated unsubscription for end users - https://phabricator.wikimedia.org/T314252 (10Peachey88) [13:07:23] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:41:33] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:00:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:27] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:10:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[14:12:39] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:15:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:17:29] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.798 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:20:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:51] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:25:19] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.677 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:25:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:29] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [14:35:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:36:18] (ProbeDown) firing: (2) Service 
jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:49] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:40:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:44:09] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:45:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:50:15] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:50:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:18] (ProbeDown) firing: 
(2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:28] I'd say there is only a bit of cloggage due to: https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/SENTHAMIZHSELVI_A&ilshowall=1 [15:01:58] <_joe_> jynus: nothing is really wrong, apart from this probe paging [15:02:19] <_joe_> we're processing a lot of videos [15:04:27] but that's the thing- http should -ideally- be responsive even if at saturation (tune concurrency) but not a big worry [15:05:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:10] <_joe_> these are jobrunners, they don't need to be responsive to whatever probe is being called [15:09:43] yeah, I mean it from an obs. point of view, not a functional/app layer point of view [15:09:55] <_joe_> jobs are being processed [15:10:02] <_joe_> videos are being transcoded [15:10:11] <_joe_> wait times went up for videos, now going down [15:10:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:24] <_joe_> from an obs point of view, the jobqueue is working perfectly [15:10:28] the other thing I miss is obs. on graphs, I had to go to logs on servers to debug [15:10:39] <_joe_> jynus: wdym? 
[15:10:47] <_joe_> jobrunners have a flurry of obs data [15:11:35] <_joe_> red dashboard: https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1; jobqueue details: https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 [15:11:46] <_joe_> there's more, but those two have most of the info you need [15:12:15] the app servers yes, the app, no so much IMHO [15:12:49] <_joe_> if we want the alert to stop firing, we need to depool completely the servers that serve videoscaling from the jobrunner pool [15:12:54] <_joe_> but... I don't think it's needed [15:13:14] <_joe_> What I think is happening here is a mismatch between the probe on prometheus and the probe pybal uses [15:13:24] <_joe_> probably a shorter timeout in the prometheus probe [15:13:41] <_joe_> so it's sending requests to servers that are responding a bit slow [15:13:46] <_joe_> but not excluded from pybal [15:14:35] <_joe_> anyways, things to check for tomorrow [15:15:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:06:01] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:37] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:38:23] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:44:47] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:02:31] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:08:43] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:10:53] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:15:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48390 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:15:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.304 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:30:33] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:39:45] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:04:02] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) Rolling this out to the high-traffic wikis will be a little bit tricky. When we turn it on, we can expect the cache hit rate to go... 
[18:11:31] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (10ssingh) [18:12:43] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5001.eqsin.wmnet,service=ats-be [18:12:43] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5001.eqsin.wmnet,service=varnish-fe [18:12:43] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5001.eqsin.wmnet,service=ats-tls [18:13:48] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256 [18:13:52] T314256: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 [18:14:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256 [18:19:47] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=cp5001.eqsin.wmnet [18:21:53] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5001 memory errors on DIMM A2 - https://phabricator.wikimedia.org/T314256 (10Vgutierrez) p:05Triage→03Medium I've set it as inactive rather than just depool it to let pybal ignore it regarding depooling threshold [18:27:49] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:49:39] (03PS1) 10Stang: newiki: Update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818614 (https://phabricator.wikimedia.org/T311700) [19:02:13] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:13:52] 10SRE, 10Domains, 10Traffic, 10SecTeam-Processed, 10Security: domain name Wikkipedia.be - https://phabricator.wikimedia.org/T313823 (10RLazarus) [20:13:59] 10SRE, 10Domains, 10Traffic, 10SecTeam-Processed, 10Security: domain name Wikkipedia.be - https://phabricator.wikimedia.org/T313823 (10RLazarus) Yep. [20:45:13] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:08:07] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:43:31] 10SRE-OnFire, 10observability, 10SRE Observability (FY2022/2023-Q1): Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10lmata) I've been thinking about this problem in recent days; long-term, we will most likel... [21:43:41] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:49:10] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: cannot use multiple accounts - https://phabricator.wikimedia.org/T314251 (10Legoktm) > I know that with MM3 you can use several lists with one account. For organizational reasons it does make sense for me to use different accounts for different kinds of lists, i... 
[21:49:34] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius interface language - https://phabricator.wikimedia.org/T314249 (10Legoktm) [21:50:09] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: mailman3: Let users choose the UI language - https://phabricator.wikimedia.org/T281747 (10Legoktm) [21:51:30] 10SRE, 10Wikimedia-Mailing-lists: MM3/postorius: incomprehensible/overcomplicated unsubscription for end users - https://phabricator.wikimedia.org/T314252 (10Legoktm) Yeah, the MM3 unsubscribe options are bad. FWIW they're documented at https://meta.wikimedia.org/wiki/Mailing_lists/Unsubscribing [22:05:21] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:20:00] (03PS1) 10Krinkle: tests: Add test case to confirm private wiki settings work correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818623 (https://phabricator.wikimedia.org/T169821) [22:20:49] (03CR) 10CI reject: [V: 04-1] tests: Add test case to confirm private wiki settings work correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818623 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:42:40] (03PS2) 10Krinkle: tests: Add test case to confirm private wiki settings work correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818623 (https://phabricator.wikimedia.org/T169821) [22:42:42] (03PS1) 10Krinkle: build: Attempt to work around unknown bug in CI job with PHPCS exclude [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818625 [22:46:24] (03CR) 10Krinkle: [C: 03+2] build: Attempt to work around unknown bug in CI job with PHPCS exclude (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818625 (owner: 10Krinkle) [22:46:30] (03CR) 
10Krinkle: [C: 03+2] tests: Add test case to confirm private wiki settings work correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818623 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:47:09] (03PS11) 10Krinkle: multiversion: Add dblists-index.php for fast runtime lookups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816029 (https://phabricator.wikimedia.org/T169821) [22:47:15] (03Merged) 10jenkins-bot: build: Attempt to work around unknown bug in CI job with PHPCS exclude [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818625 (owner: 10Krinkle) [22:47:17] (03Merged) 10jenkins-bot: tests: Add test case to confirm private wiki settings work correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818623 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:47:51] (03PS11) 10Krinkle: multiversion: Switch getTagsForWiki() to fast dblists-index.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816089 (https://phabricator.wikimedia.org/T169821) [22:48:26] (03CR) 10CI reject: [V: 04-1] multiversion: Switch getTagsForWiki() to fast dblists-index.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816089 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [22:49:16] (03PS12) 10Krinkle: multiversion: Switch getTagsForWiki() to fast dblists-index.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816089 (https://phabricator.wikimedia.org/T169821) [22:50:48] * Krinkle staging on mwdebug1002 for benchmarking [22:52:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:53:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:53:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:54:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:09:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [23:13:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:13:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:14:01] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:15:03] (03CR) 10Krinkle: [C: 03+2] multiversion: Add dblists-index.php for fast runtime lookups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816029 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [23:15:55] (03Merged) 10jenkins-bot: multiversion: Add dblists-index.php for fast runtime lookups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816029 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [23:17:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:20:00] !log krinkle@deploy1002 Synchronized dblists-index.php: I814ee93b5c (duration: 03m 20s) [23:22:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:25:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:25:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:29:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:46:31] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:48:19] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:48:29] 
PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:59:47] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring