[00:21:55] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:25:07] 10SRE, 10CommRel-Specialists-Support (Apr-Jun-2021), 10Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (10sgrabarczuk) [01:40:03] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:00:43] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:40:53] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:16:43] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:16:48] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:26:23] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:26:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:35:27] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [05:41:15] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210627T0700) [08:05:09] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:12:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:22:33] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:23:40] !log restart php-fpm on mw1401 [08:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:34:07] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:37:47] !log restart php-fpm on mw1268 mw1269 - low busy workers [08:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:01] err low idle workers of course [08:39:55] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [09:10:15] !log roll restart the remaining mw appservers to clear out apcu framentation (cumin command to follow) [09:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:26] !log cumin 'A:mw-eqiad and not P{mw13[67,54,55,72,33,50,51,73,52,49,53,65,71,84,68,70,66,91,89,97,95,99,85,93,87]*} and not P{mw14[09,03,11,07,05,01]*} and not P{mw12[61-69]*} and not P{mwdebug*}' '/usr/local/sbin/restart-php7.2-fpm' -b 1 -s 30 [09:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:08] all appservers restarted, latency looks really good now [09:25:22] * elukey afk [14:34:31] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:35:18] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:30:14] 10SRE, 10serviceops, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10CDanis) [16:30:21] 10SRE, 10serviceops, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10CDanis) p:05Triage→03High [16:30:34] elukey: thanks for all your whack-a-mole; I opened a tracking task ^ [16:30:34] (03PS1) 10Thiemo Kreuz (WMDE): Hotfix for broken "Extract show all to placeholder class" [extensions/VisualEditor] (wmf/1.37.0-wmf.11) - 10https://gerrit.wikimedia.org/r/701644 (https://phabricator.wikimedia.org/T284636) [17:51:14] 10SRE, 10SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (10Juan90264) Hi @Aklapper, could you tell me a Wikimedia employee or system administrator who has access to the Search Console? I try to talk to one, on the Talk page of this user who quotes m... [17:54:25] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:21:01] 10SRE, 10serviceops, 10Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (10elukey) Some notes from the restarts: * The appservers with a lot of busy workers eventually fa... [18:21:06] cdanis: thanks! I added some notes from the restarts [18:55:13] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:39:31] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 68 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:45:27] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 51 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:32:54] 10SRE, 10SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (10Aklapper) @Juan90264: Hi, please ask general Search Console questions on-wiki instead. This task is only about Edu's request and is closed. See also https://wikitech.wikimedia.org/wiki/Googl... [22:45:54] 10SRE, 10SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (10Juan90264) >>! Em T285091#7179757, @Aklapper escreveu: > @Juan90264: Hi, please ask general Search Console questions on-wiki instead. This task is only about Edu's request and is closed. > S... [22:58:10] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:39:29] 10SRE, 10SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (10Bugreporter) >>! In T285091#7179770, @Juan90264 wrote: >>>! Em T285091#7179757, @Aklapper escreveu: >> @Juan90264: Hi, please ask general Search Console questions on-wiki instead. This task... [23:45:01] 10SRE, 10SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (10Juan90264) >>! Em T285091#7179788, @Bugreporter escreveu: >>>! In T285091#7179770, @Juan90264 wrote: >>>>! Em T285091#7179757, @Aklapper escreveu: >>> @Juan90264: Hi, please ask general Sear... [23:52:50] (03PS2) 10Tim Starling: Update src/defines.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/701467 [23:58:56] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook