[00:21:55] <icinga-wm>	 RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:25:07] <wikibugs>	 SRE, CommRel-Specialists-Support (Apr-Jun-2021), Datacenter-Switchover: CommRel support for June 2021 Switchover - https://phabricator.wikimedia.org/T281209 (sgrabarczuk)
[01:40:03] <icinga-wm>	 PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:00:43] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:40:53] <icinga-wm>	 RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:16:43] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:16:48] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:23] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:35:27] <icinga-wm>	 PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[05:41:15] <icinga-wm>	 RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210627T0700)
[08:05:09] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[08:12:53] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[08:22:33] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[08:23:40] <elukey>	 !log restart php-fpm on mw1401
[08:23:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:29] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[08:34:07] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[08:37:47] <elukey>	 !log restart php-fpm on mw1268 mw1269 - low busy workers
[08:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:01] <elukey>	 err low idle workers of course
[08:39:55] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[09:10:15] <elukey>	 !log roll restart the remaining mw appservers to clear out apcu fragmentation (cumin command to follow)
[09:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:26] <elukey>	 !log cumin 'A:mw-eqiad and not P{mw13[67,54,55,72,33,50,51,73,52,49,53,65,71,84,68,70,66,91,89,97,95,99,85,93,87]*} and not P{mw14[09,03,11,07,05,01]*} and not P{mw12[61-69]*} and not P{mwdebug*}' '/usr/local/sbin/restart-php7.2-fpm' -b 1 -s 30
[09:10:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:08] <elukey>	 all appservers restarted, latency looks really good now
[09:25:22] * elukey afk
[14:34:31] <icinga-wm>	 PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:35:18] <icinga-wm>	 RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:30:14] <wikibugs>	 SRE, serviceops, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (CDanis)
[16:30:21] <wikibugs>	 SRE, serviceops, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (CDanis) p:Triage→High
[16:30:34] <cdanis>	 elukey: thanks for all your whack-a-mole; I opened a tracking task ^
[16:30:34] <wikibugs>	 (PS1) Thiemo Kreuz (WMDE): Hotfix for broken "Extract show all to placeholder class" [extensions/VisualEditor] (wmf/1.37.0-wmf.11) - https://gerrit.wikimedia.org/r/701644 (https://phabricator.wikimedia.org/T284636)
[17:51:14] <wikibugs>	 SRE, SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (Juan90264) Hi @Aklapper, could you tell me a Wikimedia employee or system administrator who has access to the Search Console? I try to talk to one, on the Talk page of this user who quotes m...
[17:54:25] <icinga-wm>	 PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:21:01] <wikibugs>	 SRE, serviceops, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (elukey) Some notes from the restarts:  * The appservers with a lot of busy workers eventually fa...
[18:21:06] <elukey>	 cdanis: thanks! I added some notes from the restarts
[18:55:13] <icinga-wm>	 RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:39:31] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 68 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:45:27] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 51 probes of 627 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:32:54] <wikibugs>	 SRE, SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (Aklapper) @Juan90264: Hi, please ask general Search Console questions on-wiki instead. This task is only about Edu's request and is closed. See also https://wikitech.wikimedia.org/wiki/Googl...
[22:45:54] <wikibugs>	 SRE, SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (Juan90264) >>! In T285091#7179757, @Aklapper wrote: > @Juan90264: Hi, please ask general Search Console questions on-wiki instead. This task is only about Edu's request and is closed. > S...
[22:58:10] <icinga-wm>	 PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:39:29] <wikibugs>	 SRE, SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (Bugreporter) >>! In T285091#7179770, @Juan90264 wrote: >>>! In T285091#7179757, @Aklapper wrote: >> @Juan90264: Hi, please ask general Search Console questions on-wiki instead. This task...
[23:45:01] <wikibugs>	 SRE, SRE-Access-Requests: Access to ptwikinews Search Console for Edu - https://phabricator.wikimedia.org/T285091 (Juan90264) >>! In T285091#7179788, @Bugreporter wrote: >>>! In T285091#7179770, @Juan90264 wrote: >>>>! In T285091#7179757, @Aklapper wrote: >>> @Juan90264: Hi, please ask general Sear...
[23:52:50] <wikibugs>	 (PS2) Tim Starling: Update src/defines.php [mediawiki-config] - https://gerrit.wikimedia.org/r/701467
[23:58:56] <icinga-wm>	 RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook