[00:16:34] SRE, ops-eqiad: msw-c7-eqiad down - https://phabricator.wikimedia.org/T287180 (Jclark-ctr) updated netbox with correct ports
[00:16:44] SRE, ops-eqiad: msw-c7-eqiad down - https://phabricator.wikimedia.org/T287180 (Jclark-ctr) Open→Resolved
[00:17:37] SRE, ops-eqiad, DC-Ops: Relabel dbstore1004 to db1183 - https://phabricator.wikimedia.org/T286468 (Jclark-ctr) Open→Resolved relabeled host
[00:23:06] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (Jclark-ctr) will follow back up Monday to double check ports.
[00:43:59] (PS1) Legoktm: Increase lilypond version cache TTL to 1 hour [extensions/Score] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/707430
[01:41:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 74 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:44:21] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 67 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:47:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 45 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:52:31] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 73 probes of 629 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:56:13] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 51 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:58:27] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 41 probes of 629 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:00:41] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:09] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Beeswaxcandle) >>! In T257066#7233536, @Legoktm wrote: > OK, we're now running lilypond 2.22.0 which should...
[05:39:18] SRE, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (JoKalliauer) According to https://www.mediawiki.org/w/index.php?title=Topic:Wbuk08w1anifjyak#flow-post-wbunuhciop2u1yl8 and https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_po...
[06:39:11] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Ankry) >>! In T257066#7233536, @Legoktm wrote: > https://test.wikipedia.org/wiki/Score/plwikisource/3 isn't...
[09:07:59] (PS1) Jelto: move gitlab rails exporter to port 8083 [puppet] - https://gerrit.wikimedia.org/r/707859 (https://phabricator.wikimedia.org/T275170)
[09:16:48] (PS1) Jelto: prometheus::ops add job to scrape gitlab metrics [puppet] - https://gerrit.wikimedia.org/r/707860 (https://phabricator.wikimedia.org/T275170)
[09:18:19] (CR) Jelto: add gitlab2001 to host_vars and variables (1 comment) [gitlab-ansible] - https://gerrit.wikimedia.org/r/707350 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[09:29:12] (CR) Jelto: [C: -1] "Everything except rails exporter works and is reachable on gitlab1001 (fix for rails exporter in review https://gerrit.wikimedia.org/r/c/o" [puppet] - https://gerrit.wikimedia.org/r/707860 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[09:35:33] (CR) Jelto: [C: -1] "This change conflicts with https://gerrit.wikimedia.org/r/c/operations/puppet/+/707252 (also using mw1439-mw1442)" [puppet] - https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) (owner: Jelto)
[09:57:59] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 704 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:01:51] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 44 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:03:35] afaics 503s from shellbox
[10:04:45] yep seems so also from https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=parsoid&var-origin_instance=All&var-destination=All
[10:06:27] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2007.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.
[10:06:28] ubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:06:45] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2007.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.
[10:06:45] ubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:08:25] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:08:43] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:26:05] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2007.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2001.codfw.wmnet are marked down but po
[10:26:05] ps://wikitech.wikimedia.org/wiki/PyBal
[10:26:57] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 783 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:29:39] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2010.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2014.codfw.
[10:29:39] ubernetes2017.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:29:59] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:31:35] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:41] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 20 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:35:59] PROBLEM - LVS shellbox codfw port 4008/tcp - Shellbox- shellbox.svc.codfw.wmnet IPv4 on shellbox.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:36:32] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:37:16] are the parsoid fatals shellbox related? or something else?
[10:37:21] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.
[10:37:22] ubernetes2015.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:37:39] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2007.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.
[10:37:39] ubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:38:29] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 47 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:41:41] RECOVERY - LVS shellbox codfw port 4008/tcp - Shellbox- shellbox.svc.codfw.wmnet IPv4 on shellbox.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 358 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:42:21] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 308 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:43:12] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:44:19] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 10 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:49:01] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2007.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet are marked down but pooled https://wikitech.wikime
[10:49:02] wiki/PyBal
[10:50:57] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:51:17] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:04:15] !log [urbanecm@mwmaint2002 ~]$ mwscript extensions/Translate/scripts/moveTranslatablePage.php --wiki=commonswiki --reason='OTRS -> VRTS renaming process; see [[Phab:T280392]] and [[Phab:T280397]]' --move-subpages 'Commons:OTRS' 'Commons:Volunteer Response Team' 'Martin Urbanec' # T287321
[11:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:25] T287321: Move translatable page OTRS on Commons - https://phabricator.wikimedia.org/T287321
[11:04:25] T280392: Migrate Wikimedia away from OTRS software and branding - https://phabricator.wikimedia.org/T280392
[11:04:25] T280397: Replace OTRS text on Commons - https://phabricator.wikimedia.org/T280397
[13:46:52] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[14:21:52] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[14:41:48] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[14:46:48] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[15:31:19] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 2791 MB (3% inode=84%): /tmp 2791 MB (3% inode=84%): /var/tmp 2791 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[22:57:27] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state