[00:16:34] SRE, ops-eqiad: msw-c7-eqiad down - https://phabricator.wikimedia.org/T287180 (Jclark-ctr) updated netbox with correct ports
[00:16:44] SRE, ops-eqiad: msw-c7-eqiad down - https://phabricator.wikimedia.org/T287180 (Jclark-ctr) Open→Resolved
[00:17:37] SRE, ops-eqiad, DC-Ops: Relabel dbstore1004 to db1183 - https://phabricator.wikimedia.org/T286468 (Jclark-ctr) Open→Resolved relabeled host
[00:23:06] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (Jclark-ctr) will follow back up Monday to double check ports.
[00:43:59] (PS1) Legoktm: Increase lilypond version cache TTL to 1 hour [extensions/Score] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/707430
[01:41:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 74 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:44:21] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 67 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:47:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 45 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:52:31] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 73 probes of 629 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:56:13] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 51 probes of 623 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:58:27] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 41 probes of 629 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:00:41] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:09] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Beeswaxcandle) >>! In T257066#7233536, @Legoktm wrote: > OK, we're now running lilypond 2.22.0 which should...
[05:39:18] SRE, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (JoKalliauer) According to https://www.mediawiki.org/w/index.php?title=Topic:Wbuk08w1anifjyak#flow-post-wbunuhciop2u1yl8 and https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_po...
[06:39:11] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Ankry) >>! In T257066#7233536, @Legoktm wrote: > https://test.wikipedia.org/wiki/Score/plwikisource/3 isn't...
[09:07:59] (PS1) Jelto: move gitlab rails exporter to port 8083 [puppet] - https://gerrit.wikimedia.org/r/707859 (https://phabricator.wikimedia.org/T275170)
[09:16:48] (PS1) Jelto: prometheus::ops add job to scrape gitlab metrics [puppet] - https://gerrit.wikimedia.org/r/707860 (https://phabricator.wikimedia.org/T275170)
[09:18:19] (CR) Jelto: add gitlab2001 to host_vars and variables (1 comment) [gitlab-ansible] - https://gerrit.wikimedia.org/r/707350 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[09:29:12] (CR) Jelto: [C: -1] "Everything except rails exporter works and is reachable on gitlab1001 (fix for rails exporter in review https://gerrit.wikimedia.org/r/c/o" [puppet] - https://gerrit.wikimedia.org/r/707860 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[09:35:33] (CR) Jelto: [C: -1] "This change conflicts with https://gerrit.wikimedia.org/r/c/operations/puppet/+/707252 (also using mw1439-mw1442)" [puppet] - https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) (owner: Jelto)
[09:57:59] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 704 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:01:51] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 44 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:03:35] afaics 503s from shellbox
[10:04:45] yep seems so also from https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=parsoid&var-origin_instance=All&var-destination=All
[10:06:27] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2007.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.
[10:06:28] ubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:06:45] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2007.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2005.codfw.
[10:06:45] ubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:08:25] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:08:43] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:26:05] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2007.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2001.codfw.wmnet are marked down but po
[10:26:05] ps://wikitech.wikimedia.org/wiki/PyBal
[10:26:57] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 783 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:29:39] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2010.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2014.codfw.
[10:29:39] ubernetes2017.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:29:59] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:31:35] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:41] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 20 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:35:59] PROBLEM - LVS shellbox codfw port 4008/tcp - Shellbox- shellbox.svc.codfw.wmnet IPv4 on shellbox.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:36:32] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:37:16] are the parsoid fatals shellbox related? or something else?
[10:37:21] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.
[10:37:22] ubernetes2015.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:37:39] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2007.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.
[10:37:39] ubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:38:29] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 47 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:41:41] RECOVERY - LVS shellbox codfw port 4008/tcp - Shellbox- shellbox.svc.codfw.wmnet IPv4 on shellbox.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 358 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:42:21] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 308 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:43:12] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:44:19] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 10 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:49:01] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2007.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet are marked down but pooled https://wikitech.wikime
[10:49:02] wiki/PyBal
[10:50:57] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:51:17] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:04:15] !log [urbanecm@mwmaint2002 ~]$ mwscript extensions/Translate/scripts/moveTranslatablePage.php --wiki=commonswiki --reason='OTRS -> VRTS renaming process; see [[Phab:T280392]] and [[Phab:T280397]]' --move-subpages 'Commons:OTRS' 'Commons:Volunteer Response Team' 'Martin Urbanec' # T287321
[11:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:25] T287321: Move translatable page OTRS on Commons - https://phabricator.wikimedia.org/T287321
[11:04:25] T280392: Migrate Wikimedia away from OTRS software and branding - https://phabricator.wikimedia.org/T280392
[11:04:25] T280397: Replace OTRS text on Commons - https://phabricator.wikimedia.org/T280397
[13:46:52] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[14:21:52] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[14:41:48] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[14:46:48] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[15:31:19] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 2791 MB (3% inode=84%): /tmp 2791 MB (3% inode=84%): /var/tmp 2791 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[22:57:27] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state