[00:37:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:02] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:16:50] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:41:46] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org
[02:46:46] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org
[03:06:46] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org
[03:11:46] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org
[03:26:47] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org
[03:31:47] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org
[03:59:37] SRE, Diff-blog, Technical blog, Traffic, HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (Krinkle) I've added a row for this attribute to the table at . In doing so, I checked the current sta...
[07:37:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[07:42:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[07:47:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[07:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[07:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[08:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[08:07:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[08:12:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[09:52:22] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 386354776 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[10:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[10:47:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[10:52:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[10:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[11:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[11:07:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[11:12:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[11:33:38] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[11:35:22] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[12:00:51] SRE, Beta-Cluster-Infrastructure: deployment-logstash03: UDP listener died EADDRINUSE, logstash port conflict with rsyslogd - https://phabricator.wikimedia.org/T241481 (Majavah) Open→Declined This machine will be decommissioned shortly (T283013)
[12:03:12] Puppet, Beta-Cluster-Infrastructure: MIssing hiera settings for deployment-parsoid11.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T259533 (Majavah) Open→Declined This VM was removed at some point in the past.
[12:32:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org
[12:37:47] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org
[13:19:38] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[13:40:43] (PS1) Majavah: beta: remove deployment-deploy02 [puppet] - https://gerrit.wikimedia.org/r/700426 (https://phabricator.wikimedia.org/T278689)
[13:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[13:47:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[13:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[13:55:53] (PS4) Majavah: Add grafana-cloud.{wm.o,d.wmnet} to replace labs [dns] - https://gerrit.wikimedia.org/r/684099
[13:56:32] (PS3) Majavah: Add grafana-cloud.w.o as alias of grafana-labs [puppet] - https://gerrit.wikimedia.org/r/684100
[13:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[14:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[14:07:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[14:12:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[14:20:18] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:15:21] SRE, Beta-Cluster-Infrastructure, Traffic: Rename deployment-cache-(text|upload)0x to deployment-cp0x - https://phabricator.wikimedia.org/T280393 (Majavah) One more issue: given cloud vps does not have per-role hiera keys, we need to rely on instance prefix based hiera (which is different for text an...
[15:49:32] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:23] (CR) Paladox: "I didn't realise it was done in these modules either." [puppet] - https://gerrit.wikimedia.org/r/700331 (owner: Paladox)
[15:57:48] (PS1) Urbanecm: ptwikinews: Remove NS ID 102,103 [mediawiki-config] - https://gerrit.wikimedia.org/r/700428 (https://phabricator.wikimedia.org/T285163)
[16:00:24] (CR) Urbanecm: [C: -1] "otherwise lgtm" (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/700065 (https://phabricator.wikimedia.org/T284868) (owner: Zabe)
[16:01:45] (CR) RhinosF1: [C: +1] ptwikinews: Remove NS ID 102,103 [mediawiki-config] - https://gerrit.wikimedia.org/r/700428 (https://phabricator.wikimedia.org/T285163) (owner: Urbanecm)
[16:08:59] (PS1) Urbanecm: Change vi.wikisource logo to the same logo being used at en.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/700430 (https://phabricator.wikimedia.org/T284612)
[16:15:32] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[16:42:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[16:47:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[16:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[16:57:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[16:59:03] (PS3) Zabe: Rename Portal and Portal talk namespaces on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/700065 (https://phabricator.wikimedia.org/T284868)
[17:00:00] (CR) Zabe: Rename Portal and Portal talk namespaces on viwiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/700065 (https://phabricator.wikimedia.org/T284868) (owner: Zabe)
[17:02:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[17:07:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[17:12:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[17:20:02] (CR) Urbanecm: [C: +1] "LGTM now" [mediawiki-config] - https://gerrit.wikimedia.org/r/700065 (https://phabricator.wikimedia.org/T284868) (owner: Zabe)
[17:28:10] (CR) Urbanecm: [C: +1] beta: remove deployment-deploy02 [puppet] - https://gerrit.wikimedia.org/r/700426 (https://phabricator.wikimedia.org/T278689) (owner: Majavah)
[17:52:26] PROBLEM - SSH on mw1303.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:18:25] (PS1) Zabe: Add 'managechangetags' to the 'abusefilter' group on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/700433 (https://phabricator.wikimedia.org/T285167)
[18:36:04] SRE, Analytics-Radar, Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843 (Naijafile011) Some mad developers are doing this, I think they back-off already because when I clicked on the link, it shows the website has been suspend...
[18:53:02] RECOVERY - SSH on mw1303.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:59:15] (PS2) Zabe: Add 'managechangetags' to the 'abusefilter' group on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/700433 (https://phabricator.wikimedia.org/T285167)
[19:00:07] (PS3) Zabe: Add 'managechangetags' to the 'abusefilter' group on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/700433 (https://phabricator.wikimedia.org/T285167)
[19:24:25] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/700433 (https://phabricator.wikimedia.org/T285167) (owner: Zabe)
[19:25:48] (CR) Urbanecm: [C: +1] "code looks good, unsure whether the range is correct. Adding WMCS people to review this from their perspective." [mediawiki-config] - https://gerrit.wikimedia.org/r/700160 (owner: Majavah)
[19:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[19:42:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[19:47:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[19:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[19:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[20:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[20:07:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[20:12:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[20:37:10] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:37:54] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:37:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[22:42:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[22:47:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[22:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[22:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[23:01:12] processor is jinxed :P
[23:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[23:07:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[23:12:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org