[00:01:35] (CR) Dzahn: "this part is different from toolhub (where you used the metafo resources) because this one is actually active-active, afaict" [dns] - https://gerrit.wikimedia.org/r/693968 (https://phabricator.wikimedia.org/T281538) (owner: Dzahn)
[00:08:02] RECOVERY - NFS Share Volume Space /srv/tools on labstore1004 is OK: DISK OK - free space: /srv/tools 2192511 MB (27% inode=76%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1
[00:57:24] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:04:15] (CR) Juan90264: Enable ArticlePlaceholder for kswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/735627 (https://phabricator.wikimedia.org/T294632) (owner: Juan90264)
[01:06:44] SRE, Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (Legoktm) a:Legoktm
[01:31:28] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:37:32] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.75 ms
[04:06:42] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:02:02] PROBLEM - snapshot of s2 in codfw on alert1001 is CRITICAL: snapshot for s2 at codfw taken more than 3 days ago: Most recent backup 2021-10-27 04:29:59 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:33:04] RECOVERY - snapshot of s2 in codfw on alert1001 is OK: Last snapshot for s2 at codfw (db2101.codfw.wmnet:3312) taken on 2021-10-30 03:49:32 (871 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[06:08:40] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:29:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
[06:34:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
[08:32:28] (CR) Ideophagous: "fix applied locally, issue with "git review" command" [mediawiki-config] - https://gerrit.wikimedia.org/r/731290 (owner: Ideophagous)
[08:35:06] SRE-swift-storage, User-Inductiveload: Unable to upload to Commons: uploadstash-file-not-found: Key "187kyl5ozj74.xtav8j.51508.djvu" not found in stash - https://phabricator.wikimedia.org/T278104 (AlexisJazz) Copy of my comment on T292954#7469361: Using bigChunkedUpload.js I uploaded a 921MB video: ` 0...
[09:05:50] (PS1) Ideophagous: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - https://gerrit.wikimedia.org/r/735712
[09:05:52] (PS1) Ideophagous: reapplied changes to arywiki ns after hard reset, Bug:T291737 [mediawiki-config] - https://gerrit.wikimedia.org/r/735713
[09:06:48] (PS2) Ideophagous: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - https://gerrit.wikimedia.org/r/735712
[09:07:18] (PS2) Ideophagous: reapplied changes to arywiki ns after hard reset, Bug:T291737 [mediawiki-config] - https://gerrit.wikimedia.org/r/735713
[09:09:24] (CR) Ideophagous: "see: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/735712" [mediawiki-config] - https://gerrit.wikimedia.org/r/731290 (owner: Ideophagous)
[09:11:48] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:12:48] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:31:38] (PS1) Urbanecm: Add edit-legal to editprotected grant [mediawiki-config] - https://gerrit.wikimedia.org/r/735716
[11:32:00] (PS2) Urbanecm: Add edit-legal to editprotected grant [mediawiki-config] - https://gerrit.wikimedia.org/r/735716
[11:32:28] (Abandoned) Urbanecm: [DNM] Run logo manager [mediawiki-config] - https://gerrit.wikimedia.org/r/731853 (owner: Urbanecm)
[13:16:02] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:29:03] !log Start server-side upload for 1 video file (T291418)
[13:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:11] T291418: Server side upload for PantheraLeo1359531 - https://phabricator.wikimedia.org/T291418
[14:48:29] ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (AntiCompositeNumber) I've seen this alert pop up a few times in the last few days...
[16:19:06] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:50:28] SRE-swift-storage, User-Inductiveload: Unable to upload to Commons: uploadstash-file-not-found: Key "187kyl5ozj74.xtav8j.51508.djvu" not found in stash - https://phabricator.wikimedia.org/T278104 (Koavf)
[17:29:45] AntiComposite: that ticket is just for codfw
[17:30:05] So 2xxx
[17:30:35] Spookreeeno, https://phabricator.wikimedia.org/T283582#7449164
[17:32:59] AntiComposite: oh ye, it's on the list but as timed out
[18:55:18] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:56:02] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[18:57:16] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 32, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:57:52] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[19:08:54] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:25:16] !log restarting blazegraph on wdqs1007 (jvm stuck)
[19:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:26] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 5.452e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[19:38:44] dcausse: worth acking or action needed? ^
[19:38:56] ^ expected wdqs1007 has been stuck for almost 17hours
[19:39:03] it's catching up
[19:39:26] the alert will resolve itself soon
[19:40:58] Ah ok
[20:09:50] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:31:18] RECOVERY - WDQS high update lag on wdqs1007 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.031e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:54:54] (CR) Urbanecm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/734568 (https://phabricator.wikimedia.org/T280307) (owner: Urbanecm)
[20:55:33] (CR) Urbanecm: "thanks for the +1, but I'd appreciate a merge, as I can't do that myself :). Thanks!" [puppet] - https://gerrit.wikimedia.org/r/734565 (https://phabricator.wikimedia.org/T278103) (owner: Urbanecm)
[21:57:25] (PS1) Zabe: maintain-views.yaml: remove dropped ep_* tables [puppet] - https://gerrit.wikimedia.org/r/735723 (https://phabricator.wikimedia.org/T216481)