[00:09:54] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [00:11:34] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [00:11:48] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [00:13:30] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [00:39:16] !log doing a rolling restart of zotero in codfw to hopefully fix DNS ENOTFOUND issues [00:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:01] didn't work [00:47:07] filed https://phabricator.wikimedia.org/T286360 for the zotero/citoid DNS issues [00:47:21] !log zotero rolling restart didn't help, filed T286360 for DNS issues [00:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:28] T286360: zotero failing with getaddrinfo ENOTFOUND en.wikipedia.org and url-downloader.codfw.wikimedia.org, causing Citoid errors - https://phabricator.wikimedia.org/T286360 [01:43:18] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Citoid [01:47:12] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [01:54:50] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [01:56:46] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [03:36:32] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [03:38:26] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [04:01:08] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [04:03:04] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [04:48:51] (03PS1) 10Legoktm: planet: Add Damian's blog to en [puppet] - 10https://gerrit.wikimedia.org/r/703795 [04:55:44] (03CR) 10Legoktm: "@Damian: once this is merged, future posts from you will show up on https://en.planet.wikimedia.org/ :)" [puppet] - 10https://gerrit.wikimedia.org/r/703795 (owner: 10Legoktm) [05:09:38] (03CR) 10Damian: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/703795 (owner: 10Legoktm) [06:24:50] PROBLEM - Citoid LVS codfw on 
citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [06:26:44] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [06:42:46] (03PS1) 10Marostegui: db1180: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/703827 [06:50:04] (03CR) 10Marostegui: [C: 03+2] db1180: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/703827 (owner: 10Marostegui) [07:00:05] Deploy window No deploys all week! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210709T0700) [07:31:56] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [07:33:54] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [07:47:16] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [07:48:12] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [07:49:14] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [07:52:00] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:15:34] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10Marostegui) @cmooney I haven't been able to get ahold of you this week, so leaving the comment I left on IRC here: My preferred order for the sw... [08:25:42] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [08:29:35] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [09:03:32] <_joe_> I think zotero in codfw might need restarting [09:38:46] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [09:40:38] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [09:50:38] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [09:52:32] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [09:53:08] PROBLEM - SSH on mw1269.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:54:00] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) is 
CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [09:56:00] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:02:36] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:03:02] <_joe_> uhhh can someone take a look at cr2-esams? I'm looking at citoid [10:07:52] <_joe_> Is anyone else around apart from me? jbond maybe? [10:07:54] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [10:09:22] _joe_: looking [10:09:58] <_joe_> jbond: thanks [10:10:38] (03CR) 10WMDE-Fisch: [C: 03+1] "Works like a charm and does what we want." 
[extensions/VisualEditor] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703649 (owner: 10WMDE-Fisch) [10:11:45] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [10:12:38] (03CR) 10WMDE-Fisch: [C: 03+1] "FYI: I also verified, that the 1st result added here would also be the 1st result when completely disabling the search improvements and re" [extensions/VisualEditor] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703649 (owner: 10WMDE-Fisch) [10:13:57] <_joe_> !log recreated all pods for zotero in codfw [10:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:32] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [10:21:22] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [10:27:57] <_joe_> I'm not sure I've resolved the issues on citoid, because I see some changes in behaviour a few days ago that hint at it needing more CPU [10:28:05] <_joe_> or maybe more memory, still unsure [10:28:15] <_joe_> but let's see if it alerts again [10:29:02] ack, the HE issue is not much to worry about as it's just an IX peer. looks like they took the interface down so probably doing maintenance. just getting some info and will send them an email [10:29:18] <_joe_> jbond: ack thanks [10:29:29] <_joe_> I'll look at what proton is doing [10:29:36] ack [10:34:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) @Marostegui Ok thanks for the comments.
I've not been feeling so good so hadn't been online. Will review Monday against feedback from... [10:40:03] (03PS1) 10Giuseppe Lavagetto: mwdebug: tune up a bit the codfw deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/703834 (https://phabricator.wikimedia.org/T280497) [10:40:05] (03PS1) 10Giuseppe Lavagetto: mwdebug: add servergroup [deployment-charts] - 10https://gerrit.wikimedia.org/r/703835 (https://phabricator.wikimedia.org/T284418) [10:43:30] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:46:27] (03PS1) 10Giuseppe Lavagetto: Add configuration for running on kubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703836 (https://phabricator.wikimedia.org/T284418) [10:53:55] RECOVERY - SSH on mw1269.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:56:32] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 423, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:09:04] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [11:10:02] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:11:04] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [11:31:37] _joe_: just making sure you saw T286360 already [11:31:38] T286360: zotero failing with getaddrinfo ENOTFOUND en.wikipedia.org and url-downloader.codfw.wikimedia.org, causing 
Citoid errors - https://phabricator.wikimedia.org/T286360 [11:32:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org [11:32:54] <_joe_> majavah: saw it after I restarted the zotero pods [11:37:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10Marostegui) I hope you get better soon. I am off next week but someone from the team will contact you next week. From my point of view, I think... [11:38:03] <_joe_> and I have one suspicion around that [11:39:14] (03PS1) 10Marostegui: dbproxy1018: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/703839 [11:40:00] <_joe_> !log deleting coredns pod in codfw, potentially causing T286360 [11:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:08] T286360: zotero failing with getaddrinfo ENOTFOUND en.wikipedia.org and url-downloader.codfw.wikimedia.org, causing Citoid errors - https://phabricator.wikimedia.org/T286360 [11:44:16] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:49:20] (03CR) 10Marostegui: [C: 03+2] dbproxy1018: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/703839 (owner: 10Marostegui) [11:52:33] (03PS1) 10Marostegui: Revert "dbproxy1018: Depool clouddb1019" [puppet] - 10https://gerrit.wikimedia.org/r/703658 [11:52:47] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org [11:52:52] RECOVERY - MariaDB memory on clouddb1019 is OK: OK Memory 1% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:54:21] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1018: Depool clouddb1019" [puppet] - 10https://gerrit.wikimedia.org/r/703658 (owner: 10Marostegui) [11:56:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1118', diff 
saved to https://phabricator.wikimedia.org/P16809 and previous config saved to /var/cache/conftool/dbconfig/20210709-115609-marostegui.json [11:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:05] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 423, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:48:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: tune up a bit the codfw deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/703834 (https://phabricator.wikimedia.org/T280497) (owner: 10Giuseppe Lavagetto) [13:50:49] (03Merged) 10jenkins-bot: mwdebug: tune up a bit the codfw deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/703834 (https://phabricator.wikimedia.org/T280497) (owner: 10Giuseppe Lavagetto) [14:01:48] 10SRE, 10Wikimedia-Mailing-lists: Disable moderation mail notifications for messages sent to archived lists - https://phabricator.wikimedia.org/T286371 (10Ladsgroup) When did you get that email? I got it too so I looked up and saw that this mailing list doesn't have an owner, I added the blackhole mail address... [14:05:49] 10SRE, 10Wikimedia-Mailing-lists: Disable moderation mail notifications for messages sent to archived lists - https://phabricator.wikimedia.org/T286371 (10Ladsgroup) The underlying problem is much harder to fix, we didn't have standard concept of "archive/disabled" mailing list, mailman doesn't have such conce... 
[14:11:45] 10SRE, 10Wikimedia-Mailing-lists: Disable moderation mail notifications for messages sent to archived lists - https://phabricator.wikimedia.org/T286371 (10Ladsgroup) p:05Triage→03Medium [14:22:42] (03PS1) 10Giuseppe Lavagetto: mwdebug: Allow non-roots to perform a rolling restart [deployment-charts] - 10https://gerrit.wikimedia.org/r/703848 [14:37:44] (03PS2) 10Ottomata: Gobblinize test refine_event and drop_event jobs [puppet] - 10https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) [14:41:24] (03PS3) 10Ottomata: Gobblinize test refine_event and drop_event jobs [puppet] - 10https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) [14:41:51] (03CR) 10jerkins-bot: [V: 04-1] Gobblinize test refine_event and drop_event jobs [puppet] - 10https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:42:22] (03PS4) 10Ottomata: Gobblinize test refine_event and drop_event jobs [puppet] - 10https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) [14:42:42] (03CR) 10Joal: [C: 03+1] "It all makes sense ottomata (quick read)" [puppet] - 10https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:46:30] !log otto@deploy1002 Started deploy [analytics/refinery@cdb3fc5] (hadoop-test): Deploy for finalize event_default_test gobblin job in hadoop test - T271232 [14:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:38] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [14:47:03] (03PS5) 10Ottomata: Gobblinize test refine_event and drop_event jobs [puppet] - 10https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) [14:47:30] (03CR) 10jerkins-bot: [V: 04-1] Gobblinize test refine_event and drop_event jobs [puppet] - 10https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) (owner: 
10Ottomata) [14:48:31] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30158/console" [puppet] - 10https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [14:49:38] !log otto@deploy1002 Finished deploy [analytics/refinery@cdb3fc5] (hadoop-test): Deploy for finalize event_default_test gobblin job in hadoop test - T271232 (duration: 03m 08s) [14:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:10] (03PS6) 10Ottomata: Gobblinize test refine_event and drop_event jobs [puppet] - 10https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) [14:54:00] (03CR) 10Ottomata: [C: 03+2] Gobblinize test refine_event and drop_event jobs [puppet] - 10https://gerrit.wikimedia.org/r/703786 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:00:46] (03PS1) 10Ottomata: test/data_purge - fix checksum on refinery-drop-raw-event [puppet] - 10https://gerrit.wikimedia.org/r/703854 (https://phabricator.wikimedia.org/T271232) [15:01:03] (03CR) 10jerkins-bot: [V: 04-1] test/data_purge - fix checksum on refinery-drop-raw-event [puppet] - 10https://gerrit.wikimedia.org/r/703854 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:01:08] (03PS2) 10Ottomata: test/data_purge - fix checksum on refinery-drop-raw-event [puppet] - 10https://gerrit.wikimedia.org/r/703854 (https://phabricator.wikimedia.org/T271232) [15:03:14] (03CR) 10Ottomata: [C: 03+2] test/data_purge - fix checksum on refinery-drop-raw-event [puppet] - 10https://gerrit.wikimedia.org/r/703854 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:16:01] (03PS1) 10Ottomata: Remove already absented camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/703857 (https://phabricator.wikimedia.org/T271232) [15:22:32] (03CR) 10Ottomata: [C: 03+2] Remove already absented camus jobs [puppet] - 
10https://gerrit.wikimedia.org/r/703857 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [15:35:20] 10SRE, 10Traffic: LetsEncrypt cert expiration warning for some ncredir names - https://phabricator.wikimedia.org/T286377 (10RLazarus) [15:35:31] 10SRE, 10Traffic: LetsEncrypt cert expiration warning for some ncredir names - https://phabricator.wikimedia.org/T286377 (10RLazarus) p:05Triage→03High [16:39:08] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:39:18] 10SRE, 10Wikimedia-Mailing-lists: Disable moderation mail notifications for messages sent to archived lists - https://phabricator.wikimedia.org/T286371 (10Aklapper) >>! In T286371#7202520, @Ladsgroup wrote: > When did you get that email? 07 Jul 2021 01:39:08 +0000 [17:36:25] (03PS1) 10Ottomata: Add gobblin job event_default [puppet] - 10https://gerrit.wikimedia.org/r/703867 (https://phabricator.wikimedia.org/T271232) [17:39:52] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:41:25] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Disable moderation mail notifications for messages sent to archived lists - https://phabricator.wikimedia.org/T286371 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Yup, that's the one I got too and I fixed it immediately. I'm sorry for inconvenience a... 
[17:49:40] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:49:01] (03PS1) 10Legoktm: admin: Update legoktm's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/703868 [18:50:28] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:57:07] (03PS2) 10Legoktm: admin: Update legoktm's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/703868 [18:58:04] (03PS3) 10Legoktm: admin: Update legoktm's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/703868 [18:59:26] (03PS1) 10Ottomata: Finalze gobblin event migration [puppet] - 10https://gerrit.wikimedia.org/r/703869 (https://phabricator.wikimedia.org/T271232) [18:59:58] (03CR) 10jerkins-bot: [V: 04-1] Finalze gobblin event migration [puppet] - 10https://gerrit.wikimedia.org/r/703869 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [20:40:07] (03CR) 10Legoktm: [C: 03+2] admin: Update legoktm's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/703868 (owner: 10Legoktm) [22:36:31] !log running benchmarking scripts against shellbox [22:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:32] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - shellbox_4008: Servers kubernetes2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:09:28] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:10:02] oof [23:11:43] that was with concurrency 25 [23:12:53] (03PS1) 10Legoktm: shellbox: Bump to 8 replicas for benchmarking [deployment-charts] - 10https://gerrit.wikimedia.org/r/703878 [23:23:39] (03CR) 10Legoktm: [C: 03+2] shellbox: Bump to 8 replicas for benchmarking [deployment-charts] -
10https://gerrit.wikimedia.org/r/703878 (owner: 10Legoktm) [23:26:16] (03Merged) 10jenkins-bot: shellbox: Bump to 8 replicas for benchmarking [deployment-charts] - 10https://gerrit.wikimedia.org/r/703878 (owner: 10Legoktm) [23:27:24] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox' for release 'main' . [23:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:38] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox' for release 'main' . [23:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:30] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:46:49] (03PS1) 10RLazarus: dnsdisc: Improve "failed to check record" error message [software/spicerack] - 10https://gerrit.wikimedia.org/r/703879 (https://phabricator.wikimedia.org/T285706) [23:52:02] (03CR) 10jerkins-bot: [V: 04-1] dnsdisc: Improve "failed to check record" error message [software/spicerack] - 10https://gerrit.wikimedia.org/r/703879 (https://phabricator.wikimedia.org/T285706) (owner: 10RLazarus) [23:59:57] (03PS2) 10RLazarus: dnsdisc: Improve "failed to check record" error message [software/spicerack] - 10https://gerrit.wikimedia.org/r/703879 (https://phabricator.wikimedia.org/T285706)