[00:08:53] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:09:08] (03PS1) 10RLazarus: icinga: Use shlex to quote the command string for bash -c. [software/spicerack] - 10https://gerrit.wikimedia.org/r/712784 (https://phabricator.wikimedia.org/T288558)
[00:15:50] (03CR) 10jerkins-bot: [V: 04-1] icinga: Use shlex to quote the command string for bash -c. [software/spicerack] - 10https://gerrit.wikimedia.org/r/712784 (https://phabricator.wikimedia.org/T288558) (owner: 10RLazarus)
[00:16:42] demo of separate event note field: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210812T2300
[00:17:54] not sure how the new calendar entries are generated, is there a master template?
[00:23:19] tgr: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/release/+/master/make-deployment-calendar/deploymentcalendar/__init__.py combined with https://gerrit.wikimedia.org/g/mediawiki/tools/release/+/master/make-deployment-calendar/deployments-calendar.json is I think what you're looking for
[00:23:58] but Tyler will actually know, I'm just a bystander
[00:24:53] (03CR) 10Cwhite: [C: 03+1] "Looks like a good first iteration to me. Looking forward to trying it!" [puppet] - 10https://gerrit.wikimedia.org/r/711543 (owner: 10Filippo Giunchedi)
[00:26:45] (03PS2) 10RLazarus: icinga: Use shlex to quote the command string for bash -c. [software/spicerack] - 10https://gerrit.wikimedia.org/r/712784 (https://phabricator.wikimedia.org/T288558)
[00:40:44] (03CR) 10Cwhite: [C: 03+1] "I'm not sure what $el stands for, but they're all largely temporary variables. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/712098 (owner: 10Filippo Giunchedi)
[00:56:47] (03CR) 10A2093064: [C: 04-1] Add extendedconfirmed on zhwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712754 (https://phabricator.wikimedia.org/T287322) (owner: 10Zabe)
[01:02:10] !log running extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php for Growth wikis
[01:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:29] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/712099 (owner: 10Filippo Giunchedi)
[01:37:08] (03CR) 10Cwhite: [C: 03+1] pontoon: add lb module [puppet] - 10https://gerrit.wikimedia.org/r/712100 (owner: 10Filippo Giunchedi)
[01:53:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:57:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:32:03] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:19:45] hi, the spambots are pretty phenomenal, are we able to purge the captchas or whatever at the moent?
[03:19:55] moent
[03:20:19] moment
[05:07:06] (03PS1) 10Marostegui: mariadb: Move db1132 to m5. [puppet] - 10https://gerrit.wikimedia.org/r/712849 (https://phabricator.wikimedia.org/T288720)
[05:08:09] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:22:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1132.eqiad.wmnet with reason: REIMAGE
[05:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:26] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1132.eqiad.wmnet with reason: REIMAGE
[05:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:00] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10ayounsi) 05Resolved→03Open One report is alerting with: > asw-c1-eqiad connected console ports attached to unracked device asw-c1-eqiad: consol...
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210813T0700)
[07:38:02] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1132 to m5. [puppet] - 10https://gerrit.wikimedia.org/r/712849 (https://phabricator.wikimedia.org/T288720) (owner: 10Marostegui)
[07:39:38] (03CR) 10Filippo Giunchedi: pontoon: add service_names (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712098 (owner: 10Filippo Giunchedi)
[07:52:04] (03PS1) 10Filippo Giunchedi: pontoon: allow any IP range to access the frontend [puppet] - 10https://gerrit.wikimedia.org/r/712915
[07:52:33] (03CR) 10jerkins-bot: [V: 04-1] pontoon: allow any IP range to access the frontend [puppet] - 10https://gerrit.wikimedia.org/r/712915 (owner: 10Filippo Giunchedi)
[07:53:32] (03PS2) 10Filippo Giunchedi: pontoon: allow any IP range to access the frontend [puppet] - 10https://gerrit.wikimedia.org/r/712915
[07:55:39] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: allow any IP range to access the frontend [puppet] - 10https://gerrit.wikimedia.org/r/712915 (owner: 10Filippo Giunchedi)
[08:12:57] 10SRE, 10LDAP-Access-Requests: LDAP Access to nda user group for TAndic - https://phabricator.wikimedia.org/T288527 (10ema) p:05Triage→03Medium
[08:18:43] o/ Events aren't flowing into logstash-beta.wmcloud.org again (?). Is there a particular set of files to tidy up to get them flowing again?
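The spicerack change above ("icinga: Use shlex to quote the command string for bash -c", r712784) concerns safely embedding an arbitrary command string in a `bash -c` invocation that is itself passed through a remote shell. A minimal sketch of that quoting pattern follows; the helper name and the remote-execution framing are illustrative assumptions, not the actual spicerack code:

```python
import shlex

def wrap_for_remote_shell(command: str) -> str:
    """Build a `bash -c` command line for a remote executor that passes
    the whole string through a shell. shlex.quote() ensures the inner
    command arrives as a single bash -c argument, even if it contains
    spaces, semicolons, or quotes."""
    return f"bash -c {shlex.quote(command)}"
```

For example, `wrap_for_remote_shell("rm -rf /tmp/x; echo done")` yields `bash -c 'rm -rf /tmp/x; echo done'`, so the remote bash sees the semicolon as part of the quoted command rather than as a command separator in the outer shell.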
[08:21:40] majavah: ^
[08:21:59] there was a few puppet errors recently on deployment-prep, if that helps
[08:22:30] Puppet failure on deployment-logstash04.deployment-prep.eqiad1.wikimedia.cloud
[08:22:40] Puppet failure on deployment-kafka-jumbo-3.deployment-prep.eqiad1.wikimedia.cloud
[08:22:54] That very likely does have something to do with
[08:22:56] phuedx: probably logstash's own logs filling the disk on deployment-logstash*
[08:22:58] 10SRE, 10LDAP-Access-Requests: LDAP Access to nda user group for TAndic - https://phabricator.wikimedia.org/T288527 (10ema)
[08:23:07] and Puppet failure on deployment-eventgate-3.deployment-prep.eqiad.wmflabs
[08:23:10] so something in /var/log/logstash
[08:23:27] I imagine the others are too full to send email, if only 04 is sending them
[08:24:02] it doesn't say the exact error, but hopefully someone with access can check the logs
[08:24:24] I'm not able to look at the moment
[08:24:42] (03PS1) 10Filippo Giunchedi: facilities: remove test PDUs [puppet] - 10https://gerrit.wikimedia.org/r/712917 (https://phabricator.wikimedia.org/T287762)
[08:25:15] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled: thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:25:23] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10A189605) Apologies for the delay in responding. We are taking this up with our ISP as we do not see any traffic coming back from them when capturing...
[08:25:28] (nor am I really interested to, given the total lack of attention to deployment-prep maintenance by the foundation)
[08:26:59] godog: I already sent https://gerrit.wikimedia.org/r/c/711717
[08:27:17] I saw it alert yesterday
[08:27:36] RhinosF1: ah! thank you I missed it
[08:27:42] Np
[08:28:07] (03Abandoned) 10Filippo Giunchedi: facilities: remove test PDUs [puppet] - 10https://gerrit.wikimedia.org/r/712917 (https://phabricator.wikimedia.org/T287762) (owner: 10Filippo Giunchedi)
[08:28:32] (03CR) 10Filippo Giunchedi: [C: 03+2] remove test PDUs that are being shipped back [puppet] - 10https://gerrit.wikimedia.org/r/711717 (https://phabricator.wikimedia.org/T287762) (owner: 10RhinosF1)
[08:36:35] 10SRE, 10LDAP-Access-Requests: LDAP Access to nda user group for TAndic - https://phabricator.wikimedia.org/T288527 (10ema) Hi @tandic! >>! In T288527#7273178, @TAndic wrote: > Should I file a new request through the task template linked or stick with this one? I've updated the task description to use the t...
[08:37:26] PROBLEM - LVS thanos-query codfw port 443/tcp - Prometheus long-term storage- query service IPv4 #page on thanos-query.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.53 and port 443: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:37:42] uughhh sorry that's me
[08:37:52] phew
[08:37:54] I apologise
[08:37:56] it looked *scary*
[08:38:55] root partition is full on logstash0{5,6}.deployment-prep.eqiad1.wikimedia.cloud. /var/log/syslog and daemon.log are both large
[08:39:01] scary as a network issue
[08:39:09] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled: thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:39:38] phuedx: I think the best advice would be to archive them and see if saves space or whether they can wiped safely
[08:39:51] (Safely get rid of big logs not needed)
[08:40:19] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2002.codfw.wmnet with reason: REIMAGE
[08:40:19] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=thanos-fe2002.codfw.wmnet
[08:40:23] No one actually owns deployment prep though so no best person to ask
[08:40:24] RECOVERY - LVS thanos-query codfw port 443/tcp - Prometheus long-term storage- query service IPv4 #page on thanos-query.svc.codfw.wmnet is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Mon 21 Jul 2025 03:04:56 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:49] my apologies again for the mispage :(
[08:41:49] You made friday interesting. Let's just touch wood and hope that's it.
[08:42:01] can i opt out of Interesting Friday?
[08:42:04] ^
[08:42:34] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2002.codfw.wmnet with reason: REIMAGE
[08:42:35] Where do I sign up for Boring Friday?
[08:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:49] phuedx: never known that be possible
[08:42:54] :D
[08:43:14] kormat: you turn off your computer and phone and pretend that all communications have stopped
[08:49:03] (03PS1) 10Ema: admin: Add tandic to ldap_only_users for 'wmf' group access [puppet] - 10https://gerrit.wikimedia.org/r/712919 (https://phabricator.wikimedia.org/T288527)
[08:49:38] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10ema)
[08:53:59] (03CR) 10Ladsgroup: [C: 03+1] service: Enable paging for shellbox-constraints service [puppet] - 10https://gerrit.wikimedia.org/r/711737 (owner: 10Legoktm)
[09:03:42] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10TAndic) >>Side question: https://www.mediawiki.org/wiki/Product_Analytics/Superset_Access <- should our team consider this page as out of date or just irrelevant to WMF s...
[09:06:06] 10SRE, 10serviceops: mcrouter crashing on mwmaint2002 - https://phabricator.wikimedia.org/T288787 (10jijiki) mcrouter in mwmaint2002 is on version 0.37, and I found this: [[ https://github.com/wikimedia/operations-debs-mcrouter/blob/upstream/ProxyDestination.cpp#L341 | ProxyDestination.cpp ]]. There is no poin...
[09:10:25] (03PS1) 10Effie Mouzeli: hieradata: disable ssl listening on mcrouter except on mwmaint2002 [puppet] - 10https://gerrit.wikimedia.org/r/712920 (https://phabricator.wikimedia.org/T288787)
[09:12:11] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10RhinosF1) I changed the link on that section. The rest of the requesting access doesn't seem off.
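The deployment-prep triage above (root partition full on logstash0{5,6}, with /var/log/syslog and daemon.log both large) boils down to finding the biggest files under /var/log before deciding what can be archived or wiped. A quick sketch of that scan; the helper name is illustrative, not part of any existing tooling:

```python
import os

def largest_files(root: str, top: int = 10) -> list[tuple[int, str]]:
    """Walk `root` and return (size_bytes, path) pairs for the `top`
    biggest files, largest first -- a quick way to spot runaway logs
    such as /var/log/syslog or daemon.log filling a root partition."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # rotated/removed mid-scan or unreadable; skip
    return sorted(sizes, reverse=True)[:top]
```

For example, `largest_files("/var/log")` on an affected host would surface the offending logs before deciding whether to truncate them or archive them elsewhere first, as suggested above.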
[09:13:46] (03PS1) 10David Caro: wmcs.vps.puppet_alert: get the puppet files from config [puppet] - 10https://gerrit.wikimedia.org/r/712922 (https://phabricator.wikimedia.org/T288805)
[09:15:43] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thanos-fe2002.codfw.wmnet
[09:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:47] (03CR) 10Effie Mouzeli: "PCC is ok https://puppet-compiler.wmflabs.org/compiler1003/30560/" [puppet] - 10https://gerrit.wikimedia.org/r/712920 (https://phabricator.wikimedia.org/T288787) (owner: 10Effie Mouzeli)
[09:17:12] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thanos-fe2001.codfw.wmnet
[09:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:20] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=thanos-fe2003.codfw.wmnet
[09:17:24] (03PS2) 10Effie Mouzeli: hieradata: disable ssl listening on mcrouter except on mwmaint2002 [puppet] - 10https://gerrit.wikimedia.org/r/712920 (https://phabricator.wikimedia.org/T288787)
[09:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:18] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10cmooney) The best we can probably do is share some screenshots from our Netflow collectors, showing outbound traffic to your range. Please note our...
[09:27:32] 10SRE-swift-storage: Move swift crons to systemd timers - https://phabricator.wikimedia.org/T288806 (10fgiunchedi)
[09:30:55] 10SRE, 10serviceops, 10Patch-For-Review: mcrouter crashing on mwmaint2002 - https://phabricator.wikimedia.org/T288787 (10Dzahn) upgrading mwmaint2002 will happen on or after September 13th, the day of the DC switch.
[09:31:38] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10TAndic) Perfect, thank you! Would it be okay if I added a link directly to the Phabricator instructions (https://phabricator.wikimedia.org/project/profile/1564/) for mo...
[09:35:42] !log mw1444 - signed puppet cert, initial run (after hardware fix) T279309
[09:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:50] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2003.codfw.wmnet with reason: REIMAGE
[09:35:50] T279309: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309
[09:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:55] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10RhinosF1) It's a wiki, I don't see why not :) They're open for anyone to improve.
[09:38:21] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on thanos-fe2003.codfw.wmnet with reason: REIMAGE
[09:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:35] (03PS1) 10David Caro: wmcs.puppet_alert: allow disabling the puppet alerts [puppet] - 10https://gerrit.wikimedia.org/r/712923
[09:38:54] (03PS2) 10David Caro: wmcs.vps.puppet_alert: allow disabling the puppet alerts [puppet] - 10https://gerrit.wikimedia.org/r/712923
[09:42:23] !log mw1448, mw1449, mw1450 - powering on via mgmt - OS install, initial setup (T279309, T273915)
[09:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:32] T273915: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915
[09:42:32] T279309: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309
[09:49:44] (03PS1) 10Jcrespo: dbbackups: Remove s2 stretch codfw backup source, move s4, upgrade 2099 [puppet] - 10https://gerrit.wikimedia.org/r/712925 (https://phabricator.wikimedia.org/T287230)
[09:52:24] PROBLEM - PHP7 rendering on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:53:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw1444.eqiad.wmnet with reason: new setup
[09:53:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw1444.eqiad.wmnet with reason: new setup
[09:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:05] ACKNOWLEDGEMENT - PHP7 rendering on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T273915 https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:55:52] (03PS2) 10Jcrespo: dbbackups: Remove s2 stretch codfw backup source, move s4, upgrade 2099 [puppet] - 10https://gerrit.wikimedia.org/r/712925 (https://phabricator.wikimedia.org/T287230)
[09:55:54] (03PS1) 10Jcrespo: dbbackups: Reenable notifications on db2097, db2099 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/712926 (https://phabricator.wikimedia.org/T280979)
[10:04:54] (03PS1) 10Dzahn: DHCP: add MAC addresses for mw1448 through mw1456 [puppet] - 10https://gerrit.wikimedia.org/r/712928 (https://phabricator.wikimedia.org/T273915)
[10:07:45] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thanos-fe2003.codfw.wmnet
[10:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:48] (03CR) 10Dzahn: [C: 03+2] DHCP: add MAC addresses for mw1448 through mw1456 [puppet] - 10https://gerrit.wikimedia.org/r/712928 (https://phabricator.wikimedia.org/T273915) (owner: 10Dzahn)
[10:14:54] (03PS2) 10Dzahn: DHCP: add MAC addresses for mw1448 through mw1456 [puppet] - 10https://gerrit.wikimedia.org/r/712928 (https://phabricator.wikimedia.org/T273915)
[10:18:43] (03CR) 10Dzahn: "please add an expiry_date and expiry_contact here in the code (" [puppet] - 10https://gerrit.wikimedia.org/r/712919 (https://phabricator.wikimedia.org/T288527) (owner: 10Ema)
[10:20:56] RECOVERY - PHP7 rendering on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.696 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[10:22:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw1444.eqiad.wmnet with reason: new setup
[10:22:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw1444.eqiad.wmnet with reason: new setup
[10:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:46] (03PS2) 10Ema: admin: Add tandic to ldap_only_users for 'wmf' group access [puppet] - 10https://gerrit.wikimedia.org/r/712919 (https://phabricator.wikimedia.org/T288527)
[10:37:45] (03PS1) 10Cathal Mooney: Add cloudsw2-d5-eqiad to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/712930 (https://phabricator.wikimedia.org/T277340)
[10:39:47] (03CR) 10Ayounsi: [C: 03+1] Add cloudsw2-d5-eqiad to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/712930 (https://phabricator.wikimedia.org/T277340) (owner: 10Cathal Mooney)
[10:41:17] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10ema) >>! In T288527#7273178, @TAndic wrote: > My contract will likely be extended next year, should I reapply for access next fiscal year or can we flag my account as ong...
[10:43:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org
[10:45:27] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:46:02] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP Access to wmf user group for TAndic - https://phabricator.wikimedia.org/T288527 (10TAndic) Great! Thanks @ema and @Dzahn for the sleuthing & details and @RhinosF1 for the confidence push!
[10:47:02] (03CR) 10Cathal Mooney: [C: 03+2] Add cloudsw2-d5-eqiad to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/712930 (https://phabricator.wikimedia.org/T277340) (owner: 10Cathal Mooney)
[10:50:26] ^^ juniper alarm is me adding cloudsw2-d5-eqiad. Will clear in a short while once work complete.
[10:53:40] (03PS1) 10Hnowlan: conftool: remove old maps hosts before decom [puppet] - 10https://gerrit.wikimedia.org/r/712932 (https://phabricator.wikimedia.org/T288810)
[10:58:47] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[11:11:35] !log mw1455 - powering on via mgmt - OS install, initial setup (T279309, T273915)
[11:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:45] T273915: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915
[11:11:45] T279309: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309
[11:13:09] (03PS2) 10Zabe: Add extendedconfirmed on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712754 (https://phabricator.wikimedia.org/T287322)
[11:13:39] (03CR) 10Zabe: Add extendedconfirmed on zhwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712754 (https://phabricator.wikimedia.org/T287322) (owner: 10Zabe)
[11:15:41] (03CR) 10A2093064: [C: 03+1] Add extendedconfirmed on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/712754 (https://phabricator.wikimedia.org/T287322) (owner: 10Zabe)
[11:21:37] (03CR) 10Hnowlan: [C: 03+2] scap: make maps2009 the default maps canary [puppet] - 10https://gerrit.wikimedia.org/r/712420 (owner: 10Hnowlan)
[11:31:05] (03CR) 10Hnowlan: [C: 03+2] profile::maps: remove old postgres init script [puppet] - 10https://gerrit.wikimedia.org/r/710933 (owner: 10Hnowlan)
[11:36:10] !log cloudsw1-d5-eqiad - configuring new 2x40G trunk to cloudsw2-d5-eqiad with homer (T277340)
[11:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:19] T277340: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340
[11:46:09] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:46:11] (03CR) 10Marostegui: web_app: Created skeleton code for frontend, with new amendments to api_db and static files (031 comment) [software/bernard] - 10https://gerrit.wikimedia.org/r/703490 (https://phabricator.wikimedia.org/T285438) (owner: 10H.krishna123)
[11:58:47] (Juniper alarm active) resolved: Juniper alarm active - https://alerts.wikimedia.org
[11:58:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10cmooney) cloudsw2-d5-eqiad is now configured and ready for server connections. I believe this task can now be closed?
[11:59:36] !log mwscript extensions/Translate/scripts/refresh-translatable-pages.php --wiki=mediawikiwiki --jobqueue # T288683
[11:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:48] T288683: FuzzyBot overwriting fully translated pages with original text - https://phabricator.wikimedia.org/T288683
[12:11:38] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jelto on cumin1001.eqiad.wmnet for hosts: ` mw1455.eqiad.wmnet ` The log can be found in `/var/log/wmf-...
[12:13:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['mw1450.eqiad.wmnet', 'mw1451.eqiad.wmnet', 'mw1452.eqiad.w...
[12:19:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1448.eqiad.wmnet with reason: REIMAGE
[12:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:11] !log dzahn@cumin1001 conftool action : set/weight=30; selector: name=mw1444.eqiad.wmnet
[12:20:12] 10SRE-swift-storage, 10envoy, 10serviceops: Envoy and swift HEAD with 204 response turns into 503 - https://phabricator.wikimedia.org/T288815 (10fgiunchedi)
[12:20:19] (03PS1) 10Dzahn: site: remove mw1444 from 'insetup' role [puppet] - 10https://gerrit.wikimedia.org/r/712939 (https://phabricator.wikimedia.org/T279309)
[12:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:32] !log mw1444 - scap pull, pooled as new API server for the first time
[12:21:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1449.eqiad.wmnet with reason: REIMAGE
[12:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:41] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn)
[12:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:46] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1444.eqiad.wmnet
[12:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1448.eqiad.wmnet with reason: REIMAGE
[12:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:43] (03CR) 10H.krishna123: "Commented. We can merge it so that we can get tox sorted for CI." [software/bernard] - 10https://gerrit.wikimedia.org/r/703490 (https://phabricator.wikimedia.org/T285438) (owner: 10H.krishna123)
[12:23:25] (03CR) 10H.krishna123: web_app: Created skeleton code for frontend, with new amendments to api_db and static files (031 comment) [software/bernard] - 10https://gerrit.wikimedia.org/r/703490 (https://phabricator.wikimedia.org/T285438) (owner: 10H.krishna123)
[12:24:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1449.eqiad.wmnet with reason: REIMAGE
[12:24:36] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1455.eqiad.wmnet with reason: REIMAGE
[12:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:51] !log mwscript extensions/Translate/scripts/refresh-translatable-pages.php --wiki=commonswiki --jobqueue # T288683
[12:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:59] T288683: FuzzyBot overwriting fully translated pages with original text - https://phabricator.wikimedia.org/T288683
[12:26:48] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1455.eqiad.wmnet with reason: REIMAGE
[12:26:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1450.eqiad.wmnet with reason: REIMAGE
[12:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:10] (03CR) 10Ema: admin: Add tandic to ldap_only_users for 'wmf' group access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712919 (https://phabricator.wikimedia.org/T288527) (owner: 10Ema)
[12:28:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1451.eqiad.wmnet with reason: REIMAGE
[12:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:00] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1450.eqiad.wmnet with reason: REIMAGE
[12:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:29] ACKNOWLEDGEMENT - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:29:29] ACKNOWLEDGEMENT - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:29:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw1450.eqiad.wmnet with reason: new setup
[12:29:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw1450.eqiad.wmnet with reason: new setup
[12:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1452.eqiad.wmnet with reason: REIMAGE
[12:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:13] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1451.eqiad.wmnet with reason: REIMAGE
[12:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:28] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1452.eqiad.wmnet with reason: REIMAGE
[12:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:15] (03CR) 10Dzahn: [C: 03+1] "patch lgtm and matches ticket (google has this email address, not sure if Janstee needs to confirm it though)" [puppet] - 10https://gerrit.wikimedia.org/r/712919 (https://phabricator.wikimedia.org/T288527) (owner: 10Ema)
[12:35:16] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1455.eqiad.wmnet'] ` and were **ALL** successful.
[12:35:48] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] web_app: Created skeleton code for frontend, with new amendments to api_db and static files [software/bernard] - 10https://gerrit.wikimedia.org/r/703490 (https://phabricator.wikimedia.org/T285438) (owner: 10H.krishna123)
[12:39:23] PROBLEM - Check no envoy runtime configuration is left persistent on thanos-fe1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 434 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[12:40:57] that's me ^ T288815
[12:40:58] T288815: Envoy and swift HEAD with 204 response turns into 503 - https://phabricator.wikimedia.org/T288815
[12:42:01] PROBLEM - Check no envoy runtime configuration is left persistent on thanos-fe1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 434 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[12:42:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1450.eqiad.wmnet', 'mw1451.eqiad.wmnet', 'mw1452.eqiad.wmnet'] ` and were **ALL** successful.
[12:44:00] (03CR) 10Dzahn: [C: 03+2] site: remove mw1444 from 'insetup' role [puppet] - 10https://gerrit.wikimedia.org/r/712939 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [12:44:06] (03PS2) 10Dzahn: site: remove mw1444 from 'insetup' role [puppet] - 10https://gerrit.wikimedia.org/r/712939 (https://phabricator.wikimedia.org/T279309) [12:47:35] PROBLEM - Check no envoy runtime configuration is left persistent on thanos-fe2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 434 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [12:52:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1454.eqiad.wmnet with reason: REIMAGE [12:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: new setup [12:53:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: new setup [12:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:54] !log set runtime envoy.reloadable_features.strict_1xx_and_204_response_headers=false on thanos-fe* - T288815 [12:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:01] T288815: Envoy and swift HEAD with 204 response turns into 503 - https://phabricator.wikimedia.org/T288815 [12:54:26] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1454.eqiad.wmnet with reason: REIMAGE [12:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:45] Daimona: ping ref T288774 [12:55:34] Pong [13:03:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: 
(Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) [13:06:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) [13:08:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) >>! In T273915#7276401, @Cmjohnson wrote: > @dzahn no worries, the on-stie work is done but needs firmware updates and the passwords reset. I'll have these... [13:09:40] (03PS4) 10Jelto: site/conftool: add mw1447,mw1448,mw1449,mw1450 as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) [13:13:27] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Jelto) [13:13:51] (03PS5) 10Jelto: site/conftool: add mw1447,mw1448,mw1449,mw1450 as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) [13:14:27] (03CR) 10Dzahn: [C: 03+1] site/conftool: add mw1447,mw1448,mw1449,mw1450 as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [13:21:37] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1447-1449].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309 [13:21:40] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1447-1449].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309 [13:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:54] !log jelto@cumin1001 START - Cookbook 
sre.hosts.downtime for 2:00:00 on mw1450.eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309 [13:21:56] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1450.eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309 [13:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:43] !log mw1453 - manual powercycle after it never rebooted when the reimage cookbook tries to trigger one [13:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:43] (03CR) 10Jelto: [C: 03+2] site/conftool: add mw1447,mw1448,mw1449,mw1450 as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/706485 (https://phabricator.wikimedia.org/T279309) (owner: 10Jelto) [13:25:14] (03PS1) 10Ssingh: site: switch doh4002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/712940 [13:26:01] (03CR) 10Dzahn: [C: 03+1] site: switch doh4002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/712940 (owner: 10Ssingh) [13:26:54] (03CR) 10Ssingh: [C: 03+2] site: switch doh4002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/712940 (owner: 10Ssingh) [13:27:37] (03PS2) 10Ssingh: site: switch doh4002 to O:wikidough [puppet] - 10https://gerrit.wikimedia.org/r/712940 [13:30:02] (03CR) 10Ssingh: [C: 03+2] Add doh4002 to BGP anycast in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/712400 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [13:30:10] 10SRE-swift-storage, 10envoy, 10serviceops: Envoy and swift HEAD with 204 response turns into 503 - https://phabricator.wikimedia.org/T288815 (10fgiunchedi) [13:30:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10dcaro) 
There's one alert firing with regards to that switch, is that expected? https://alerts.wikimedia.org/?q=instance%3Dcloudsw2-d5-eqiad.mgmt.eqiad.wmnet [13:30:47] (03Merged) 10jenkins-bot: Add doh4002 to BGP anycast in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/712400 (https://phabricator.wikimedia.org/T283503) (owner: 10Ssingh) [13:33:01] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:33:23] uhm [13:34:22] should be resolving soon [13:35:46] !log ran homer for Gerrit 712400: Set up BGP peering to doh4002 in ulsfo [13:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:51] All looks good on the router there sukhe :) [13:46:53] topranks: thanks for checking, and yes :) I should have run homer *before* I updated the Wikidough host! [13:48:12] ah it's no problem, impossible to do both at the exact same moment, one or the other doesn't matter! [13:51:23] ah? I got this error fwiw: https://puppetboard.wikimedia.org/report/doh4002.wikimedia.org/28e2b9bc6ac0581d9e0c5f44dc4727059e79b1ea [13:52:29] oh but it's possible that it carried over from the auditd failure, which is T287266 [13:52:29] T287266: Unexpected auditd service restart failure - https://phabricator.wikimedia.org/T287266 [14:03:30] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [14:04:02] 10SRE, 10Traffic: Offer Wikidough as an anycasted service - https://phabricator.wikimedia.org/T283027 (10ssingh) 05Open→03Resolved Marking this as resolved as we are now offering Wikidough from all our PoPs as an anycasted service. Thanks to @bblack, @ayounsi, and @cmooney for all their help! 
[14:18:53] (03PS1) 10Dzahn: site/conftool: add mw1451 through mw1456 as apppservers, A1, A8 [puppet] - 10https://gerrit.wikimedia.org/r/712970 (https://phabricator.wikimedia.org/T279309) [14:19:38] (03CR) 10jerkins-bot: [V: 04-1] site/conftool: add mw1451 through mw1456 as apppservers, A1, A8 [puppet] - 10https://gerrit.wikimedia.org/r/712970 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [14:20:41] (03PS2) 10Dzahn: site/conftool: add mw1451 through mw1456 as apppservers, A1, A8 [puppet] - 10https://gerrit.wikimedia.org/r/712970 (https://phabricator.wikimedia.org/T279309) [14:21:35] (03CR) 10jerkins-bot: [V: 04-1] site/conftool: add mw1451 through mw1456 as apppservers, A1, A8 [puppet] - 10https://gerrit.wikimedia.org/r/712970 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [14:21:57] (03PS3) 10Dzahn: site/conftool: add mw1451 through mw1456 as apppservers, A1, A8 [puppet] - 10https://gerrit.wikimedia.org/r/712970 (https://phabricator.wikimedia.org/T279309) [14:24:39] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:24:59] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/712970 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [14:25:33] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10elukey) @herron +1 for the new task, opening one [14:26:06] (03CR) 10Dzahn: [C: 03+2] site/conftool: add mw1451 through mw1456 as apppservers, A1, A8 [puppet] - 10https://gerrit.wikimedia.org/r/712970 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [14:29:17] RECOVERY - Work requests waiting in Zuul 
Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:30:04] 10SRE, 10Services (watching), 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10elukey) [14:31:08] PROBLEM - Host cloudsw1-c8-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:34] PROBLEM - Check systemd state on an-worker1082 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:36] PROBLEM - Check systemd state on mw1384 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:56] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:32:08] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:32:20] PROBLEM - Host cloudsw1-d5-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:32:24] uhoh? 
[14:32:26] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:32:58] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01318 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:33:02] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:33:04] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:33:25] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:33:30] 10SRE, 10Services (watching), 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10elukey) Ran the following command to day on kafka-main2004: ` ./topicmappr rebuild --out-path /home/elukey/T225005/json --force-rebuild --zk-addr conf2004.cod... [14:33:40] it's all eqiad so far [14:33:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:33:48] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:33:52] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:33:57] looks like eqiad mgmt (?) 
network, cc topranks XioNoX [14:34:02] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:34:02] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:34:22] PROBLEM - Host mr1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:24] * topranks thanks godog... looking. [14:34:55] godog: that wouldn't explain the dbproxy* alerts? [14:34:55] sure np, yeah icinga is reporting a bunch of unreachable mgmt hosts [14:35:43] majavah: that's true, but they don't have to be related [14:36:12] Looks like mr1-eqiad is down, checking serial console now for signs of life. [14:36:14] I can confirm it's all mgmt hosts that are alerting. Was just wondering why I could not ssh to a mgmt host anymore [14:36:32] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:36:43] majavah: to my knowledge that would not explain the dbproxy alerts no [14:37:06] RECOVERY - Host asw2-a-eqiad is UP: PING WARNING - Packet loss = 80%, RTA = 1.17 ms [14:37:08] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [14:37:08] RECOVERY - Host asw2-b-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [14:37:17] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [14:37:25] RECOVERY - Host cloudsw1-d5-eqiad.mgmt.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [14:37:25] RECOVERY - Host cloudsw1-c8-eqiad.mgmt.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [14:37:26] mmm the alerts like "asw2-b-eqiad is UP [14:37:29] are a bit confusing [14:37:39] RECOVERY - Host asw2-d-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [14:37:42] it seems as if the entire virtual switch is down [14:37:43] RECOVERY - haproxy failover on dbproxy1020 is OK: OK
check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [14:38:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:38:09] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [14:38:13] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:38:23] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [14:38:23] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [14:38:59] over 900 "unreachable"/"unknown" alerts are all recovered again, heh [14:39:22] What?? [14:39:49] mr1-eqiad is looking ok (via console)... and pingable again from cr. [14:39:50] Which rack is that? [14:39:53] uptime shows 94 weeks. [14:39:57] RECOVERY - Host mr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.41 ms [14:40:15] it was all 4 A,B,C and D [14:40:17] PROBLEM - Check systemd state on mw1419 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:25] Buf [14:40:26] mr1-eqiad is the gw for all the oob management in eqiad. 
[14:40:35] PROBLEM - Check systemd state on an-worker1117 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:38] fixing that ferm service on mw1419 [14:40:56] topranks: qq - how do we have to read something like "PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL [14:41:04] Our haproxy on dbproxy1017 and dbproxy1021 show everything back up [14:41:04] !log mw1419 - started ferm [14:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:29] PROBLEM - Check systemd state on an-worker1095 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:41] elukey: In terms of how to interpret it? It means that Icinga cannot ping that device. [14:41:49] ferm starts fine now, it had failed because DNS was down for a moment [14:41:53] RECOVERY - Check systemd state on mw1419 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:54] So definitely db1117 lost connectivity [14:42:00] I can see that on the proxies' log [14:42:01] DNS query for 'prometheus1004.eqiad.wmnet' failed: query timed out < but OK now [14:42:09] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [14:42:14] topranks: exactly yes, in my mind I can't see how the mr1 being down could cause it [14:42:20] For physical servers, switches, routers Icinga monitors on their dedicated management ports (so mr1-eqiad problem may make sense). [14:42:27] ahhhh [14:42:40] topranks: But it wasn't only management no? 
[14:42:41] I think a bunch of the alerts were all due to DNS resolution [14:42:47] let me reschedule a couple [14:42:58] topranks: okok because for servers we have the ".mgmt" mentioned in the alert, that is clear [14:43:13] We had this on one of the haproxy hosts: Aug 13 14:30:04 dbproxy1017 haproxy[30391]: Backup Server mariadb/db1117:3325 is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue. [14:43:18] Ok yes. It's not as clear for network devices. [14:43:32] But they are all managed "out of band", so the same as a server with ".mgmt" in the hostname. [14:43:51] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=mw144[7-9].eqiad.wmnet [14:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:21] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1450.eqiad.wmnet [14:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:32] I am still learning my way around - it does seem mr1 connects via asw2-a8-eqiad to get to CRs (and back to Icinga/other nodes) [14:44:34] !log an-worker1082 - started ferm (was failed due to DNS hiccup) [14:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:05] RECOVERY - Check systemd state on an-worker1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:23] !log an-worker1095 - started ferm, service failed [14:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:29] RECOVERY - Check systemd state on an-worker1095 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:51] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
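The ferm recoveries above all follow one pattern: ferm resolves hostnames in its ruleset at start, so the brief DNS outage left the unit failed and a plain restart fixed it (see the "DNS query ... query timed out" line quoted above). A hedged triage sketch; the helper function is hypothetical, only the log pattern comes from the channel:

```shell
# Decide from a ferm journal line whether the failure matches the transient
# DNS-timeout pattern seen above; only then is a plain restart
# (systemctl reset-failed ferm && systemctl start ferm) the right fix.
ferm_failure_is_dns() {
  case "$1" in
    *"DNS query"*"query timed out"*) echo yes ;;
    *) echo no ;;
  esac
}

ferm_failure_is_dns "DNS query for 'prometheus1004.eqiad.wmnet' failed: query timed out"
```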
[14:46:56] !log jelto@cumin1001 conftool action : set/weight=25; selector: name=mw144[7-9].eqiad.wmnet [14:46:59] PROBLEM - Check systemd state on db1152 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:13] !log jelto@cumin1001 conftool action : set/weight=25; selector: name=mw1450.eqiad.wmnet [14:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:21] no more other ferm failures besides the ones logged above [14:47:26] but that was due to the outage [14:48:08] I'll reschedule the ssh check on mgmt [14:48:09] noticing many icinga checks with disabled notifications again (please don't, we always forget them) [14:48:23] still investigating... symptoms make an issue with mr1-eqiad, or path from CRs to it, likely, but not found anything concrete thus far. [14:49:05] PROBLEM - Check systemd state on an-worker1079 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:12] !log an-worker1079 - started failed ferm [14:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:23] yeah there were a bunch of puppet failures too, which I can't explain yet [14:52:14] godog: DNS lookup of the puppetmaster I bet [14:53:00] PROBLEM - Check systemd state on mw1452 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:23] these should be new nodes in theory --^ [14:53:28] they are [14:53:38] unrelated, just failed downtimes [14:53:43] mutante: yeah could be, I wonder if related to the mgmt issue [14:54:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1451-1452,1454-1455].eqiad.wmnet with reason: new setup [14:54:23] 
!log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1451-1452,1454-1455].eqiad.wmnet with reason: new setup [14:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:33] godog: yes, it happened at the same time [14:54:37] and went away [14:55:13] ack [14:55:34] I was also wondering about alert1001's port/connectivity but wouldn't explain only why mgmt [14:55:44] just like other checks like "is toolserver.org up" and whatnot, DNS [14:56:01] for just a moment [14:56:10] ACKNOWLEDGEMENT - memcached socket on mw1447 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory daniel_zahn new install https://wikitech.wikimedia.org/wiki/Memcached [14:56:10] ACKNOWLEDGEMENT - memcached socket on mw1448 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory daniel_zahn new install https://wikitech.wikimedia.org/wiki/Memcached [14:56:10] ACKNOWLEDGEMENT - memcached socket on mw1449 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory daniel_zahn new install https://wikitech.wikimedia.org/wiki/Memcached [14:56:10] ACKNOWLEDGEMENT - memcached socket on mw1450 is CRITICAL: connect to file socket /run/memcached/memcached.sock: No such file or directory daniel_zahn new install https://wikitech.wikimedia.org/wiki/Memcached [14:56:10] ACKNOWLEDGEMENT - Check systemd state on mw1451 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:10] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1451 is CRITICAL: Host mw1451 is not in mediawiki-installation dsh group daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:56:11] ACKNOWLEDGEMENT - Check systemd 
state on mw1452 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:11] ACKNOWLEDGEMENT - PHP7 rendering on mw1454 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 458 bytes in 0.001 second response time daniel_zahn new install https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:56:12] ACKNOWLEDGEMENT - Memcached on mw1455 is CRITICAL: connect to address 10.64.0.62 and port 11210: Connection refused daniel_zahn new install https://wikitech.wikimedia.org/wiki/Memcached [14:56:21] cleaning up what is unrelated [14:57:01] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw144[7-9].eqiad.wmnet [14:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:26] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw1450.eqiad.wmnet [14:57:30] RECOVERY - Check systemd state on an-worker1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:44] The link from asw2-a8-eqiad to mr1-eqiad - over which all traffic from our normal "private" subnets to the management range in eqiad flows - did go down, so surely caused this. [14:57:54] The switch itself reports the flap: [14:58:05] https://www.irccloud.com/pastebin/om25MTnq/ [14:58:21] 14:36:58 was the timestamp (for recovery). [14:58:47] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org [14:58:49] Oddly I can't find logs (they've turned over on the switch and either not present or my Kibana fu is failing me as I can't see them there.)
[14:58:58] ack, thanks topranks [15:00:16] !log an-worker1117, an-worker1118 - started failed ferm (why are these slowly trickling in ) [15:00:21] topranks: yeah can't find asw2-a-eqiad logs either, there's a hole afaics https://logstash.wikimedia.org/goto/f2bfb9812e6168e41a637e4dbc4c9e45 [15:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:36] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:48] I can see the logs alright, I just don't see anything relating to this particular port. [15:01:19] for instance: [15:01:22] https://usercontent.irccloud-cdn.com/file/5vTDjRbh/image.png [15:01:43] (03PS1) 10Btullis: WIP: Begin work on the alluxio puppet classes [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [15:02:02] marostegui: db1152 - ferm service is failed and needs to be started [15:02:21] mutante: you do it or I do it? [15:02:33] The port that connects mr1-eqiad is ge-8/0/10 btw. The one in those logs in the screenshot is unrelated (server in reboot cycle I'm guessing). [15:02:43] !log etherpad1002 - started failed ferm [15:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:06] marostegui: done now [15:03:16] RECOVERY - Check systemd state on db1152 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:16] mutante: thanks! [15:04:08] RECOVERY - Check systemd state on an-worker1117 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:19] godog: sry I misunderstood. there is indeed a gap. [15:06:30] which makes sense, as it will try to use the path that was broken to send them.
[15:06:43] we were unlucky the local logs on the device rolled over during that window also :( [15:10:09] topranks: unfortunate indeed :| [15:10:46] I'm surprised though the switch sends logs over to mgmt ? [15:10:50] anyways not for now [15:11:28] it's usually best tbh, in that it is using a "separate" network to production traffic to send logs. So like that should be ok even if we have DDOS. [15:11:40] There may be an argument for more redundancy here though. [15:12:45] fair! [15:13:08] in unrelated news, there's a lvs backend alert on for thanos-fe2002, which AFAICS is up even according to pybal [15:13:23] to confirm this theory I"m going to restart pybal on lvs2010 (the standby) [15:14:40] !log restart pybal on lvs2010, to clear CRITICAL - thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled [15:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:10] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001144 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:15:50] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:16:17] All I can say for certain is that this link flapped between 14:29 and 14:36 - https://netbox.wikimedia.org/dcim/cables/1784/ [15:16:26] siiiigh so a pybal restart "fixes" the alert [15:16:44] The devices both sides have "rolled over" their local logs so I've no insight as to why that happened. Nobody was working on site. 
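The stale "marked down but pooled" state above was cleared by restarting pybal on the standby LVS first, then the active one. A small sketch that pulls the pool and backend out of the alert text before doing so; this is parsing only, and the field layout is assumed from the alert exactly as quoted in the channel:

```shell
# Extract "<pool> <server>" from a PyBal alert of the form
# "<pool>: Servers <server> are marked down but pooled".
parse_pybal_alert() {
  local msg="$1"
  local pool="${msg%%:*}"          # text before the first colon
  local rest="${msg#*Servers }"    # text after "Servers "
  local server="${rest%% *}"       # first whitespace-delimited token
  printf '%s %s\n' "$pool" "$server"
}

parse_pybal_alert "thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled"
```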
[15:17:03] We also have no logs to remote syslog as the fault prevented them writing to remote log server :( [15:18:04] yeah not a whole lot to go on [15:18:43] ok lvs2009's turn [15:18:52] !log restart pybal on lvs2009, to clear CRITICAL - thanos-swift_443: Servers thanos-fe2002.codfw.wmnet are marked down but pooled [15:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:10] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:22:18] RECOVERY - Check systemd state on mw1452 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:31] Just remembered it's Friday 13th. [15:24:13] :) [15:25:18] lol yea, and the issues are so on time we could almost alert "attention, it's Friday around 5pm" [15:30:23] 10SRE, 10Services (watching), 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10elukey) All the files generated are like: ` elukey@kafka-main2001:~$ cat T225005/json/eqiad.mediawiki.page-move.json {"version":1,"partitions":[{"topic":"eqi... [15:30:29] !log mw1453 - racadm serveraction powercycle (down and was working until right before the switch issue) [15:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:06] topranks: 1453 worked..then shut down right before the switch issue.. and is still down, tried to start it [15:31:37] booting now [15:31:41] hmm ok... when you say "tried to start it", is the iDRAC on it still working? [15:31:54] it is working now, so I could powercycle [15:32:02] Hmm ok. [15:32:09] it wasnt working during that moment [15:32:25] actually it was first "why is this down" then "why is mgmt also down" and then "oooh.. 
the switch" [15:32:51] was installing like 4 at once, others did not have issues [15:33:41] mw1453 is connected to the same switch as the link that failed. [15:34:01] But - outside of a bug - I can't see how anything it would have sent could make the other link fail. [15:34:24] yea, it doesnt seem to make sense, yet that timing and same switch [15:35:06] I dont see it booting yet. maybe it died. will try the reimage cookbook again [15:37:48] ok cool. I agree the timing is very suspect alright, however unlikely it should be for one to affect the other. [15:39:40] !log mw1451, mw1452, mw1454 - rebooting after reimage, memcached needs one [15:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:10] yea, could be totally random and just noticing the symptoms [15:59:17] topranks: I feel like this server has issues, it doesn't get rebooted by the cookbook and manually rebooting also seems to fail, but oh well, we don't see another switch issue so.. I'll just leave it be for now [16:00:01] yeah from the switch logs it seems the network port was flapping for quite a time before - and no problems. So I don't think we need to be overly concerned. [16:01:09] ACK, *waves*, going afk now [16:04:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) mw1453 seems to be a special case. Unlike the other hosts it would not reboot when the cookbook tries to reboot it and manually restarting also seemed to be... [16:31:07] (03PS2) 10Bearloga: statistics::discovery: Stop metric calculation [puppet] - 10https://gerrit.wikimedia.org/r/712422 (https://phabricator.wikimedia.org/T227782) [16:39:11] 10SRE, 10Infrastructure-Foundations, 10netops: Link failure between mr1-eqiad and asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (10cmooney) Seems I was incorrect about how the switches manage logs. 
It keeps older log files similar to a typical unix system, gzips them up. Logs from...
[16:42:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30561/console" [puppet] - 10https://gerrit.wikimedia.org/r/712422 (https://phabricator.wikimedia.org/T227782) (owner: 10Bearloga)
[16:42:47] (03CR) 10Elukey: [V: 03+1 C: 03+2] statistics::discovery: Stop metric calculation [puppet] - 10https://gerrit.wikimedia.org/r/712422 (https://phabricator.wikimedia.org/T227782) (owner: 10Bearloga)
[16:42:53] 10SRE, 10Infrastructure-Foundations, 10netops: Link failure between mr1-eqiad and asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (10cmooney) Logs from mr1-eqiad: ` Aug 13 14:28:53 mr1-eqiad rpd[1456]: RPD_OSPF_NBRUP: OSPF neighbor 208.80.154.204 (realm ospf-v2 ge-0/0/1.401 area 0.0....
[16:55:12] PROBLEM - Apache HTTP on mw1455 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 927 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[16:58:38] ^ 👀
[17:01:16] PROBLEM - mediawiki-installation DSH group on mw1452 is CRITICAL: Host mw1452 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[17:02:29] PROBLEM - Check systemd state on mw1455 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service,systemd-sysusers.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:03:48] PROBLEM - mediawiki-installation DSH group on mw1454 is CRITICAL: Host mw1454 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[17:04:40] PROBLEM - Check that envoy is running on mw1455 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[17:05:36] 10SRE, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10phuedx)
[17:05:44] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1451-1452,1454-1455].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309
[17:05:48] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1451-1452,1454-1455].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309
[17:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:05] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Jelto)
[17:18:47] (Storage over 90%) resolved: Storage over 90% - https://alerts.wikimedia.org
[17:19:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10cmooney) @dcaro thanks. It's nothing to worry about, the other one (cloudsw2-c8-eqiad) is showing the same. I'll touch base with @ayounsi next week and s...
[17:28:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: Disk failure for elastic1039.eqiad.wmnet - https://phabricator.wikimedia.org/T286497 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts: ` elastic1039.eqiad.wmnet `...
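(Aside on the "Check systemd state ... degraded: The following units failed: envoyproxy.service,systemd-sysusers.service" alerts above: a check of that shape inspects the output of `systemctl --failed`. A small sketch of the parsing step, not the actual Icinga plugin; the sample text is illustrative.)

```python
def parse_failed_units(systemctl_output):
    """Return unit names from `systemctl --failed --no-legend --plain`
    output: the unit name is the first whitespace-separated column."""
    return [line.split()[0]
            for line in systemctl_output.splitlines()
            if line.strip()]

# Illustrative sample resembling the mw1455 failure above.
sample = (
    "envoyproxy.service loaded failed failed Envoy proxy\n"
    "systemd-sysusers.service loaded failed failed Create system users\n"
)
print(parse_failed_units(sample))
# ['envoyproxy.service', 'systemd-sysusers.service']
```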
[17:32:12] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on mw[1451-1452,1454-1455].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309
[17:32:12] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on mw[1451-1452,1454-1455].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309
[17:32:15] I just downtimed the last 6 new mediawiki appservers for the weekend as they are not fully finished and mutan.te maybe wants to do some more troubleshooting on mw1453
[17:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:50] win 30
[17:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
[18:01:55] all storage?
[18:02:47] something cloud related it seems
[18:04:40] Krinkle: cloudsw2 devices or something else?
[18:04:48] indeed
[18:05:56] ah, they're new devices still being set up and not used for anything important yet
[18:10:08] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dancy) Note: There is always a delay of 3 seconds before the 500 response is returned.
[18:16:36] 10SRE, 10MW-on-K8s, 10serviceops: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Krinkle)
[18:21:52] 10SRE, 10Services (watching), 10User-herron: Rebalance kafka partitions in main-{eqiad,codfw} clusters - https://phabricator.wikimedia.org/T288825 (10herron) Nice! Regarding upstream improvements, on a related note there will hopefully in the future be better control over partition movement within Kafka its...
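(Aside on the downtime log entries above: host expressions like `mw[1451-1452,1454-1455].eqiad.wmnet` are ClusterShell NodeSet syntax as used by cumin. A minimal expander sketch for this simple single-bracket form only; the full NodeSet grammar — nesting, zero-padding, multiple groups — is not handled here.)

```python
import re

def expand_hosts(expr):
    """Expand a simple cumin-style host expression such as
    'mw[1451-1452,1454-1455].eqiad.wmnet' into individual hostnames.
    Handles at most one [ranges] group; other strings pass through."""
    m = re.match(r"^(.*)\[([\d,-]+)\](.*)$", expr)
    if not m:
        return [expr]
    prefix, body, suffix = m.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            hosts.extend(f"{prefix}{n}{suffix}" for n in range(int(lo), int(hi) + 1))
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts

print(expand_hosts("mw[1451-1452,1454-1455].eqiad.wmnet"))
# ['mw1451.eqiad.wmnet', 'mw1452.eqiad.wmnet', 'mw1454.eqiad.wmnet', 'mw1455.eqiad.wmnet']
```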
[18:27:13] 10SRE-swift-storage, 10envoy, 10serviceops: Envoy and swift HEAD with 204 response turns into 503 - https://phabricator.wikimedia.org/T288815 (10RLazarus) Summarizing the discussion from IRC: - "Permanent" is relative -- it looks like this only exists as a runtime option for temporary backward compatibility...
[18:28:48] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Krinkle)
[18:29:34] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (10dancy) favicon.ico issue is an example of T288848
[18:34:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: Disk failure for elastic1039.eqiad.wmnet - https://phabricator.wikimedia.org/T286497 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1039.eqiad.wmnet'] ` Of which those **FAILED**: ` ['elastic1039.eqiad...
[18:43:23] !log reprepro: uploaded gdnsd-3.8.0-1~wmf1 to buster-wikimedia - T252132
[18:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:32] T252132: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132
[18:44:47] bblack: woho!
[18:50:05] (03PS1) 10Andrew Bogott: Nova-fullstack: improve DNS resolution tests [puppet] - 10https://gerrit.wikimedia.org/r/712989 (https://phabricator.wikimedia.org/T288854)
[18:52:37] sukhe: I'm waiting for early next week to deploy it on the actual DNS servers. The CI update is uploaded too and could theoretically land before that, which doesn't impact much other than your patch.
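(Aside on T288815 above, where a 503 shows up in place of swift's 204 reply to HEAD: when debugging that class of problem it helps to confirm the status the backend itself returns, before any proxy in the path can rewrite it. A self-contained sketch with a toy local server standing in for swift; host/port and handler are illustrative, not the real setup.)

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoContentHandler(BaseHTTPRequestHandler):
    # Toy backend answering HEAD with 204, loosely mimicking swift's
    # behaviour discussed in T288815.
    def do_HEAD(self):
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):  # keep the sketch quiet
        pass

def head_status(host, port, path="/"):
    """Return the raw status the backend gives to a HEAD request."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.request("HEAD", path)
    status = conn.getresponse().status
    conn.close()
    return status

server = HTTPServer(("127.0.0.1", 0), NoContentHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
status = head_status("127.0.0.1", server.server_address[1])
server.shutdown()
print(status)  # 204 straight from the backend; a 503 would implicate the proxy
```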
[18:54:56] (03PS1) 10Jcrespo: mediabackups: Add mysql grants for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/712993 (https://phabricator.wikimedia.org/T276442)
[18:55:02] 10SRE, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10phuedx)
[19:10:17] bblack: yep, that's fine, let's wait till Monday. thanks! just saw the release notes
[19:16:55] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 4 others: Deploy query builder to microsites (on top of the wdqs-ui) - https://phabricator.wikimedia.org/T266703 (10sbassett)
[19:32:38] (03CR) 10RLazarus: [C: 03+2] icinga: Tweak --services API [puppet] - 10https://gerrit.wikimedia.org/r/710121 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus)
[19:37:35] (03CR) 10Michael DiPietro: [C: 03+1] Nova-fullstack: improve DNS resolution tests [puppet] - 10https://gerrit.wikimedia.org/r/712989 (https://phabricator.wikimedia.org/T288854) (owner: 10Andrew Bogott)
[19:58:05] (03PS4) 10Ladsgroup: microsites: Add Query Builder subpage to wdqs gui [puppet] - 10https://gerrit.wikimedia.org/r/700317 (https://phabricator.wikimedia.org/T266703)
[19:58:26] (03CR) 10Ladsgroup: "The security review is done, we can deploy this now \o/" [puppet] - 10https://gerrit.wikimedia.org/r/700317 (https://phabricator.wikimedia.org/T266703) (owner: 10Ladsgroup)
[20:22:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1453.eqiad.wmnet ` The log can be found in `/var/log/w...
[20:41:53] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:49:20] (03PS2) 10Andrew Bogott: Nova-fullstack: improve DNS resolution tests [puppet] - 10https://gerrit.wikimedia.org/r/712989 (https://phabricator.wikimedia.org/T288854)
[20:49:22] (03PS1) 10Andrew Bogott: nova-fullstack: send logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/713006
[20:50:44] (03PS3) 10Andrew Bogott: Nova-fullstack: improve DNS resolution tests [puppet] - 10https://gerrit.wikimedia.org/r/712989 (https://phabricator.wikimedia.org/T288854)
[20:50:46] (03PS2) 10Andrew Bogott: nova-fullstack: send logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/713006
[20:52:04] (03CR) 10Andrew Bogott: [C: 03+2] Nova-fullstack: improve DNS resolution tests [puppet] - 10https://gerrit.wikimedia.org/r/712989 (https://phabricator.wikimedia.org/T288854) (owner: 10Andrew Bogott)
[20:52:50] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: send logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/713006 (owner: 10Andrew Bogott)
[20:53:35] (03CR) 10Krinkle: [C: 03+1] sre.switchdc.mediawiki: Run the warmup cache script at least 6 times [cookbooks] - 10https://gerrit.wikimedia.org/r/707457 (https://phabricator.wikimedia.org/T285802) (owner: 10Legoktm)
[20:54:27] (03PS1) 10Andrew Bogott: nova_fullstack_test.py: run through Black [puppet] - 10https://gerrit.wikimedia.org/r/713007
[20:59:40] (03PS3) 10Andrew Bogott: nova-fullstack: send logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/713006
[21:00:53] (03CR) 10Andrew Bogott: [C: 03+2] nova_fullstack_test.py: run through Black [puppet] - 10https://gerrit.wikimedia.org/r/713007 (owner: 10Andrew Bogott)
[21:02:04] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: send logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/713006 (owner: 10Andrew Bogott)
[21:04:26] (03PS1) 10Andrew Bogott: nova-fullstack: strip ip strings before comparing. [puppet] - 10https://gerrit.wikimedia.org/r/713009 (https://phabricator.wikimedia.org/T288854)
[21:05:43] (03PS2) 10Andrew Bogott: nova-fullstack: strip ip strings before comparing. [puppet] - 10https://gerrit.wikimedia.org/r/713009 (https://phabricator.wikimedia.org/T288854)
[21:07:49] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: strip ip strings before comparing. [puppet] - 10https://gerrit.wikimedia.org/r/713009 (https://phabricator.wikimedia.org/T288854) (owner: 10Andrew Bogott)
[21:08:38] I think the mailman bounce handler crashed, I'll take a look when I'm back at my computer next
[21:14:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1453.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1453.eqiad.wmnet'] `
[21:33:21] (03CR) 10Bstorm: [C: 03+1] wmsc.puppet_alert: force utf-8 encoding when opening files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/711106 (https://phabricator.wikimedia.org/T288508) (owner: 10David Caro)
[21:36:52] (03CR) 10Bstorm: [C: 03+1] wmcs.vps.puppet_alert: allow disabling the puppet alerts [puppet] - 10https://gerrit.wikimedia.org/r/712923 (owner: 10David Caro)
[21:36:57] (03PS1) 10Jdlrobson: Enable page previews on German Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713013 (https://phabricator.wikimedia.org/T264305)
[21:42:59] (03PS4) 10Andrew Bogott: nova-fullstack: send logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/713006
[21:45:07] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: send logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/713006 (owner: 10Andrew Bogott)
[21:47:44] (03PS5) 10Andrew Bogott: nova-fullstack: send logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/713006
[21:49:32] (03CR) 10jerkins-bot: [V: 04-1] nova-fullstack: send logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/713006 (owner: 10Andrew Bogott)
[21:50:32] (03CR) 10Bstorm: [C: 03+1] wmcs.vps.puppet_alert: get the puppet files from config [puppet] - 10https://gerrit.wikimedia.org/r/712922 (https://phabricator.wikimedia.org/T288805) (owner: 10David Caro)
[21:52:45] (Storage over 90%) firing: Storage over 90% - https://alerts.wikimedia.org
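(Aside on the "nova-fullstack: strip ip strings before comparing" patch above: the bug class it fixes is an IP read from one source with trailing whitespace failing a string comparison against the same address from another source. A sketch of the robust comparison, assumed rather than copied from the actual patch; using the stdlib `ipaddress` module also catches textual variants of the same address.)

```python
import ipaddress

def same_ip(a, b):
    """Compare two IP address strings, tolerating surrounding whitespace
    (e.g. a trailing newline from command output)."""
    return ipaddress.ip_address(a.strip()) == ipaddress.ip_address(b.strip())

print(same_ip("192.0.2.1\n", " 192.0.2.1"))  # True
```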