[02:14:48] (03CR) 10Gergő Tisza: "Would it be simpler to add the dblist reference to wikipedias.yaml? I don't think we have any non-Growth Wikipedias at this point." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720074 (https://phabricator.wikimedia.org/T290582) (owner: 10Urbanecm) [02:18:44] (03CR) 10Gergő Tisza: "Commit subject must be followed by an empty line." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722970 (https://phabricator.wikimedia.org/T286000) (owner: 10Sharvaniharan) [02:19:12] (03PS2) 10Gergő Tisza: Stream config changes for android_daily_stats schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722970 (https://phabricator.wikimedia.org/T286000) (owner: 10Sharvaniharan) [02:19:53] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Enable AddLink for next round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723517 (https://phabricator.wikimedia.org/T290011) (owner: 10Kosta Harlan) [03:00:36] PROBLEM - Apache HTTP on wtp1026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1940 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:32] RECOVERY - Apache HTTP on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:05:44] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10Papaul) p:05Triage→03Medium [03:07:02] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10Papaul) p:05Triage→03Medium [03:08:50] 10SRE, 10ops-codfw, 10DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Papaul) @Marostegui I will be back on site tomorrow. if you are available, I can ping you while onsite. Thank you. 
[03:51:25] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Mailing List for the Wikimedians of United Arab Emirates User Group - https://phabricator.wikimedia.org/T291769 (10Ladsgroup) a:03Ladsgroup The link to the UG: https://meta.wikimedia.org/wiki/Wikimedians_of_United_Arab_Emirates_User_Group [03:53:09] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Mailing List for the Wikimedians of United Arab Emirates User Group - https://phabricator.wikimedia.org/T291769 (10Ladsgroup) Do you want the mailing list to be public or private? [04:13:30] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:15:36] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [04:30:00] 10SRE, 10ops-eqiad, 10Platform Engineering, 10serviceops: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10Marostegui) p:05Triage→03Medium [04:30:20] 10SRE, 10ops-eqiad, 10Analytics: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (10Marostegui) p:05Triage→03Medium [04:43:26] (03PS1) 10Marostegui: es2021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/723807 (https://phabricator.wikimedia.org/T290327) [04:44:16] (03CR) 10Marostegui: [C: 03+2] es2021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/723807 (https://phabricator.wikimedia.org/T290327) (owner: 10Marostegui) [04:52:10] 10SRE, 10DBA, 10Traffic, 10User-Ladsgroup, 10Wikimedia-Incident: 2021-09-04 enwiki was down at 10:44 (UTC) - https://phabricator.wikimedia.org/T290379 (10Marostegui) 05Open→03Resolved The private task T290394 has been resolved after the patch to mitigate this was pushed. Closing this task too. 
[04:56:01] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Mailing List for the Wikimedians of United Arab Emirates User Group - https://phabricator.wikimedia.org/T291769 (10Vikoula5) Private ( for all who subscribed ) [04:58:28] (03PS1) 10Marostegui: maintain-views.yaml: Remove afl_filter from the view [puppet] - 10https://gerrit.wikimedia.org/r/723808 (https://phabricator.wikimedia.org/T291806) [04:59:41] (03CR) 10Marostegui: "Brooke I believe this is the only thing needed puppet-wise, and then re-create the views. Please feel free to merge and recreate the views" [puppet] - 10https://gerrit.wikimedia.org/r/723808 (https://phabricator.wikimedia.org/T291806) (owner: 10Marostegui) [05:18:55] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Mailing List for the Wikimedians of United Arab Emirates User Group - https://phabricator.wikimedia.org/T291769 (10Ladsgroup) 05Open→03Resolved Done. Create an account connected to these email addresses and you have access to the mailing list: https://l... [05:28:10] !log Remove flaggedimages from s2 T290340 [05:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:17] T290340: Drop the flaggedimages table from Wikimedia production - https://phabricator.wikimedia.org/T290340 [05:30:48] (03PS8) 10Rishabhbhat: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 (https://phabricator.wikimedia.org/T289752) [05:37:44] 10SRE: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371 (10E.botha) Im on mobile with no root privileges so cant use a code at all. 
how can I get jy public SSH --~~~ [05:39:27] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10Joe) [05:56:30] !log Drop labswiki from m5 T167973 [05:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:37] T167973: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 [06:07:17] !log upgrade php7.2 in eqiad - T291052 [06:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:23] T291052: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 [06:13:18] !log rolling restart php-fpm in eqiad - T291052 [06:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:24] T291052: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 [06:15:10] <_joe_> effie: I don't think you need it if you installed the upgrade [06:15:24] <_joe_> it should automatically restart php-fpm upon package install [06:15:55] <_joe_> Status: Up | Server Admin Log: https://w.wiki/6ah | This channel is logged at https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-operations/ | SRE Clinic Duty: _joe_ [06:16:06] <_joe_> err damn [06:16:13] <_joe_> better :P [06:18:45] <effie> sigh right, I wasnt thinking [06:19:00] (03CR) 10Nikerabbit: [C: 03+1] Add support for SectionTranslationTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720982 (https://phabricator.wikimedia.org/T290302) (owner: 10KartikMistry) [06:19:11] thanks [06:37:02] RECOVERY - snapshot of s3 in codfw on alert1001 is OK: Last snapshot for s3 at codfw (db2139.codfw.wmnet:3313) taken on 2021-09-27 03:16:14 (1135 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [06:51:57] 10SRE, 10serviceops, 10Patch-For-Review: Deploy PHP patch for DOM replaceChild/removeChild performance - 
https://phabricator.wikimedia.org/T291052 (10jijiki) 05Open→03Resolved a:03jijiki [06:54:12] (03PS1) 10Marostegui: production-m5.sql: Remove labswiki grants [puppet] - 10https://gerrit.wikimedia.org/r/723986 (https://phabricator.wikimedia.org/T167973) [06:57:50] (03CR) 10Marostegui: "I have cleaned up the grants and doing a re-check on all m5 hosts, I have found this:" [puppet] - 10https://gerrit.wikimedia.org/r/723986 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:07:30] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Use --force in shutdown [software/spicerack] - 10https://gerrit.wikimedia.org/r/723417 (owner: 10Muehlenhoff) [07:07:52] !log Remove flaggedimages from s3 T290340 [07:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:59] T290340: Drop the flaggedimages table from Wikimedia production - https://phabricator.wikimedia.org/T290340 [07:18:13] !log swift eqiad-prod: add weight to ms-be10[64-67] - T290546 [07:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:19] T290546: Put ms-be10[64-67] in service - https://phabricator.wikimedia.org/T290546 [07:22:00] (03CR) 10Muehlenhoff: [C: 03+2] profile::mail::mx: Remove OS checks [puppet] - 10https://gerrit.wikimedia.org/r/723487 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [07:35:27] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Prefer mx2001 over mx1001 for internal smarthosts" [puppet] - 10https://gerrit.wikimedia.org/r/723434 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [07:47:12] (03PS1) 10Muehlenhoff: Remove Hiera entries for screen/tmux monitoring [puppet] - 10https://gerrit.wikimedia.org/r/723988 (https://phabricator.wikimedia.org/T288028) [07:49:38] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] README.md: mention requirement for rsync [software/swift-ring] - 10https://gerrit.wikimedia.org/r/722927 (owner: 10MVernon) [07:52:51] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: 
set search-platform team [puppet] - 10https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [07:57:32] (03CR) 10Jcrespo: [C: 03+1] "The patch as is was correct- I checked if further puppet changes should be needed, and while I saw some issues with the way grants are han" [puppet] - 10https://gerrit.wikimedia.org/r/723986 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:58:09] (03CR) 10Jcrespo: [C: 03+1] "We removed the backups before, right? I cannot remember." [puppet] - 10https://gerrit.wikimedia.org/r/723986 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:58:26] (03CR) 10Marostegui: production-m5.sql: Remove labswiki grants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/723986 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:58:28] (03CR) 10Marostegui: [C: 03+2] production-m5.sql: Remove labswiki grants [puppet] - 10https://gerrit.wikimedia.org/r/723986 (https://phabricator.wikimedia.org/T167973) (owner: 10Marostegui) [07:59:34] PROBLEM - Apache HTTP on wtp1026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1940 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:01:38] RECOVERY - Apache HTTP on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:07:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [08:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:12] 10SRE, 10serviceops, 10Patch-For-Review: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (10Joe) Just as a side note, the number of timeouts on parsoid went from ~ 5k/day before the change to ~ 3k/day or less afterwards. That's a 40% decrease in the n... 
[08:13:46] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The following units failed: session-204006.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:15] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice job!" [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [08:14:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [08:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "David, you should have +2 on the repo. Feel free to merge at your leisure if that's the case, alerts will be deployed automatically at the" [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [08:17:19] (03CR) 10DCausse: "looks great" [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [08:18:14] (03CR) 10Filippo Giunchedi: [C: 04-1] "tab vs spaces but otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/723220 (https://phabricator.wikimedia.org/T257056) (owner: 10MVernon) [08:24:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet [08:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:38] (03PS2) 10Volans: remote, puppet: reduce logging verbosity [software/spicerack] - 10https://gerrit.wikimedia.org/r/723281 [08:26:40] (03PS1) 10Volans: dhcp: always require OS for option 82 config [software/spicerack] - 10https://gerrit.wikimedia.org/r/723991 [08:27:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet [08:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:31] (03PS1) 10Hashar: scap: automatize plugins handling 
[software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/723992 [08:29:38] (03PS4) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [08:30:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host copernicium.wikimedia.org [08:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:34] (03CR) 10Volans: [C: 03+2] remote, puppet: reduce logging verbosity [software/spicerack] - 10https://gerrit.wikimedia.org/r/723281 (owner: 10Volans) [08:34:43] (03CR) 10jerkins-bot: [V: 04-1] dhcp: always require OS for option 82 config [software/spicerack] - 10https://gerrit.wikimedia.org/r/723991 (owner: 10Volans) [08:35:09] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [08:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host copernicium.wikimedia.org [08:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:32] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [08:37:16] (03Merged) 10jenkins-bot: remote, puppet: reduce logging verbosity [software/spicerack] - 10https://gerrit.wikimedia.org/r/723281 (owner: 10Volans) [08:42:07] (03PS5) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [08:45:42] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:19] 10SRE-tools, 10Infrastructure-Foundations: Introduce Spicerack.kafka module, along with the method to transfer offset state between consumer groups and clusters - https://phabricator.wikimedia.org/T291681 (10Volans) @Zbyszko thanks a lot for being the first user of our new process as outlined in https://wikite... [08:48:35] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [08:51:52] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:06] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:47] (03CR) 10Jbond: [C: 03+1] admin: set krb attribute to 'present' for ema [puppet] - 10https://gerrit.wikimedia.org/r/723536 (owner: 10Ema) [09:04:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [09:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:24] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:07:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [09:07:34] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on people2002.codfw.wmnet with reason: reboot - T291813 [09:07:35] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on people2002.codfw.wmnet with reason: reboot - T291813 [09:07:39] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP-wmf for erayfield - 
https://phabricator.wikimedia.org/T291126 (10Joe) p:05Triage→03Medium [09:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:30] (03PS2) 10Volans: dhcp: always require OS for option 82 config [software/spicerack] - 10https://gerrit.wikimedia.org/r/723991 [09:09:19] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on people1003.eqiad.wmnet with reason: reboot - T291813 [09:09:21] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on people1003.eqiad.wmnet with reason: reboot - T291813 [09:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet [09:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:31] (03PS2) 10Volans: sre.experimental.reimage: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/723280 [09:10:33] (03PS1) 10Volans: sre.experimental.reimage: manage the DHCP records [cookbooks] - 10https://gerrit.wikimedia.org/r/723995 [09:11:05] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/723280 (owner: 10Volans) [09:12:06] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The following units failed: session-204062.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:11] (03PS1) 10Volans: install_server: standardize DHCP includes [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) [09:12:50] (03CR) 10jerkins-bot: [V: 04-1] install_server: standardize DHCP includes [puppet] - 
10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [09:13:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet [09:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:51] (03Merged) 10jenkins-bot: sre.experimental.reimage: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/723280 (owner: 10Volans) [09:17:37] (03PS1) 10Filippo Giunchedi: swift: fix metrics prefix for low latency [puppet] - 10https://gerrit.wikimedia.org/r/723998 (https://phabricator.wikimedia.org/T273673) [09:17:50] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on ldap-replica1003.wikimedia.org with reason: reboot - T291813 [09:17:52] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ldap-replica1003.wikimedia.org with reason: reboot - T291813 [09:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mx1001.wikimedia.org [09:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:42] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: fix metrics prefix for low latency [puppet] - 10https://gerrit.wikimedia.org/r/723998 (https://phabricator.wikimedia.org/T273673) (owner: 10Filippo Giunchedi) [09:22:28] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1001.eqiad.wmnet [09:22:30] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host thanos-fe1001.eqiad.wmnet [09:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:25] !log filippo@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host thanos-fe1001.eqiad.wmnet [09:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:15] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1004.wikimedia.org [09:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:40] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [09:24:41] (03PS2) 10Volans: install_server: standardize DHCP includes [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) [09:25:30] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove Hiera entries for screen/tmux monitoring [puppet] - 10https://gerrit.wikimedia.org/r/723988 (https://phabricator.wikimedia.org/T288028) (owner: 10Muehlenhoff) [09:27:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx1001.wikimedia.org [09:27:51] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on ldap-replica1004.wikimedia.org with reason: reboot - T291813 [09:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:52] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ldap-replica1004.wikimedia.org with reason: reboot - T291813 [09:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:22] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1001.eqiad.wmnet [09:29:28] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:43] !log systemctl reset-failed networking T273026 [09:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:49] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [09:30:09] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1002.eqiad.wmnet [09:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:56] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on ldap-replica2005.wikimedia.org with reason: reboot - T291813 [09:30:59] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ldap-replica2005.wikimedia.org with reason: reboot - T291813 [09:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:32] RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:17] (03PS1) 10David Caro: ldap::sssd: Don't specify services on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) [09:32:19] (03PS1) 10David Caro: ldap::sssd: remove unused parameter ldapincludes [puppet] - 10https://gerrit.wikimedia.org/r/724004 [09:33:06] (03CR) 10jerkins-bot: [V: 04-1] ldap::sssd: Don't specify services on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) (owner: 10David Caro) [09:33:58] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on ldap-replica2006.wikimedia.org with reason: reboot - T291813 [09:34:00] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 0:10:00 on ldap-replica2006.wikimedia.org with reason: reboot - T291813 [09:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:29] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1002.eqiad.wmnet [09:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:48] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [09:37:07] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1003.eqiad.wmnet [09:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:06] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1004.wikimedia.org [09:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:52] !log Optimize table commonswiki.image on codfw (s4 will show lag) - T288273 [09:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:58] T288273: Please optimize image table in commonswiki - https://phabricator.wikimedia.org/T288273 [09:40:12] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10hnowlan) a:03hnowlan [09:40:39] (03PS2) 10David Caro: ldap::sssd: Don't specify services on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) [09:40:41] (03PS2) 10David Caro: ldap::sssd: remove unused parameter ldapincludes [puppet] - 10https://gerrit.wikimedia.org/r/724004 [09:42:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mx2001.wikimedia.org [09:42:27] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:01] (03CR) 10David Caro: "Tested this on tools-package-builder-04, by disabling puppet, then changing the config:" [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) (owner: 10David Caro) [09:43:03] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1003.eqiad.wmnet [09:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:20] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2001.codfw.wmnet [09:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:40] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: copy metadata for alert rules on deploy [puppet] - 10https://gerrit.wikimedia.org/r/720243 (owner: 10Filippo Giunchedi) [09:45:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mx2001.wikimedia.org [09:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:56] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] "Indeed, thank you for the ping and apologies for the delay!" 
[puppet] - 10https://gerrit.wikimedia.org/r/715597 (https://phabricator.wikimedia.org/T288806) (owner: 10Zabe) [09:50:52] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2001.codfw.wmnet [09:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:06] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2002.codfw.wmnet [09:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:04] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet [09:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:43] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/723223 (https://phabricator.wikimedia.org/T257056) (owner: 10MVernon) [09:55:56] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2002.codfw.wmnet [09:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:07] (03PS1) 10Arturo Borrero Gonzalez: openstack: trove: enable service by default [puppet] - 10https://gerrit.wikimedia.org/r/724008 (https://phabricator.wikimedia.org/T291446) [09:56:21] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:47] (03CR) 10Jbond: [C: 03+1] "LGTM, question but feel free to ping and answer on irc" [cookbooks] - 10https://gerrit.wikimedia.org/r/723995 (owner: 10Volans) [09:58:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/723991 (owner: 10Volans) [09:58:52] (03CR) 10David Caro: [C: 03+1] openstack: trove: enable service by default [puppet] - 10https://gerrit.wikimedia.org/r/724008 (https://phabricator.wikimedia.org/T291446) (owner: 10Arturo Borrero Gonzalez) [10:01:40] (03PS2) 10Arturo 
Borrero Gonzalez: openstack: trove: enable service by default [puppet] - 10https://gerrit.wikimedia.org/r/724008 (https://phabricator.wikimedia.org/T291446) [10:02:21] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1012 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:02:24] (03CR) 10Jbond: [C: 03+1] sre.experimental.reimage: manage the DHCP records (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/723995 (owner: 10Volans) [10:02:25] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2002.codfw.wmnet [10:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/31291/" [puppet] - 10https://gerrit.wikimedia.org/r/724008 (https://phabricator.wikimedia.org/T291446) (owner: 10Arturo Borrero Gonzalez) [10:03:43] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:04:18] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:03] (03PS16) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [10:05:05] (03PS1) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [10:05:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [10:06:39] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[10:08:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, let's try!" [puppet] - 10https://gerrit.wikimedia.org/r/720921 (https://phabricator.wikimedia.org/T290870) (owner: 10Ema) [10:08:59] (03CR) 10Volans: "Compiler looks happy, I'll plan to do the cleanup manually to make it simpler:" [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [10:09:07] (03PS2) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [10:09:21] (03CR) 10Jbond: [C: 03+1] install_server: standardize DHCP includes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [10:09:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove Hiera entries for screen/tmux monitoring [puppet] - 10https://gerrit.wikimedia.org/r/723988 (https://phabricator.wikimedia.org/T288028) (owner: 10Muehlenhoff) [10:09:59] (03CR) 10DCausse: Added spicerack.kafka with offset transfer function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [10:10:12] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31294/console" [puppet] - 10https://gerrit.wikimedia.org/r/724015 (owner: 10Giuseppe Lavagetto) [10:13:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (Hiera parts are already done)" [puppet] - 10https://gerrit.wikimedia.org/r/723543 (https://phabricator.wikimedia.org/T288028) (owner: 10Jbond) [10:17:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 cause this makes logs a bit quieter, but it's not like the IdleConnection check is any better, for the same reasons." 
[puppet] - 10https://gerrit.wikimedia.org/r/722278 (owner: 10Giuseppe Lavagetto) [10:17:23] (03PS3) 10Jbond: monitoring: drop monitor_screens parameter [puppet] - 10https://gerrit.wikimedia.org/r/723543 (https://phabricator.wikimedia.org/T288028) [10:18:02] <_joe_> akosiaris: actually idleconnection will tell us if a node is unresponsive/down and thus doesn't even expose the nodeport anymore [10:18:04] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10hnowlan) Given that this host is a single rack and the /srv/ partition is on an array and not using JBOD I think this host can just be shut down and the disk be replaced. [10:18:07] (03PS1) 10Effie Mouzeli: fpm-multiversion-base: trigger image rebuild [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/724017 [10:18:08] <_joe_> that's the only reason to keep it [10:18:17] (03PS4) 10Jbond: monitoring: drop monitor_screens parameter [puppet] - 10https://gerrit.wikimedia.org/r/723543 (https://phabricator.wikimedia.org/T288028) [10:19:34] _joe_: nope. It will only tell you if a pod is unresponsive/down. a pod that may very well NOT be on the node [10:20:43] it will also fail if a node is down, but it will very well fail if the specific pod the IdleConnection is to goes down. 
[10:21:34] <_joe_> right [10:21:49] <_joe_> I didn't mean that it won't also happen in that case [10:22:47] (03CR) 10Jbond: [C: 03+2] monitoring: drop monitor_screens parameter [puppet] - 10https://gerrit.wikimedia.org/r/723543 (https://phabricator.wikimedia.org/T288028) (owner: 10Jbond) [10:24:17] (03PS13) 10Jbond: spec tests: drop pre_conditions as its not needed [puppet] - 10https://gerrit.wikimedia.org/r/723515 [10:24:25] (03PS3) 10Volans: install_server: standardize DHCP includes [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) [10:24:40] (03CR) 10Volans: "addressed" [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [10:25:19] (03CR) 10jerkins-bot: [V: 04-1] install_server: standardize DHCP includes [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [10:29:07] 10SRE, 10Observability-Alerting, 10Patch-For-Review: Remove the "Long running screen/tmux" Icinga check - https://phabricator.wikimedia.org/T288028 (10MoritzMuehlenhoff) 05In progress→03Resolved Check is now gone. [10:29:14] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2003.codfw.wmnet [10:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:56] (03PS4) 10Jbond: install_server: standardize DHCP includes [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [10:30:04] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T1030). 
[10:32:32] (03PS5) 10Volans: install_server: standardize DHCP includes [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) [10:35:02] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2003.codfw.wmnet [10:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ldap::sssd: Don't specify services on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) (owner: 10David Caro) [10:42:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ldap::sssd: Don't specify services on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) (owner: 10David Caro) [10:43:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ldap::sssd: remove unused parameter ldapincludes [puppet] - 10https://gerrit.wikimedia.org/r/724004 (owner: 10David Caro) [10:44:17] (03CR) 10Volans: "Compiler still happy:" [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [10:53:00] (03PS17) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [10:53:02] (03PS3) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [10:59:29] PROBLEM - Disk space on aqs1007 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra-a 107546 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1007&var-datasource=eqiad+prometheus/ops [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T1100). 
[11:00:05] No Gerrit patches in the queue for this window AFAICS. [11:00:13] :( [11:02:08] (03PS1) 10Jbond: wmflib::dir::mkdir_p: fix issue cause by trailing slash [puppet] - 10https://gerrit.wikimedia.org/r/724025 [11:02:20] !log disabling puppet on install hosts to deploy 723996 - T221388 [11:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:26] T221388: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 [11:02:34] (03CR) 10Volans: [C: 03+2] install_server: standardize DHCP includes [puppet] - 10https://gerrit.wikimedia.org/r/723996 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [11:07:57] (03PS18) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [11:09:31] !log re-enabled puppet on install hosts after deployment of g/723996 - T221388 [11:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:37] T221388: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 [11:12:59] (03CR) 10Jbond: [C: 03+2] spec tests: drop pre_conditions as its not needed [puppet] - 10https://gerrit.wikimedia.org/r/723515 (owner: 10Jbond) [11:13:31] (03PS13) 10Jbond: C:base::monitoring::host: Add type definitions [puppet] - 10https://gerrit.wikimedia.org/r/723494 [11:14:00] (03PS19) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [11:14:15] (03PS14) 10Jbond: C:base::monitoring::host: Add type definitions [puppet] - 10https://gerrit.wikimedia.org/r/723494 [11:15:55] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/724035 (owner: 10L10n-bot) [11:19:41] (03CR) 10Volans: [C: 03+2] dhcp: always require OS for option 82 config [software/spicerack] - 10https://gerrit.wikimedia.org/r/723991 (owner: 10Volans) [11:25:04] (03Merged) 10jenkins-bot: dhcp: always require OS for option 82 config [software/spicerack] - 10https://gerrit.wikimedia.org/r/723991 (owner: 10Volans) [11:26:15] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/724025 (owner: 10Jbond) [11:26:26] (03PS20) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [11:26:47] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/724035 (owner: 10L10n-bot) [11:26:56] (03CR) 10jerkins-bot: [V: 04-1] profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 (owner: 10Giuseppe Lavagetto) [11:28:02] (03PS21) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [11:31:23] (03PS1) 10Muehlenhoff: Add microsite for OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724042 [11:31:28] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31303/console" [puppet] - 10https://gerrit.wikimedia.org/r/723419 (owner: 10Giuseppe Lavagetto) [11:40:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724042 (owner: 10Muehlenhoff) [11:41:22] (03CR) 10Alexandros Kosiaris: [C: 03+1] ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 (owner: 10Giuseppe Lavagetto) [11:43:25] !log Turn off es2021 for onsite maintenance T290327 [11:43:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:30] T290327: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 [11:45:10] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10akosiaris) From the #SRE side, we 've built and support * https://docker-registry.wikimedia.org/nodejs12-devel/tags/ * https://docker-registry.wikimedia.org/nodejs12-slim/tags/ They are b... [11:45:30] !log Upgrade es4 in codfw to 10.4.21 [11:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:53] 10SRE, 10ops-codfw, 10DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Marostegui) @papaul es2021 is now off and ready for you. I have also upgraded the replicas to 10.4.21. [11:52:02] (03PS1) 10Muehlenhoff: Add os-reports.w.o to caches config [puppet] - 10https://gerrit.wikimedia.org/r/724044 [11:53:25] (03PS1) 10Muehlenhoff: Add DNS record for os-reports.w.o [dns] - 10https://gerrit.wikimedia.org/r/724045 [11:56:38] (03CR) 10Jbond: "https://puppet-compiler.wmflabs.org/compiler1002/31300/" [puppet] - 10https://gerrit.wikimedia.org/r/723494 (owner: 10Jbond) [11:58:01] (03Abandoned) 10Jbond: profile::sysctl: add ability to control ip_forward: [puppet] - 10https://gerrit.wikimedia.org/r/715217 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [12:00:20] (03PS28) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) [12:02:55] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [12:05:10] (03PS1) 10Majavah: mediawiki: Redirect Special:CodeReview to static archives [puppet] 
- 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) [12:06:43] (03PS6) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [12:07:20] (03PS7) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [12:11:18] (03PS6) 10DCausse: search-platform: add flink alerts [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467) [12:11:27] (03CR) 10DCausse: [C: 03+1] "Thanks for the review" [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [12:13:34] (03PS29) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) [12:14:56] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [12:15:04] (03PS2) 10Jbond: wmflib::dir::mkdir_p: fix issue cause by trailing slash [puppet] - 10https://gerrit.wikimedia.org/r/724025 [12:15:17] (03CR) 10Jbond: "thx" [puppet] - 10https://gerrit.wikimedia.org/r/724025 (owner: 10Jbond) [12:17:20] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [12:17:29] (03CR) 10Jbond: [C: 03+2] C:base::monitoring::host: Add type definitions [puppet] - 10https://gerrit.wikimedia.org/r/723494 (owner: 10Jbond) [12:18:54] (03PS15) 10Jbond: P:base: make notifications_enabled a boolean [puppet] - 10https://gerrit.wikimedia.org/r/723509 
(https://phabricator.wikimedia.org/T289661) [12:20:52] (03PS9) 10Jbond: P:base: move base::monitoring::host to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/723544 [12:21:32] (03CR) 10jerkins-bot: [V: 04-1] P:base: move base::monitoring::host to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/723544 (owner: 10Jbond) [12:22:42] (03PS16) 10Jbond: P:base: make notifications_enabled a boolean [puppet] - 10https://gerrit.wikimedia.org/r/723509 (https://phabricator.wikimedia.org/T289661) [12:23:28] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/724025 (owner: 10Jbond) [12:23:43] (03PS10) 10Jbond: P:base: move base::monitoring::host to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/723544 [12:24:20] (03CR) 10jerkins-bot: [V: 04-1] P:base: move base::monitoring::host to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/723544 (owner: 10Jbond) [12:24:26] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/723509 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:24:59] (03PS2) 10Muehlenhoff: acmechief: Remove mx2002 [puppet] - 10https://gerrit.wikimedia.org/r/723422 (https://phabricator.wikimedia.org/T286911) [12:25:38] (03PS2) 10Muehlenhoff: Configure a few domains with equal weights for mx1001/mx2001 [dns] - 10https://gerrit.wikimedia.org/r/723473 (https://phabricator.wikimedia.org/T286911) [12:28:08] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/724042 (owner: 10Muehlenhoff) [12:29:12] (03CR) 10Muehlenhoff: [C: 03+2] Configure a few domains with equal weights for mx1001/mx2001 [dns] - 10https://gerrit.wikimedia.org/r/723473 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [12:30:54] (03CR) 10David Caro: [C: 03+1] "LGTM, feel free to ignore any nits." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/723761 (https://phabricator.wikimedia.org/T276626) (owner: 10Majavah) [12:32:41] (03CR) 10Michael DiPietro: "https://puppet-compiler.wmflabs.org/compiler1002/31309/" [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro) [12:32:44] (03CR) 10Michael DiPietro: [C: 03+2] create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro) [12:40:35] (03PS2) 10Muehlenhoff: Add microsite for OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724042 [12:41:07] (03CR) 10Muehlenhoff: Add microsite for OS reports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724042 (owner: 10Muehlenhoff) [12:41:09] (03CR) 10Jbond: [C: 03+1] Add microsite for OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724042 (owner: 10Muehlenhoff) [12:41:34] (03CR) 10Jbond: [C: 03+1] Add DNS record for os-reports.w.o [dns] - 10https://gerrit.wikimedia.org/r/724045 (owner: 10Muehlenhoff) [12:42:02] (03CR) 10Jbond: [C: 03+2] wmflib::dir::mkdir_p: fix issue cause by trailing slash [puppet] - 10https://gerrit.wikimedia.org/r/724025 (owner: 10Jbond) [12:44:24] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, and 2 others: LLDP: Ganeti hosts dont correctly report lldp_parent - https://phabricator.wikimedia.org/T289679 (10jbond) 05Open→03Resolved [12:46:40] (03PS1) 10Muehlenhoff: microsites: Switch to wmflib::dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/724053 [12:47:21] (03PS3) 10Muehlenhoff: Revert "Prefer codfw wiki smarthost over eqiad one for mx1001 reimage" [puppet] - 10https://gerrit.wikimedia.org/r/723433 (https://phabricator.wikimedia.org/T286911) [12:50:08] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Prefer codfw wiki smarthost over eqiad one for mx1001 reimage" [puppet] - 10https://gerrit.wikimedia.org/r/723433 
(https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [12:54:57] (03PS1) 10Filippo Giunchedi: pontoon: set diamond::remove to true [puppet] - 10https://gerrit.wikimedia.org/r/724056 [12:54:59] (03PS1) 10Filippo Giunchedi: pontoon: set cloud nameservers explicitly [puppet] - 10https://gerrit.wikimedia.org/r/724057 [12:56:34] (03CR) 10Jbond: [C: 03+1] "Sorry for missing this" [puppet] - 10https://gerrit.wikimedia.org/r/724057 (owner: 10Filippo Giunchedi) [12:56:51] (03CR) 10Jbond: [C: 03+1] pontoon: set diamond::remove to true [puppet] - 10https://gerrit.wikimedia.org/r/724056 (owner: 10Filippo Giunchedi) [12:57:23] (03CR) 10Jbond: [C: 03+1] Add os-reports.w.o to caches config [puppet] - 10https://gerrit.wikimedia.org/r/724044 (owner: 10Muehlenhoff) [13:01:19] (03CR) 10Jelto: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/724057 (owner: 10Filippo Giunchedi) [13:01:32] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/724056 (owner: 10Filippo Giunchedi) [13:03:36] (03PS22) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [13:03:48] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: set diamond::remove to true [puppet] - 10https://gerrit.wikimedia.org/r/724056 (owner: 10Filippo Giunchedi) [13:04:30] (03CR) 10Filippo Giunchedi: [C: 03+2] "No worries John! 
Thanks for the quick review both" [puppet] - 10https://gerrit.wikimedia.org/r/724057 (owner: 10Filippo Giunchedi) [13:04:45] (03PS2) 10Filippo Giunchedi: pontoon: set cloud nameservers explicitly [puppet] - 10https://gerrit.wikimedia.org/r/724057 [13:04:52] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31310/console" [puppet] - 10https://gerrit.wikimedia.org/r/723419 (owner: 10Giuseppe Lavagetto) [13:05:31] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:07:31] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:25:32] dcausse: hi, I'm interested in debugging why you can't merge https://gerrit.wikimedia.org/r/c/operations/alerts/+/720066, do you have some time now to look? the 'wmf' ldap group has 'submit' privileges to the repo so afaics it should work, what options show up for you if you click 'reply' ? [13:25:59] godog: sure, looking [13:26:32] godog: submit needs CR+2/V+2 on the patch [13:26:35] (03PS1) 10Hnowlan: cassandra: use FQDN in CN name for future instances [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) [13:26:50] I see CR -1, 0, +1 [13:27:08] dcausse: thanks! ok looking [13:27:09] if you have jenkins on the repo, you'll want to give access to the code-review label (-2...+2) instead of "submit", which is the "bypass jenkins and merge immediately" button [13:27:29] majavah: thank you, I'll try that now [13:27:34] (03PS2) 10Hnowlan: cassandra: use FQDN in CN name for future instances [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) [13:28:17] dcausse: is it better now? -2/+2 should show up [13:28:39] godog: yes, shipping, thanks!! 
[13:28:43] (03CR) 10DCausse: [C: 03+2] search-platform: add flink alerts [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [13:28:47] woot woot [13:29:05] thank you majavah [13:29:22] ok so jenkins should do its thing now mmhh and then submit the change? [13:29:36] dcausse: does gerrit let you submit the patch now ? [13:30:13] yeah, jenkins will auto merge after running tests on +2 [13:30:14] I think it needs V+2 before showing me this option, but jenkins is running [13:30:43] I can't force V+2 (which is good I think) [13:30:58] (this is the preferred way to do things, but few sre repos are special and require manual submit) [13:31:10] ack, yeah that makes sense [13:31:20] I added the ability to force verified too, just in case [13:31:49] ok so now jenkins has voted +2 and you should be able to submit dcausse [13:32:36] yep I see the submit button [13:33:10] I guess jenkins is missing access to submit [13:33:16] sweet, yeah after submit puppet will take care of the rest at the next puppet run [13:33:16] (waiting for jenkins to submit) [13:33:36] mmhh which group should I add with submit permissions ? 
[13:33:50] jenkinsbot I suppose [13:34:08] I guess so [13:34:25] ok trying again [13:34:40] (03CR) 10Filippo Giunchedi: [C: 03+2] search-platform: add flink alerts [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [13:35:26] (03PS5) 10Gehel: Add kafka clusters' brokers to spicerack config [puppet] - 10https://gerrit.wikimedia.org/r/721857 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:35:53] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@04d2df4]: tegola: use eqiad discovery endpoin [13:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:05] (03CR) 10Herron: [C: 03+1] Configure remaining domains with equal weights for mx1001/mx2001 [dns] - 10https://gerrit.wikimedia.org/r/723482 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [13:36:08] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@04d2df4]: tegola: use eqiad discovery endpoin (duration: 00m 15s) [13:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:22] ^ nemo-yiannis [13:36:41] (03PS3) 10Daimona Eaytoy: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) [13:37:11] (03Merged) 10jenkins-bot: search-platform: add flink alerts [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [13:37:13] (03CR) 10Daimona Eaytoy: "Ping? Seems like CSP is no longer report-only, so it's erroring now." [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy) [13:38:03] \o/ it worked, thanks dcausse and majavah [13:38:03] godog: thanks! 
(this new AM system is really great to setup) [13:38:18] (03CR) 10Gehel: [C: 03+2] Add kafka clusters' brokers to spicerack config [puppet] - 10https://gerrit.wikimedia.org/r/721857 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:38:44] dcausse: cheers! I'm glad as that was one of the main goals, thank you for the feedback it is super useful [13:42:02] (03PS3) 10Hnowlan: cassandra: use FQDN in CN name for future instances [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) [13:43:23] (03CR) 10Muehlenhoff: [C: 03+2] Add microsite for OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724042 (owner: 10Muehlenhoff) [13:46:05] (03PS1) 10Urbanecm: updateMenteeData: Do not calculate last edit timestamp twice [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723781 (https://phabricator.wikimedia.org/T290609) [13:46:23] (03PS1) 10Urbanecm: updateMenteeData.php: Decrease number of write queries made [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723782 (https://phabricator.wikimedia.org/T291658) [13:46:48] jouncebot: nowandnext [13:46:48] No deployments scheduled for the next 3 hour(s) and 13 minute(s) [13:46:48] In 3 hour(s) and 13 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T1700) [13:47:10] (03CR) 10Urbanecm: [C: 03+2] updateMenteeData: Do not calculate last edit timestamp twice [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723781 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [13:47:17] (03CR) 10Urbanecm: [C: 03+2] updateMenteeData.php: Decrease number of write queries made [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723782 (https://phabricator.wikimedia.org/T291658) (owner: 10Urbanecm) [13:48:49] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on 
namespace 'eventgate-logging-external' for release 'production' . [13:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:10] (03CR) 10Herron: [V: 03+2 C: 03+2] Revert "Revert "slo_dashboard: switch etcd request slo query to recording rule metrics"" (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/722904 (owner: 10Herron) [13:52:25] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:55] !log otto@puppetmaster1001 conftool action : set/ttl=10; selector: dnsdisc=eventgate-logging-external [13:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:34] !beginning re-deploy of eventgate-logging-external - https://phabricator.wikimedia.org/T291504#7380252 [13:58:40] !log beginning re-deploy of eventgate-logging-external - https://phabricator.wikimedia.org/T291504#7380252 [13:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:01] !log otto@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-logging-external,name=codfw [13:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:48] (03CR) 10Muehlenhoff: [C: 03+2] Add os-reports.w.o to caches config [puppet] - 10https://gerrit.wikimedia.org/r/724044 (owner: 10Muehlenhoff) [14:00:56] (03PS2) 10Muehlenhoff: Add os-reports.w.o to caches config [puppet] - 10https://gerrit.wikimedia.org/r/724044 [14:03:44] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] fpm-multiversion-base: trigger image rebuild [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/724017 (owner: 10Effie Mouzeli) [14:07:05] (03CR) 10jerkins-bot: [V: 04-1] updateMenteeData.php: Decrease number of write queries made [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 
10https://gerrit.wikimedia.org/r/723782 (https://phabricator.wikimedia.org/T291658) (owner: 10Urbanecm) [14:07:10] :( [14:07:56] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "tests broken in wmf branch, master branch passes, low-risk patch" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723782 (https://phabricator.wikimedia.org/T291658) (owner: 10Urbanecm) [14:10:45] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:11:13] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:11:13] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:11:20] 10SRE, 10ops-codfw, 10DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Papaul) DIMM A1 swapped with DIMM B2 leaving the task open for now to monitoring if we will see the issue on DIMM B1 Thanks [14:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:30] (03CR) 10jerkins-bot: [V: 04-1] updateMenteeData: Do not calculate last edit timestamp twice [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723781 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [14:11:41] (03PS2) 10Urbanecm: updateMenteeData: Do not calculate last edit timestamp twice [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723781 (https://phabricator.wikimedia.org/T290609) [14:11:52] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "tests broken in wmf branch, master branch passes, low-risk patch" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723781 
(https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [14:12:15] PROBLEM - LVS eventgate-logging-external codfw port 4392/tcp - EventGate logging endpoint- eventgate-logging-external.svc.codfw.wmnet and intake-logging.wikimedia.org IPv4 on eventgate-logging-external.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.50 and port 4392: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:12:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventgate_logging_external_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:12:45] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-logging-external_4392: Servers kubernetes2004.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wi [14:13:18] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-logging-external_4392: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wi [14:14:44] !log otto@deploy1002 conftool action : set/pooled=true; selector: dnsdisc=eventgate-logging-external,name=codfw [14:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:25] (03PS1) 10David Caro: base: Add test and fix notification condition [puppet] - 10https://gerrit.wikimedia.org/r/724074 [14:15:52] ok looks like i 
need to silence some alerts in there [14:16:07] (03CR) 10David Caro: P:base: make notifications_enabled a boolean (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/723509 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:16:12] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10akosiaris) Couple of more points: * https://grafana.wikimedia.org/goto/RY75JPHnz points out that wikifeeds envoy did indeed see the errors. In both downstream (the l... [14:16:35] (03CR) 10jerkins-bot: [V: 04-1] base: Add test and fix notification condition [puppet] - 10https://gerrit.wikimedia.org/r/724074 (owner: 10David Caro) [14:18:36] 10SRE, 10ops-codfw, 10DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Marostegui) Thanks @Papaul - mysql started. Let's give it a week and if nothing arises, let's close it [14:18:49] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10Papaul) ` Clear Log Save As Mon Sep 27 2021 14:13:45 The system board 5V SWITCH PG voltage is outside of range. Mon Sep 27 2021 14:13:37 The system board fail-safe voltage is outside... [14:21:33] (testing my patches above takes some time, still deploying) [14:24:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:04] dcausse: ah yeah there's a problem (obvious in hindsight) with WdqsStreamingUpdaterFlinkJobNotRunning [14:30:11] I'll have to think a little what's the best way to approach the problem [14:30:37] !log otto@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-logging-external,name=codfw [14:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:45] godog: saw the message but not sure what's the problem other than the mail being moderated [14:32:33] dcausse: yeah the moderation is one, allowing the sender I think should be enough [14:33:02] dcausse: the other problem is using absent() which obviously fails on instances/sites where wdqs isn't deployed, i.e. https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [14:33:35] oh I see [14:33:45] !log volker-e@deploy1002 Started deploy [design/style-guide@9b3b0fb]: Deploy design/style-guide: 9b3b0fb “Apps”: Fix typos and unify orthography (#491) [14:33:51] !log volker-e@deploy1002 Finished deploy [design/style-guide@9b3b0fb]: Deploy design/style-guide: 9b3b0fb “Apps”: Fix typos and unify orthography (#491) (duration: 00m 06s) [14:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:33] (03PS2) 10David Caro: base: Add test and fix notification condition [puppet] - 10https://gerrit.wikimedia.org/r/724074 [14:35:31] (03PS2) 10Muehlenhoff: Add DNS record for os-reports.w.o [dns] - 10https://gerrit.wikimedia.org/r/724045 [14:36:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:55] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.1/extensions/GrowthExperiments/: 08f1e73: 3b154db: GrowthExperiments backports (T290609, T291658) (duration: 00m 58s) [14:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:01] T291658: updateMenteeData.php submits a lot of DB queries - https://phabricator.wikimedia.org/T291658 [14:37:02] T290609: Make mentee overview module's updateMenteeData.php scale better - https://phabricator.wikimedia.org/T290609 [14:37:52] yeah I think we'll probably have to be able to select the specific instance and the site, sigh [14:38:02] !log otto@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-logging-external,name=codfw [14:38:04] I'll update T289662 [14:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:07] T289662: Add ability to select with site-local Prometheus instance to deploy alerts - https://phabricator.wikimedia.org/T289662 [14:38:46] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS record for os-reports.w.o [dns] - 10https://gerrit.wikimedia.org/r/724045 (owner: 10Muehlenhoff) [14:38:59] mmhh or maybe only the instance really [14:39:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:22] !log /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/updateMenteeData.php --statsd # measuring time backports saved [14:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:45] (03PS1) 10Muehlenhoff: Setup systemd timer to sync OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724081 [14:44:27] (03CR) 10jerkins-bot: [V: 04-1] Setup systemd timer to sync OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724081 (owner: 10Muehlenhoff) [14:45:37] (03PS17) 10Jbond: P:base: make notifications_enabled a boolean [puppet] - 10https://gerrit.wikimedia.org/r/723509 (https://phabricator.wikimedia.org/T289661) [14:46:05] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/723509 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:46:22] (03PS8) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [14:48:42] (03CR) 10Jbond: sre: convert the generic reboot functions to the cookbook class API (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [14:48:44] (03PS2) 10Muehlenhoff: Setup systemd timer to sync OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724081 [14:50:02] (03CR) 10jerkins-bot: [V: 04-1] Setup systemd timer to sync OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724081 (owner: 10Muehlenhoff) [14:52:41] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [14:56:42] (03PS3) 10Muehlenhoff: Setup systemd timer to sync OS reports [puppet] - 
10https://gerrit.wikimedia.org/r/724081 [14:58:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724081 (owner: 10Muehlenhoff) [14:58:49] (03PS9) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [15:04:00] (03PS4) 10Muehlenhoff: Setup systemd timer to sync OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724081 [15:04:49] (03CR) 10jerkins-bot: [V: 04-1] Setup systemd timer to sync OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724081 (owner: 10Muehlenhoff) [15:05:34] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [15:06:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Cmjohnson) [15:06:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Cmjohnson) all firmware updated [15:07:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10Cmjohnson) [15:07:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10Cmjohnson) all firmware updated [15:07:58] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10Cmjohnson) [15:08:03] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 
10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10Cmjohnson) all firmware updated [15:08:13] (03PS5) 10Muehlenhoff: Setup systemd timer to sync OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724081 [15:11:02] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724081 (owner: 10Muehlenhoff) [15:16:56] (03PS6) 10Muehlenhoff: Setup systemd timer to sync OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724081 [15:20:37] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [15:21:02] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash::input::gelf: add host param [puppet] - 10https://gerrit.wikimedia.org/r/721346 (owner: 10Herron) [15:21:17] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add udp output module [puppet] - 10https://gerrit.wikimedia.org/r/721356 (owner: 10Herron) [15:21:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724081 (owner: 10Muehlenhoff) [15:22:03] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: make jmx_ params optional [puppet] - 10https://gerrit.wikimedia.org/r/721370 (owner: 10Herron) [15:27:36] I seem to be having issues connecting to esams, anything known going on at the moment? 
[15:27:50] I have some issues too [15:27:52] eqiad, codfw, eqsin, ulsfo work [15:28:12] now it’s back [15:29:33] yeah, now it’s fully working for me again [15:29:46] (03PS4) 10Hnowlan: cassandra: use FQDN in CN name for future instances [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) [15:31:17] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 451 probes of 709 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:31:35] (03CR) 10ZPapierski: [C: 03+1] query service: Fix loading of DCAT-AP dataset [puppet] - 10https://gerrit.wikimedia.org/r/720746 (https://phabricator.wikimedia.org/T289517) (owner: 10DCausse) [15:32:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Nice job" [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [15:32:18] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 452 probes of 626 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:32:23] (03PS10) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [15:32:47] (03CR) 10Filippo Giunchedi: [C: 03+1] opensearch_dashboards: fork kibana module into opensearch_dashboards module [puppet] - 10https://gerrit.wikimedia.org/r/721385 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [15:32:56] ^ first time I see atlas alerts to match reality at all [15:33:08] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: fork icinga::monitor::elasticsearch::base_checks [puppet] - 10https://gerrit.wikimedia.org/r/721386 
(https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [15:33:16] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: investigate caching of mailman listinfo pages - https://phabricator.wikimedia.org/T197819 (10Legoktm) 05Open→03Resolved This doesn't appear to be an issue with Mailman3 anymore. [15:33:31] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: fork elasticsearch profile into opensearch::server [puppet] - 10https://gerrit.wikimedia.org/r/721388 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [15:33:48] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: fork elasticsearch base_checks for opensearch [puppet] - 10https://gerrit.wikimedia.org/r/721389 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [15:33:58] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989 (10Legoktm) [15:34:14] (03PS30) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) [15:34:18] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: fork elasticsearch::logstash into opensearch::logstash [puppet] - 10https://gerrit.wikimedia.org/r/721395 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [15:34:34] (03CR) 10Jbond: "I have resolved all comments, i have a few more minor fixes to make before asking for additnal review" [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [15:35:22] Grafana dashboard also agrees that esams already recovered [15:36:32] 10SRE, 10Internet-Archive, 10Wikimedia-Mailing-lists: Consider allowing mailing lists to be indexed by archive.org - https://phabricator.wikimedia.org/T193573 (10Legoktm) 05Open→03Resolved Semi-intentionally, the robots.txt for Mailman2 wasn't ported over to Mailman3, we'll see 
how this goes now. I think... [15:37:18] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 6 probes of 709 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:38:23] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 40 probes of 626 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:42:41] (03PS1) 10Legoktm: mailman: Redirect /mailman/subscribe/$listname URLs too [puppet] - 10https://gerrit.wikimedia.org/r/724087 (https://phabricator.wikimedia.org/T167900) [15:46:21] (03PS1) 10Giuseppe Lavagetto: remove tokens for production services from CI. [labs/private] - 10https://gerrit.wikimedia.org/r/724088 [15:48:20] (03CR) 10Legoktm: [C: 03+2] admin: Deprecate mailman-admins group [puppet] - 10https://gerrit.wikimedia.org/r/723673 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [15:48:31] (03CR) 10Legoktm: [C: 03+2] mailman: More mailman2 clean ups [puppet] - 10https://gerrit.wikimedia.org/r/723674 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [15:52:04] (03CR) 10Legoktm: "I manually deleted /etc/apache2/arbcom-l.htdigest off of lists1001 and removed it from private puppet too." [puppet] - 10https://gerrit.wikimedia.org/r/723674 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [15:55:16] (03CR) 10Jbond: [C: 03+1] "LGTM but see comment" [puppet] - 10https://gerrit.wikimedia.org/r/724081 (owner: 10Muehlenhoff) [15:59:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] remove tokens for production services from CI. [labs/private] - 10https://gerrit.wikimedia.org/r/724088 (owner: 10Giuseppe Lavagetto) [15:59:26] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] remove tokens for production services from CI. 
[labs/private] - 10https://gerrit.wikimedia.org/r/724088 (owner: 10Giuseppe Lavagetto) [15:59:44] (03PS23) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [15:59:46] (03PS4) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [16:00:58] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31315/console" [puppet] - 10https://gerrit.wikimedia.org/r/724015 (owner: 10Giuseppe Lavagetto) [16:01:12] (03PS1) 10Ottomata: Revert "Eventgate: Symlink _helpers and _tls_helpers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/723785 [16:01:39] jouncebot: nowandnext [16:01:39] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [16:01:39] In 0 hour(s) and 58 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T1700) [16:01:54] !log Livehack debugging at mwmaint1002 for T291836 [16:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:02] T291836: updateMenteeData.php runs weird queries - https://phabricator.wikimedia.org/T291836 [16:02:27] (03PS1) 10Ottomata: Revert "Update eventgate helmfile.d for eventgate 0.5 chart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/724106 [16:03:26] (03PS2) 10Ottomata: Revert "Eventgate: Symlink _helpers and _tls_helpers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/723785 [16:03:47] (03CR) 10Ladsgroup: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/724087 (https://phabricator.wikimedia.org/T167900) (owner: 10Legoktm) [16:05:40] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: lists.wikimedia.org reporting "You must GET the form before submitting it" for all list subscription attempts - https://phabricator.wikimedia.org/T185222 (10Legoktm) Without any 
server-side logs from when this was reported it's hard to say what was going... [16:06:23] (03CR) 10Legoktm: [C: 03+2] mailman: Redirect /mailman/subscribe/$listname URLs too [puppet] - 10https://gerrit.wikimedia.org/r/724087 (https://phabricator.wikimedia.org/T167900) (owner: 10Legoktm) [16:08:25] (03CR) 10Ottomata: [C: 03+2] Revert "Eventgate: Symlink _helpers and _tls_helpers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/723785 (owner: 10Ottomata) [16:08:30] !log [urbanecm@mwmaint1002 ~]$ scap pull # T291836 [16:08:34] (03CR) 10Ottomata: [C: 03+2] Revert "Update eventgate helmfile.d for eventgate 0.5 chart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/724106 (owner: 10Ottomata) [16:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:38] T291836: updateMenteeData.php runs weird queries - https://phabricator.wikimedia.org/T291836 [16:08:56] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: lists.wikimedia.org reporting "You must GET the form before submitting it" for all list subscription attempts - https://phabricator.wikimedia.org/T185222 (10Legoktm) 05Open→03Resolved a:03Legoktm [16:10:45] !log reverting eventgate-logging-external chart change in codfw - T291504 [16:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:51] T291504: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 [16:12:30] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [16:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:08] (03PS1) 10Volans: puppet: fix check exception inheritance [software/spicerack] - 10https://gerrit.wikimedia.org/r/724094 [16:14:18] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[16:14:19] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [16:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:25] (03PS2) 10Volans: puppet: fix check exception inheritance [software/spicerack] - 10https://gerrit.wikimedia.org/r/724094 [16:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:29] RECOVERY - LVS eventgate-logging-external codfw port 4392/tcp - EventGate logging endpoint- eventgate-logging-external.svc.codfw.wmnet and intake-logging.wikimedia.org IPv4 on eventgate-logging-external.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 841 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:16:01] jouncebot: nowandnext [16:16:01] No deployments scheduled for the next 0 hour(s) and 43 minute(s) [16:16:01] In 0 hour(s) and 43 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T1700) [16:16:24] (03PS5) 10Hnowlan: cassandra: use FQDN in CN name for future instances [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) [16:16:25] ooh, nice command [16:16:31] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:34] !log otto@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-logging-external,name=codfw [16:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:18:00] !log otto@puppetmaster1001 conftool action : set/ttl=300; selector: dnsdisc=eventgate-logging-external [16:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:23] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:18:41] 10SRE, 10ops-eqiad, 10Analytics: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (10wiki_willy) a:03Cmjohnson [16:18:58] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:20:06] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10wiki_willy) a:05hnowlan→03Cmjohnson Just a FYI @Cmjohnson - it looks like this one is still under warranty for another a month or so. Thanks, Willy [16:20:25] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10Eevans) >>! In T291738#7379805, @hnowlan wrote: > Given that this host is a single rack and the /srv/ partition is on an array and not using JBOD I think this host can just be shut dow... 
[16:22:04] 10SRE, 10ops-eqiad, 10Analytics-Clusters: analytics1069 mgmt interface intermittently goes up and down - https://phabricator.wikimedia.org/T291732 (10odimitrijevic) [16:23:27] (03CR) 10ZPapierski: Added spicerack.kafka with offset transfer function (038 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [16:24:52] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (10akosiaris) I 've went ahead and created https://grafana-rw.wikimedia.org/d/Y1UyyEH7z/t290445?orgId=1 to depict the findings in grafana a bit more. It's not the entire... [16:25:33] (03CR) 10Volans: "Thanks for the contribution! I did a first pass a the code (skipped the test for now)." [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [16:29:23] 10SRE, 10serviceops, 10Patch-For-Review: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (10ssastry) This bug fix is definitely one of the reasons, but we also rolled out a number of other perf tweaks / improvements that Tim made in Parsoid itself in... 
[16:31:26] (03PS6) 10Hnowlan: cassandra: use FQDN in CN name for future instances [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) [16:38:07] (03CR) 10Jbond: [C: 03+1] puppet: fix check exception inheritance [software/spicerack] - 10https://gerrit.wikimedia.org/r/724094 (owner: 10Volans) [16:41:14] (03CR) 10Volans: [C: 03+2] puppet: fix check exception inheritance [software/spicerack] - 10https://gerrit.wikimedia.org/r/724094 (owner: 10Volans) [16:41:57] (03PS1) 10Ladsgroup: Expand local URLs to absolute URLs in ParserOutput [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724109 (https://phabricator.wikimedia.org/T263581) [16:42:12] (03CR) 10Ladsgroup: [C: 03+2] "Tested in mwdebug1001, works fine. Deploying." [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724109 (https://phabricator.wikimedia.org/T263581) (owner: 10Ladsgroup) [16:44:39] (03PS1) 10Majavah: kubeadm: Disable PodPresets [puppet] - 10https://gerrit.wikimedia.org/r/724101 (https://phabricator.wikimedia.org/T279106) [16:44:51] (03CR) 10Majavah: [C: 04-1] "not quite yet" [puppet] - 10https://gerrit.wikimedia.org/r/724101 (https://phabricator.wikimedia.org/T279106) (owner: 10Majavah) [16:47:20] (03PS1) 10Zabe: swift: remove absented swift-drive-audit cron [puppet] - 10https://gerrit.wikimedia.org/r/724102 (https://phabricator.wikimedia.org/T273673) [16:47:51] (03CR) 10Zabe: "Tests are going to fail, due to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/723703 not being merged yet." 
[core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724109 (https://phabricator.wikimedia.org/T263581) (owner: 10Ladsgroup) [16:48:40] (03Merged) 10jenkins-bot: puppet: fix check exception inheritance [software/spicerack] - 10https://gerrit.wikimedia.org/r/724094 (owner: 10Volans) [16:49:43] (03CR) 10Ladsgroup: [C: 03+2] PHPUnit: enable convertDeprecationsToExceptions [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723703 (https://phabricator.wikimedia.org/T291731) (owner: 10Zabe) [16:49:48] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10Cmjohnson) @Eevans I do see the failed raid and will put in a ticket but before I do that I would like to upgrade the f/w so Dell doesn't use that as an excuse to push back. This wil... [16:50:05] (03CR) 10Ladsgroup: [C: 03+2] Expand local URLs to absolute URLs in ParserOutput (031 comment) [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724109 (https://phabricator.wikimedia.org/T263581) (owner: 10Ladsgroup) [16:50:44] (03PS1) 10AOkoth: gitlab: test edit [puppet] - 10https://gerrit.wikimedia.org/r/724104 [16:53:03] (03PS1) 10Hnowlan: cassandra: remove variable for enabling jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/724105 [16:53:41] (03CR) 10jerkins-bot: [V: 04-1] cassandra: remove variable for enabling jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/724105 (owner: 10Hnowlan) [16:54:49] (03PS2) 10Hnowlan: cassandra: remove variable for enabling jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/724105 [16:55:53] (03CR) 10Dzahn: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/724104 (owner: 10AOkoth) [16:58:37] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31318/console" [puppet] - 10https://gerrit.wikimedia.org/r/724105 (owner: 10Hnowlan) [16:59:46] (03PS1) 10Volans: CHANGELOG: add 
changelogs for release v1.0.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/724126 [17:00:04] ryankemper: (Dis)respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T1700). Please do the needful. [17:00:53] (03PS1) 10Jgiannelos: tegola-vector-tiles: Exclude master nodes from postgres proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/724127 [17:01:49] (03PS3) 10Hnowlan: cassandra: remove variable for enabling jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/724105 [17:03:20] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31319/console" [puppet] - 10https://gerrit.wikimedia.org/r/724105 (owner: 10Hnowlan) [17:06:20] (03CR) 10jerkins-bot: [V: 04-1] Expand local URLs to absolute URLs in ParserOutput [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724109 (https://phabricator.wikimedia.org/T263581) (owner: 10Ladsgroup) [17:09:02] (03CR) 10Ladsgroup: [C: 03+2] "." 
[core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724109 (https://phabricator.wikimedia.org/T263581) (owner: 10Ladsgroup) [17:12:21] (03PS7) 10Hnowlan: cassandra: use FQDN in CN name for future instances [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) [17:12:42] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.0.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/724126 (owner: 10Volans) [17:13:49] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:14:12] (03PS1) 10Volans: Upstream release v1.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/724131 [17:16:16] (03Merged) 10jenkins-bot: PHPUnit: enable convertDeprecationsToExceptions [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723703 (https://phabricator.wikimedia.org/T291731) (owner: 10Zabe) [17:16:21] (03CR) 10Dzahn: [C: 03+2] gitlab: test edit [puppet] - 10https://gerrit.wikimedia.org/r/724104 (owner: 10AOkoth) [17:18:35] (03PS1) 10Cmjohnson: Adding new elastic servers to dhcp and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724133 (https://phabricator.wikimedia.org/T281989) [17:20:23] (03CR) 10Cmjohnson: [C: 03+2] Adding new elastic servers to dhcp and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724133 (https://phabricator.wikimedia.org/T281989) (owner: 10Cmjohnson) [17:24:19] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=ldap-ro*.eqiad.wmnet [17:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:50] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=ldap-replica* [17:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:11] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=ldap-replica1003.wikimedia.org [17:25:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:18] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=ldap-replica1004.wikimedia.org [17:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:22] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=ldap-replica200*.wikimedia.org [17:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:33] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=ldap-replica2005.wikimedia.org [17:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:39] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=ldap-replica2006.wikimedia.org [17:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:13] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=ldap-replica2006.wikimedia.org [17:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:11] (03PS1) 10Ottomata: eventgate/_tls_helpers.tlp - make more like common template [deployment-charts] - 10https://gerrit.wikimedia.org/r/724137 (https://phabricator.wikimedia.org/T291856) [17:29:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, and 2 others: Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` elastic1068.eqiad.wmnet `... [17:32:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:13] (03PS1) 10Bartosz Dziewoński: ChangeTags: Set interface flag when parsing tag names [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724110 (https://phabricator.wikimedia.org/T291776) [17:34:18] (03CR) 10Volans: [C: 03+2] Upstream release v1.0.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/724131 (owner: 10Volans) [17:37:43] (03PS1) 10Dzahn: Revert "gitlab: test edit" [puppet] - 10https://gerrit.wikimedia.org/r/724111 [17:38:02] (03Merged) 10jenkins-bot: Expand local URLs to absolute URLs in ParserOutput [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724109 (https://phabricator.wikimedia.org/T263581) (owner: 10Ladsgroup) [17:39:01] !log uploaded spicerack_1.0.2 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [17:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:28] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.1/includes/page/Article.php: Backport: [[gerrit:724109|Expand local URLs to absolute URLs in ParserOutput (T263581)]], Part I (duration: 00m 59s) [17:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:34] T263581: Find out the reason and potentially eliminate ParserCache split on action:render - https://phabricator.wikimedia.org/T263581 [17:43:37] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.1/includes/parser/ParserOutput.php: Backport: [[gerrit:724109|Expand local URLs to absolute URLs in ParserOutput (T263581)]], Part II (duration: 00m 57s) [17:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:46] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.1/includes/parser/ParserCache.php: Backport: [[gerrit:724109|Expand local URLs to absolute URLs in ParserOutput (T263581)]], Part III (duration: 00m 56s) [17:44:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:05] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.1/includes/Title.php: Backport: [[gerrit:724109|Expand local URLs to absolute URLs in ParserOutput (T263581)]], Part IV (duration: 00m 56s) [17:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:00] jouncebot: nowandnext [17:56:00] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [17:56:00] In 0 hour(s) and 3 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T1800) [17:59:07] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-9), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) [18:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T1800) [18:00:04] zabe and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:15] i can deploy today [18:00:22] (03CR) 10Ebernhardson: "Tested by manually copying the rendered nginx config into individual servers (and letting puppet change it back on next run). 
The rendered" [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [18:00:23] zabe: MatmaRex: hi, around? [18:00:31] o/ [18:02:16] zabe: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/723703 is already backported, and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/723729 is an i18n-only change, which we generally don't backport, as it takes a long time. [18:02:24] Is the i18n one urgent? [18:03:42] no, not really, since it's only en-gb. If you don't want to do that, that's fine. [18:04:24] hi [18:04:42] hi MatmaRex [18:04:52] (03CR) 10Urbanecm: [C: 03+2] ChangeTags: Set interface flag when parsing tag names [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724110 (https://phabricator.wikimedia.org/T291776) (owner: 10Bartosz Dziewoński) [18:04:56] I'll ping you when ready [18:05:00] * urbanecm goes to deploy some Growth stuff [18:06:57] (03PS1) 10Urbanecm: Growth: Promote 208 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724140 (https://phabricator.wikimedia.org/T290582) [18:07:34] (03CR) 10Urbanecm: [C: 03+2] Growth: Promote 208 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724140 (https://phabricator.wikimedia.org/T290582) (owner: 10Urbanecm) [18:08:39] (03Merged) 10jenkins-bot: Growth: Promote 208 wikis out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724140 (https://phabricator.wikimedia.org/T290582) (owner: 10Urbanecm) [18:10:00] 10SRE, 10Analytics, 10Data-Engineering, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10nettrom_WMF) Looks like this got deployed last week with the train? I'm not seeing any changes in the server-sid... 
[18:10:24] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2cb6f47ba4b739fa7bc8a2036b473c014ff69c8b: Growth: Promote 208 wikis out of dark mode (T290582) (duration: 00m 56s) [18:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:31] T290582: Deploy Growth features to all remaining active versions of Wikipedia - https://phabricator.wikimedia.org/T290582 [18:11:31] a joyful moment, Growth features are now on most Wikipedias [18:12:02] (03CR) 10Ppchelko: [C: 03+1] eventgate/_tls_helpers.tlp - make more like common template [deployment-charts] - 10https://gerrit.wikimedia.org/r/724137 (https://phabricator.wikimedia.org/T291856) (owner: 10Ottomata) [18:14:18] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10Eevans) >>! In T291738#7380924, @Cmjohnson wrote: > @Eevans I do see the failed raid and will put in a ticket but before I do that I would like to upgrade the f/w so Dell doesn't use t... [18:16:14] !log Deployed patch for T284419 [18:16:16] (03CR) 10Jforrester: "I vaguely recall that this wasn't enabled for Wikipedias as it has significant performance implications; does anyone remember whether this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723689 (https://phabricator.wikimedia.org/T291736) (owner: 10MarcoAurelio) [18:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... [18:20:26] 10SRE, 10Analytics, 10Data-Engineering, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) We need to enable it per eventgate service. Patch OTW... [18:20:28] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/724105 (owner: 10Hnowlan) [18:21:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:52] i guess CI is unhappy today, huh [18:24:08] MatmaRex: just slow, I don't see any -1's? [18:24:21] yeah [18:24:29] yeah, sorry, that's what i meant [18:24:39] np, just misunderstood you [18:24:43] I have time 🙂 [18:25:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... [18:26:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... 
[18:26:12] (03PS1) 10Ottomata: EventBus - Enable x_client_ip_forwarding_enabled for analytics purposes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724143 (https://phabricator.wikimedia.org/T288853) [18:26:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... [18:27:28] (03Merged) 10jenkins-bot: ChangeTags: Set interface flag when parsing tag names [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724110 (https://phabricator.wikimedia.org/T291776) (owner: 10Bartosz Dziewoński) [18:27:47] finally [18:28:14] urbanecm: i have a config patch to deploy once you are finished with backport window. [18:28:20] no hurry [18:28:21] MatmaRex: mwdebug1001 has your patch, please test [18:28:23] ottomata: noted [18:28:28] i'll ping you when all-clear [18:28:31] thank you [18:28:33] np [18:29:21] thanks, looking [18:30:00] !log updating firmware on sessionstore1003 [18:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:17] urbanecm: looks good [18:30:21] thanks, syncing [18:31:03] PROBLEM - Host sessionstore1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:31:32] cmjohnson1: ^^, you might want to ack/downtime/whatever [18:32:03] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.1/includes/changetags/ChangeTags.php: b1f4b4e45b37792534e0aef4e636a26c369cc6a8: ChangeTags: Set interface flag when parsing tag names (T291776) (duration: 00m 56s) [18:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:09] T291776: Edit tags containing newlines - https://phabricator.wikimedia.org/T291776 [18:32:16] 18:31:46 Check 'Logstash Error rate for mw1417.eqiad.wmnet' failed: ERROR: 18% OVER_THRESHOLD (Avg. 
Error rate: Before: 0.16, After: 2.00, Threshold: 1.63), this...isn't good [18:32:25] MatmaRex: logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6, can you help me check? [18:32:39] looking [18:33:32] ACKNOWLEDGEMENT - Host sessionstore1003 is DOWN: PING CRITICAL - Packet loss = 100% Chris Johnson updating firmware - The acknowledgement expires at: 2021-09-28 19:33:06. [18:33:46] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1068.eqiad.wmnet with reason: REIMAGE [18:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:52] urbanecm: doesn't look related to the patch, does it? [18:33:56] i don't know what would cause it though [18:34:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... 
[18:34:24] not sure, it's just that scap warns me about unexpected exceptions (although less than its threshold for automatically aborting) [18:34:55] uh, isn't that amir's patch [18:35:01] from a few minutes ago [18:35:07] and https://logstash.wikimedia.org/goto/72c98769c5b530cb766eddd499f20085 is the right link [18:35:10] scap confused me for a while [18:35:24] or maybe not [18:35:29] very likely, it touches that area [18:35:30] Amir1: ^^ [18:35:38] looking [18:35:40] urbanecm: from this patch: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/724109 [18:35:42] Amir1: ^ [18:35:55] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1068.eqiad.wmnet with reason: REIMAGE [18:35:56] that introduces the HtmlFormatter, which is supposedly not found [18:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:02] the use of HtmlFormatter* [18:36:07] ignore the HTMLFormatter [18:36:10] that's mwdebug [18:36:11] that's only on the debug host though [18:36:26] Amir1: check https://logstash.wikimedia.org/goto/72c98769c5b530cb766eddd499f20085, that's from...the sec patch you deployed [18:36:26] I didn't properly apply the patch [18:36:31] yup yup [18:37:07] RECOVERY - Host sessionstore1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [18:37:35] 10SRE, 10MassMessage, 10Wikimedia-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Quiddity) @Snaevar Thanks for the report. Unfortunately, (per the comment above in T93049#6512485) the detailed-version of t... 
[18:38:02] PROBLEM - cassandra-a CQL 10.64.48.178:9042 on sessionstore1003 is CRITICAL: connect to address 10.64.48.178 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:38:55] (03CR) 10Ottomata: [C: 03+2] eventgate/_tls_helpers.tlp - make more like common template [deployment-charts] - 10https://gerrit.wikimedia.org/r/724137 (https://phabricator.wikimedia.org/T291856) (owner: 10Ottomata) [18:39:05] RECOVERY - cassandra-a CQL 10.64.48.178:9042 on sessionstore1003 is OK: TCP OK - 0.000 second response time on 10.64.48.178 port 9042 https://phabricator.wikimedia.org/T93886 [18:39:13] Amir1: I take it that you're fixing it? :-) [18:39:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... [18:39:25] yup [18:39:38] okay, please hand over to otto.mata once done [18:39:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1070.eqiad.wmnet with reason: REIMAGE [18:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:58] syncing [18:40:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... 
[18:40:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1071.eqiad.wmnet with reason: REIMAGE [18:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:18] !log Deployed patch for T284419 second time [18:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:51] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1070.eqiad.wmnet with reason: REIMAGE [18:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:21] ottomata: the floor is yours [18:42:38] (btw, let me know if you got any hints on how to debug the eventbus slowness issue) [18:42:40] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [18:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1071.eqiad.wmnet with reason: REIMAGE [18:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1068.eqiad.wmnet'] ` and were **ALL** success... [18:46:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... 
[18:46:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1069.eqiad.wmnet with reason: REIMAGE [18:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... [18:46:49] Amir1: dunno about job queue slowness, but i am working with petr on enabling http error logging from envoy proxy [18:46:56] which is handling the requests before they get to eventgate [18:47:02] so maybe we'll get some clues there [18:47:03] but not sure [18:47:19] (03CR) 10Ottomata: [C: 03+2] EventBus - Enable x_client_ip_forwarding_enabled for analytics purposes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724143 (https://phabricator.wikimedia.org/T288853) (owner: 10Ottomata) [18:47:40] sure, I have no idea how these work, so if you give me some documentation, I'd be more than happy to help debugging [18:47:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1072.eqiad.wmnet with reason: REIMAGE [18:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:24] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1069.eqiad.wmnet with reason: REIMAGE [18:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... 
[18:50:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1072.eqiad.wmnet with reason: REIMAGE [18:50:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1071.eqiad.wmnet'] ` and were **ALL** success... [18:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1070.eqiad.wmnet'] ` and were **ALL** success... [18:52:54] 10SRE, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) [18:52:56] !log otto@deploy1002 scap failed: average error rate on 6/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details) [18:52:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1073.eqiad.wmnet with reason: REIMAGE [18:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:13] ? 
[18:53:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1074.eqiad.wmnet with reason: REIMAGE [18:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... [18:54:40] urbanecm: sorry for ping but not sure who to ask about ^^^^ [18:54:50] i think my scap sync failed for unrelated reasons to my patch [18:55:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... [18:55:05] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1073.eqiad.wmnet with reason: REIMAGE [18:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:16] ottomata: if you see this kind of error message, REVERT IMMEDIATELY and ask questions later [18:55:25] ok doing [18:55:38] thanks [18:55:43] looking at logs [18:55:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... 
[18:56:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... [18:56:13] [8869681f-6869-4b84-bedb-d5a98e9d864c] /w/api.php PHP Notice: Undefined index: HTTP_X_CLIENT_IP [18:56:15] this is definitely related [18:56:19] ok then def related. [18:56:22] so yes, please revert :) [18:56:31] i didn't see that error, mostly just saw UdpSocket logs [18:56:33] reverting now [18:56:44] yeah, that link from scap is wrong for some reason [18:56:46] I'll file a task [18:56:53] !log otto@deploy1002 Synchronized wmf-config/CommonSettings.php: REVERT: Enable x_client_ip_forwarding_enabled for eventgate-analytics and eventgate-analytics-external - T288853 (duration: 00m 56s) [18:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:59] ok reverted [18:57:00] T288853: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 [18:57:01] thank you [18:57:04] any time [18:57:10] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1074.eqiad.wmnet with reason: REIMAGE [18:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:16] RECOVERY - Device not healthy -SMART- on sessionstore1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sessionstore1003&var-datasource=eqiad+prometheus/ops [18:57:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... [18:57:32] can you also push the revert to gerrit please? 🙂 [18:57:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1069.eqiad.wmnet'] ` and were **ALL** success... [18:57:46] urbanecm: does undefined index hard fail? or is that mostly a canary log checker to make sure things like that don't go out too far? [18:57:50] like, did requests die because of that? [18:57:58] yes, 22k of requests failed because of this [18:57:59] urbanecm: yes pushing [18:58:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` el... [18:58:04] growl. 
[18:58:07] ok [18:58:12] scap's canary check is the last check before the code goes out to _everyone_ [18:58:16] it's a portion of real appservers [18:58:18] aye [18:58:23] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [18:58:31] scaps checks fatals on those appservers before it proceeds to the fleet [18:58:47] if it exceeds a given threshold, it yells at the deployer 🙂 [18:58:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1072.eqiad.wmnet'] ` and were **ALL** success... [18:58:58] (03PS1) 10Ottomata: Revert "EventBus - Enable x_client_ip_forwarding_enabled for analytics purposes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724112 [18:59:08] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: elastic1068, elastic1070, cumin1001, elastic1069, cumin2001, elastic1073 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [18:59:12] (03CR) 10Ryan Kemper: query_service: support multiple variants of wdqs microsite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [18:59:18] great [18:59:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1075.eqiad.wmnet with reason: REIMAGE [18:59:32] usually, it's caused by faulty patch. 
In theory it can happen as a coincidence too, but that's quite rare 🙂 [18:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1076.eqiad.wmnet with reason: REIMAGE [19:00:20] (03CR) 10Ottomata: [C: 03+2] Revert "EventBus - Enable x_client_ip_forwarding_enabled for analytics purposes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724112 (owner: 10Ottomata) [19:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:37] urbanecm: should I sync out the revert even tho it's a no-op? [19:00:54] i thought you did already? [19:00:55] 20:56 <+logmsgbot> !log otto@deploy1002 Synchronized wmf-config/CommonSettings.php: REVERT: Enable x_client_ip_forwarding_enabled for eventgate-analytics and eventgate-analytics-external - T288853 (duration: 00m 56s) [19:01:05] urbanecm: i manually reset to previous commit on deploy server [19:01:10] i mean, should i sync the gerrit revert [19:01:24] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/724112 [19:01:35] ah, if you're certain you synced the reverted version, just fetching it is fine :) [19:01:41] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1075.eqiad.wmnet with reason: REIMAGE [19:01:42] ok fetch and rebase? [19:01:44] yup [19:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:29] (actually, i was likely wrong in "requests die" -- it's a notice, not a fatal. 
Serious enough to revert anyway 🙂 ) [19:02:31] ya ok done, should be good [19:02:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1077.eqiad.wmnet with reason: REIMAGE [19:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:40] great! [19:02:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1073.eqiad.wmnet'] ` and were **ALL** success... [19:03:00] indeed.. i checked on mwdebug1001 and didn't see apparent problems like that, but did not check logs [19:03:09] ok will follow up with the patch later, won't do any more of that today. [19:03:15] probably needs bug fix and train deployment [19:03:28] or backport :) [19:03:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:03:31] anyway, sounds good to me. [19:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:40] or backport, but i might be lazy (this is someone else's patch... :) ) [19:03:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1076.eqiad.wmnet with reason: REIMAGE [19:03:55] hehe [19:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:42] 10SRE, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) There is a bug in the code, so I had to revert my config patch. Hope to follow up this week. 
[19:05:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1074.eqiad.wmnet'] ` and were **ALL** success... [19:05:28] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [19:05:28] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [19:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1077.eqiad.wmnet with reason: REIMAGE [19:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:41] ottomata: also filed T291870 for the confusing link from the canary check [19:07:42] T291870: scap's canary check gives confusing logstash link - https://phabricator.wikimedia.org/T291870 [19:08:04] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1078.eqiad.wmnet with reason: REIMAGE [19:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1079.eqiad.wmnet with reason: REIMAGE [19:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:36] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [19:08:37] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[19:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:04] thanks urbanecm [19:09:06] np [19:09:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1080.eqiad.wmnet with reason: REIMAGE [19:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:44] thanks for the ping :) [19:10:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1081.eqiad.wmnet with reason: REIMAGE [19:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:16] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1078.eqiad.wmnet with reason: REIMAGE [19:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1082.eqiad.wmnet with reason: REIMAGE [19:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1076.eqiad.wmnet'] ` and were **ALL** success... [19:11:22] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [19:11:23] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[19:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1083.eqiad.wmnet with reason: REIMAGE [19:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1075.eqiad.wmnet'] ` and were **ALL** success... [19:12:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1079.eqiad.wmnet with reason: REIMAGE [19:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1077.eqiad.wmnet'] ` and were **ALL** success... [19:13:43] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [19:13:43] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . 
[19:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1083.eqiad.wmnet with reason: REIMAGE [19:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:34] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1080.eqiad.wmnet with reason: REIMAGE [19:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:15:00] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1081.eqiad.wmnet with reason: REIMAGE [19:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:08] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic1082.eqiad.wmnet with reason: REIMAGE [19:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:31] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [19:16:31] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . 
[19:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1078.eqiad.wmnet'] ` and were **ALL** success... [19:19:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1079.eqiad.wmnet'] ` and were **ALL** success... [19:20:31] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1080.eqiad.wmnet'] ` and were **ALL** success... [19:22:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1083.eqiad.wmnet'] ` and were **ALL** success... [19:22:41] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:22:41] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[19:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1082.eqiad.wmnet'] ` and were **ALL** success... [19:23:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1081.eqiad.wmnet'] ` and were **ALL** success... [19:24:23] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:24:23] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:04] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (10hashar) [19:26:05] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [19:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:02] (03PS1) 10Brennen Bearnes: Revert "Revert "gitlab cas: uid instead of CN; add nickname_key"" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/724160 (https://phabricator.wikimedia.org/T288392) [19:28:38] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[19:28:38] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [19:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:46] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [19:32:47] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [19:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:01] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/724162 [19:45:18] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/724163 [19:48:57] (03CR) 10Dzahn: "Hi Arnold, this change is the exact opposite of what we did earlier. It reverts the change. I created this with a simple click in the Gerr" [puppet] - 10https://gerrit.wikimedia.org/r/724111 (owner: 10Dzahn) [19:49:36] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/724165 [19:53:42] (03CR) 10Dzahn: "Hi people in CC:, this was Arnold's first Gerrit commit, we went through it in a session. It is just a demo change and will be reverted. T" [puppet] - 10https://gerrit.wikimedia.org/r/724104 (owner: 10AOkoth) [20:00:05] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T2000). 
[20:00:40] !log ms-be2036 - remove commeeted out swift-drive-audit cron [20:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:46] (03CR) 10Dzahn: [C: 03+2] "checked: [cumin1001:~] $ sudo cumin 'ms-be*' 'crontab -u root -l | grep swift-drive-audit'" [puppet] - 10https://gerrit.wikimedia.org/r/724102 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [20:01:56] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] Revert "Revert "gitlab cas: uid instead of CN; add nickname_key"" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/724160 (https://phabricator.wikimedia.org/T288392) (owner: 10Brennen Bearnes) [20:02:46] (03CR) 10Brennen Bearnes: Revert "Revert "gitlab cas: uid instead of CN; add nickname_key"" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/724160 (https://phabricator.wikimedia.org/T288392) (owner: 10Brennen Bearnes) [20:02:57] (03Abandoned) 10Brennen Bearnes: Revert "Revert "gitlab cas: uid instead of CN; add nickname_key"" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/724160 (https://phabricator.wikimedia.org/T288392) (owner: 10Brennen Bearnes) [20:03:07] (03CR) 10Dzahn: [C: 04-2] "> Patch Set 1: Code-Review-2" [puppet] - 10https://gerrit.wikimedia.org/r/708257 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [20:03:18] (03CR) 10Dzahn: [C: 04-2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/708257 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [20:04:38] 10SRE, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, and 5 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10nettrom_WMF) >>! In T288853#7381469, @Ottomata wrote: > There is a bug in the code, so I had to revert my c... 
[20:06:09] !log gitlab1001: ~1hr downtime to attempt migration of usernames to shell uid (T288392) [20:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:14] T288392: GitLab uses 'real name' as username (rather than 'shell name' or an user-specified name) - https://phabricator.wikimedia.org/T288392 [20:08:29] (03CR) 10RLazarus: gitlab: test edit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724104 (owner: 10AOkoth) [20:19:45] (03Abandoned) 10Dzahn: hiera/appservers: remove mcrouter proxy values for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/708257 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [20:21:19] (03CR) 10Dzahn: gitlab: test edit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724104 (owner: 10AOkoth) [20:22:46] PROBLEM - SSH on ms-fe2006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:23:06] cloud K-Lined ? sigh [20:26:34] fixed after talking to libera and cloud [20:27:55] wm-bb caused the others to get kicked as well, shared IP [20:28:19] !log gitlab1001: done with user renames, restarting gitlab to apply session duration value after a reconfiguration [20:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:47] (03Abandoned) 10Brennen Bearnes: Revert "gitlab cas: uid instead of CN; add nickname_key" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/723230 (https://phabricator.wikimedia.org/T288392) (owner: 10Brennen Bearnes) [20:32:14] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP-wmf for erayfield - https://phabricator.wikimedia.org/T291126 (10mepps) Yes, thank you @Urbanecm. I wanted @ERayfield to have access to the services documented in [[ https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#Specific_groups | this doc ]]. I'll upda... 
[20:34:00] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10ldelench_wmf) [20:37:13] (03CR) 10Dzahn: [C: 03+2] puppetmaster::rsync: replace data sync crons with timers/jobs [puppet] - 10https://gerrit.wikimedia.org/r/723310 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:37:32] (03CR) 10Dzahn: [C: 03+2] "these actually exist only on puppetmaster2001, checking" [puppet] - 10https://gerrit.wikimedia.org/r/723310 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:38:52] (03PS2) 10Legoktm: Configure Timeline like most other extensions (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723648 [20:38:54] (03PS2) 10Legoktm: Configure Timeline like most other extensions (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723649 [20:38:56] (03PS2) 10Legoktm: Configure Timeline like most other extensions (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723650 [20:38:58] (03PS2) 10Legoktm: Set $wgTimelineFonts and send all Timeline generation to Shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723651 (https://phabricator.wikimedia.org/T289226) [20:39:00] (03PS2) 10Legoktm: Remove obsolete Timeline configuration and fonts submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 [20:39:02] (03PS3) 10Legoktm: Have PdfHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723050 (https://phabricator.wikimedia.org/T289228) [20:39:04] (03PS3) 10Legoktm: Only set tiff settings when $wmgUsePagedTiffHandler = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723051 [20:39:06] (03PS3) 10Legoktm: Have PagedTiffHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723052 (https://phabricator.wikimedia.org/T289228) [20:42:23] (03CR) 10Jforrester: 
"Wonderful to see." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723650 (owner: 10Legoktm) [20:42:36] !log [puppetmaster2001:~] $ sudo systemctl start sync-puppet-volatile [20:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:40] (03CR) 10Dzahn: "[puppetmaster2001:~] $ sudo systemctl start sync-puppet-volatile" [puppet] - 10https://gerrit.wikimedia.org/r/723310 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:46:34] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.02695 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:47:51] eh.. looking [20:47:52] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.02294 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:48:15] (03CR) 10Legoktm: [C: 03+2] Configure Timeline like most other extensions (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723648 (owner: 10Legoktm) [20:48:57] legoktm: ❤️ [20:49:07] (03Merged) 10jenkins-bot: Configure Timeline like most other extensions (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723648 (owner: 10Legoktm) [20:49:35] oh.. this is not good. rsync between puppetmasters removed CA file, but puppet run adds it back [20:49:53] James_F: fingers crossed [20:50:06] :-) [20:50:33] 0~~/me around if i can help [20:50:39] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Configure Timeline like most other extensions (1/3) (duration: 00m 58s) [20:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:09] jbond: hey:) soo.. I started the sync of the CA dir.. same thing it should have always been doing. it's rsync with --delete.. 
before and after [20:51:43] jbond: and it pulls _on_ 2001 from 1001 but it's not right on 2001 for some reason [20:51:50] checking permissions [20:53:10] (03CR) 10Legoktm: [C: 03+2] Configure Timeline like most other extensions (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723649 (owner: 10Legoktm) [20:53:13] mutante: still catching up, whats the actual issue? [20:53:44] (let me know if I should stop deploying MW changes) [20:54:01] (03Merged) 10jenkins-bot: Configure Timeline like most other extensions (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723649 (owner: 10Legoktm) [20:54:06] jbond: SSL_read: tlsv1 alert unknown ca on codfw agents after master codfw pulled CA dir from master in eqiad [20:54:35] run puppet on any codfw client and you will see it [20:56:15] ack i see looking [20:56:20] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Configure Timeline like most other extensions (2/3) (duration: 00m 56s) [20:56:23] jbond: the content of /var/lib/puppet/server/ssl/ca isn't as it is on 1001 [20:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:43] jbond: I was going to manually run that rsync.. 
it's the same that _should_ have always been running though [20:57:41] (03CR) 10Legoktm: [C: 03+2] Configure Timeline like most other extensions (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723650 (owner: 10Legoktm) [20:57:42] mutante: please make a copy of the server dir on puppetmaster1001 just to be safe [20:57:49] jbond: i found the problem I think [20:57:50] but otherwise yes that sounds good [20:58:08] there is ${server}::puppet_volatile in BOTH jobs [20:58:12] copy/paste [20:58:28] (03Merged) 10jenkins-bot: Configure Timeline like most other extensions (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723650 (owner: 10Legoktm) [20:58:58] jbond: copied to /root/ca on 1001, creating fix [20:59:11] ack [20:59:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:58] !log legoktm@deploy1002 Synchronized wmf-config/: Configure Timeline like most other extensions (3/3) (duration: 00m 57s) [21:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] Reedy and sbassett: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T2100). [21:01:50] !log legoktm@deploy1002 Synchronized docroot/: Configure Timeline like most other extensions (4/3) (duration: 00m 56s) [21:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:27] (03PS1) 10Dzahn: Revert "puppetmaster::rsync: replace data sync crons with timers/jobs" [puppet] - 10https://gerrit.wikimedia.org/r/724113 [21:09:03] (03PS1) 10Jbond: Revert "puppetmaster::rsync: replace data sync crons with timers/jobs" [puppet] - 10https://gerrit.wikimedia.org/r/724114 [21:10:10] (03CR) 10Jbond: [C: 03+2] Revert "puppetmaster::rsync: replace data sync crons with timers/jobs" [puppet] - 10https://gerrit.wikimedia.org/r/724114 (owner: 10Jbond) [21:10:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:30] running puppet on 10 codfw mw hosts [21:12:47] (03Abandoned) 10Dzahn: Revert "puppetmaster::rsync: replace data sync crons with timers/jobs" [puppet] - 10https://gerrit.wikimedia.org/r/724113 (owner: 10Dzahn) [21:13:47] (03PS1) 10Dzahn: Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" [puppet] - 10https://gerrit.wikimedia.org/r/724115 [21:14:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:04] (03CR) 10Legoktm: [C: 03+2] Set $wgTimelineFonts and send all Timeline generation to Shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723651 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm) [21:15:57] the puppetmaster codfw issue should be resolving. 
puppet agent run worked [21:16:03] (03Merged) 10jenkins-bot: Set $wgTimelineFonts and send all Timeline generation to Shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723651 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm) [21:16:29] !log puppetmaster2001 - /usr/bin/rsync -avz --delete puppetmaster1001.eqiad.wmnet::puppet_ca /var/lib/puppet/server/ssl/ca [21:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:41] (03PS1) 10Jbond: Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" [puppet] - 10https://gerrit.wikimedia.org/r/724116 [21:17:39] legoktm: could you please let me know once you're done deploying? [21:18:07] sure, I have 2 syncs then I'll take a break [21:18:41] ack [21:18:55] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Set $wgTimelineFonts and send all Timeline generation to Shellbox (T289226) (1/2) (duration: 00m 56s) [21:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:01] T289226: Convert EasyTimeline extension to use Shellbox - https://phabricator.wikimedia.org/T289226 [21:20:10] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Set $wgTimelineFonts and send all Timeline generation to Shellbox (T289226) (2/2) (duration: 00m 56s) [21:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:53] urbanecm: done [21:20:58] thanks [21:21:02] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:58] ACKNOWLEDGEMENT - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer daniel_zahn CA sync https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:24:16] !log puppetmaster2001 systemctl reset-failed [21:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Cmjohnson) [21:24:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Cmjohnson) 05Open→03Resolved [21:25:08] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:47] 10SRE, 10ops-eqiad, 10Platform Engineering: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (10Cmjohnson) updated the BIOS, the idrac version is so old that I was not able to update it, I attempted to go back to the oldest version I had saved and it still failed. I do not expect... [21:25:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:41] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:57] legoktm: I'm done, needed to test sth that interferes with staging script [21:28:59] *dir [21:29:15] !log puppetmaster2001 - rm /usr/lib/systemd/system/sync-puppet-ca.* [21:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:23] cool :D will resume in a few minutes [21:30:33] (https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/makeSizeDBLists.php was that thing, obviously doesn't work, but the size dblists are...very outdated. Will file tasks.) [21:33:18] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001147 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:33:28] ^ ok.. phew [21:33:42] that was pretty bad but fixed [21:33:59] !log running `extensions/SecurePoll/cli/wm-scripts/makeGlobalVoterList.php` for MCDC elections [21:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:16] MCDC? [21:34:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:29] Movement Charter Drafting Committee, urbanecm [21:34:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:34:34] ah [21:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:40] the voterlist is the same size as the board one [21:34:45] it's like AC/DC, but for the Movement [21:34:48] :D [21:34:56] and way less cool let's be honest [21:35:12] i was looking at https://en.wikipedia.org/wiki/MCDC, and while it has things that can be elected, none of them looks movement-related [21:35:37] ah, I should have included the phab ticket number [21:35:47] https://en.wikipedia.org/wiki/MC/DC looks interesting [21:35:49] https://phabricator.wikimedia.org/T291668 [21:36:21] I should also have run this script in a screen, I didn't know it would take 9 hours [21:36:41] heh [21:36:41] is it safe to ctrl+c and restart in a screen? [21:36:52] !log puppetmaster2001 - systemctl disable sync-puppet-ca, systemctl unmask sync-puppet-ca, rm /usr/lib/systemd/system/sync-puppet-ca.*, systemctl stop sync-puppet-ca.timer [21:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:03] legoktm: in theory, all maint scripts should count with maint servers going down... [21:37:06] ...but yes, that's theory [21:37:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:37:51] I'm wondering if...there's a better way than to create prod tables for each movement-wide election [21:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:59] we had two of them this year only :-) [21:38:12] urbanecm: lol, I also wonder this [21:38:24] legoktm: it should be ok, I'll update the docs though [21:38:48] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:40:03] and i also wonder why manifests/realm.pp:$private_tables doesn't have to contain the "temp" tables, and nothing sets off alarms about unrecognized tables [21:40:05] so many questions [21:41:48] !log re-running `extensions/SecurePoll/cli/wm-scripts/makeGlobalVoterList.php` for MCDC elections (in a screen this time) (https://phabricator.wikimedia.org/T291668) [21:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:00] it looks like $private_tables is used to check for private tables where they should not exist, for monitoring alerts, which does not necessarily mean all private tables have to be listed there for it to work [21:42:41] DBAs always tell me to never create tables unless it's a) properly configured for labs replication b) private _and_ in private_tables [21:42:43] (i cancelled the other script ftr. not sure if we'll have issues with this one, should be ok though) [21:42:48] so...that's why i ask [21:45:25] (03CR) 10Legoktm: [C: 03+2] Have PagedTiffHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723052 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [21:53:42] (03PS4) 10Legoktm: Have PdfHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723050 (https://phabricator.wikimedia.org/T289228) [21:53:44] (03PS4) 10Legoktm: Only set tiff settings when $wmgUsePagedTiffHandler = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723051 [21:54:05] (03CR) 10Legoktm: [C: 03+2] Have PagedTiffHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723052 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [21:56:39] I seem to have successfully confused Gerrit and myself about what the order of the stack of patches is [21:56:52] (03PS4) 
10Legoktm: Have PagedTiffHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723052 (https://phabricator.wikimedia.org/T289228) [21:58:36] (03CR) 10jerkins-bot: [V: 04-1] Have PagedTiffHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723052 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [21:59:39] (03PS5) 10Legoktm: Have PagedTiffHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723052 (https://phabricator.wikimedia.org/T289228) [21:59:52] (03Abandoned) 10Legoktm: Only set tiff settings when $wmgUsePagedTiffHandler = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723051 (owner: 10Legoktm) [22:00:26] (03CR) 10Legoktm: [C: 03+2] Have PagedTiffHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723052 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [22:00:40] legoktm: :-) [22:01:47] (03Merged) 10jenkins-bot: Have PagedTiffHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723052 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [22:01:58] there we go [22:02:06] urbanecm: I thought that SecurePoll tables were magically always ignored. Though now I write that down, I have no idea how that would happen. [22:11:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[22:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:03] after spending a while figuring out why the pdf log entries weren't showing up, I remembered I just merged the tiff change [22:13:35] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Have PagedTiffHandler use Shellbox service on group0 wikis (T289228) (1/2) (duration: 00m 57s) [22:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:41] T289228: Convert media handling code (PdfHandler, PagedTiffHandler) to use Shellbox - https://phabricator.wikimedia.org/T289228 [22:13:48] (03Abandoned) 10Dzahn: Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" [puppet] - 10https://gerrit.wikimedia.org/r/724116 (owner: 10Jbond) [22:14:38] (03PS5) 10Legoktm: Have PdfHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723050 (https://phabricator.wikimedia.org/T289228) [22:14:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:48] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Have PagedTiffHandler use Shellbox service on group0 wikis (T289228) (2/2) (duration: 00m 58s) [22:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:14] (03CR) 10Legoktm: [C: 03+2] Have PdfHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723050 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [22:16:07] Hey all - mstyles and I are deploying a security patch for https://phabricator.wikimedia.org/T291696 now. 
[22:16:13] (03Merged) 10jenkins-bot: Have PdfHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723050 (https://phabricator.wikimedia.org/T289228) (owner: 10Legoktm) [22:16:57] (03PS1) 10Legoktm: Have SyntaxHighlight use Shellbox on group1 wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724198 (https://phabricator.wikimedia.org/T289227) [22:17:51] sbassett: oh, I have one mw-config change staged but not yet synced. I'm going to pull it onto mwdebug1001 but won't sync it until y'all are done [22:18:11] legoktm: ok, sounds good. this'll be scapped in a minute. [22:23:03] !log deployed security patch for T291696 [22:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:13] legoktm: and deployed. looks stable too. [22:24:25] awesome, thanks [22:24:44] RECOVERY - SSH on ms-fe2006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:25:08] !log legoktm@deploy1002 sync-file aborted: Have PdfHandler use Shellbox service on group0 wikis (T289228) (duration: 00m 00s) [22:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:14] T289228: Convert media handling code (PdfHandler, PagedTiffHandler) to use Shellbox - https://phabricator.wikimedia.org/T289228 [22:25:23] * legoktm was fixing message [22:25:58] (03PS2) 10Dzahn: Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" and fix dest dir [puppet] - 10https://gerrit.wikimedia.org/r/724115 [22:26:08] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Have PdfHandler use Shellbox service on group0 wikis (T289228) (1/2) (duration: 00m 57s) [22:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:39] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" and fix dest dir 
[puppet] - 10https://gerrit.wikimedia.org/r/724115 (owner: 10Dzahn) [22:27:25] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Have PdfHandler use Shellbox service on group0 wikis (T289228) (2/2) (duration: 00m 56s) [22:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:51] (03PS2) 10Legoktm: Have SyntaxHighlight use Shellbox on group1 wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724198 (https://phabricator.wikimedia.org/T289227) [22:30:03] (03CR) 10Legoktm: [C: 03+2] Have SyntaxHighlight use Shellbox on group1 wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724198 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm) [22:30:42] (03PS3) 10Dzahn: Revert "Revert "puppetmaster::rsync: replace data sync crons with timers/jobs"" [puppet] - 10https://gerrit.wikimedia.org/r/724115 [22:31:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[22:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:18] (03Merged) 10jenkins-bot: Have SyntaxHighlight use Shellbox on group1 wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724198 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm) [22:34:44] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Have SyntaxHighlight use Shellbox on group1 wikis too (T289227) (duration: 00m 57s) [22:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:49] T289227: Convert SyntaxHighlight to use Shellbox - https://phabricator.wikimedia.org/T289227 [22:36:37] (03PS11) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [22:39:43] * legoktm is done for today [22:44:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:45] (03CR) 10Dzahn: [C: 04-1] microsites: Switch to wmflib::dir::mkdir_p (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724053 (owner: 10Muehlenhoff) [22:48:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[22:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:09] (03PS1) 10Legoktm: Bump shellbox-syntaxhighlight up to 6 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/724203 (https://phabricator.wikimedia.org/T289227) [22:50:35] (03CR) 10Legoktm: [C: 03+2] Bump shellbox-syntaxhighlight up to 6 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/724203 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm) [22:53:12] Just FYI - after deploying a security patch and updating T276237 - I noticed that /srv/mediawiki-staging/php-1.38.0-wmf.1/includes/specials/SpecialContributions.php was modified. [22:55:27] (03Merged) 10jenkins-bot: Bump shellbox-syntaxhighlight up to 6 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/724203 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm) [22:56:07] sbassett: could it be the security patch for T284419? [22:56:23] I pinged someone elsewhere about it [22:57:16] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' . [22:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:04] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' . [22:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210927T2300). [23:00:05] No Gerrit patches in the queue for this window AFAICS. [23:04:25] (03CR) 10Bstorm: [C: 03+1] "How incredibly weird. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) (owner: 10David Caro) [23:09:37] (03PS1) 10Urbanecm: Deploy Growth features to 100% of newcomers of small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724205 (https://phabricator.wikimedia.org/T291876) [23:10:40] (03PS2) 10Urbanecm: Deploy Growth features to 100% of newcomers of small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724205 (https://phabricator.wikimedia.org/T291876) [23:11:47] (03PS3) 10Urbanecm: Deploy Growth features to 100% of newcomers of small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724205 (https://phabricator.wikimedia.org/T291876) [23:12:41] (03PS4) 10Urbanecm: Deploy Growth features to 100% of newcomers of small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724205 (https://phabricator.wikimedia.org/T291876) [23:13:16] (03PS1) 10Krinkle: speed-tests: Add "Oceanic.enwiki.1046871765" snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724206 (https://phabricator.wikimedia.org/T124966) [23:13:25] (03CR) 10Urbanecm: [C: 03+2] Deploy Growth features to 100% of newcomers of small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724205 (https://phabricator.wikimedia.org/T291876) (owner: 10Urbanecm) [23:14:10] (03Merged) 10jenkins-bot: Deploy Growth features to 100% of newcomers of small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724205 (https://phabricator.wikimedia.org/T291876) (owner: 10Urbanecm) [23:17:17] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1891d28b387bc5f7521907d31c04544d2aa271d8: Deploy Growth features to 100% of newcomers of small wikis (T291876) (duration: 00m 57s) [23:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:24] T291876: Growth: Enforce "less than 500 registrations" means "no A/B" testing in deployments - https://phabricator.wikimedia.org/T291876 [23:19:05] 
urbanecm: let me know when done, would like to push out the new speed test [23:19:18] Krinkle: go ahead when ready :) [23:19:38] okay [23:19:40] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10Papaul) @MoritzMuehlenhoff I don't know if you saw my comment on Sep 10th, but I am having issues installing Bullseye. I am... [23:22:06] (03CR) 10Krinkle: [C: 03+2] speed-tests: Add "Oceanic.enwiki.1046871765" snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724206 (https://phabricator.wikimedia.org/T124966) (owner: 10Krinkle) [23:22:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) [23:22:55] (03Merged) 10jenkins-bot: speed-tests: Add "Oceanic.enwiki.1046871765" snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724206 (https://phabricator.wikimedia.org/T124966) (owner: 10Krinkle) [23:27:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:16] !log krinkle@deploy1002 Synchronized docroot/wikipedia.org/speed-tests/: I82f0727e8eb5b6e5e3e873f9bf0155cbd8668b53 (duration: 00m 59s) [23:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:10] PROBLEM - Apache HTTP on wtp1026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1939 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers