[00:28:21] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:03:25] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:44:36] Are there any sysadmins about with VRTS info-en access? Someone's reported a connectivity issue (with a traceroute)
[01:44:51] I tried to forward it to noc@ but Znuny won't let me. Ticket is https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketID=11830161
[02:02:55] AntiComposite: I would encourage them to file a task in phabricator, and include all the relevant information https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue
[02:04:13] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:02] Puppet, SRE: Upgrade Puppet to 5.5.21 - https://phabricator.wikimedia.org/T248168 (MacFan4000) I will note that Puppet 5.5 has been declared end of life as of November 2020
[03:34:51] (PS1) Andrew Bogott: Toolforge bastions: add a broken shell for disabled tools [puppet] - https://gerrit.wikimedia.org/r/699577 (https://phabricator.wikimedia.org/T170355)
[03:36:18] (CR) jerkins-bot: [V: -1] Toolforge bastions: add a broken shell for disabled tools [puppet] - https://gerrit.wikimedia.org/r/699577 (https://phabricator.wikimedia.org/T170355) (owner: Andrew Bogott)
[03:44:17] (PS2) Andrew Bogott: Toolforge bastions: add a broken shell for disabled tools [puppet] - https://gerrit.wikimedia.org/r/699577 (https://phabricator.wikimedia.org/T170355)
[04:44:36] SRE, ops-codfw, DBA: Degraded RAID on pc2012 - https://phabricator.wikimedia.org/T284845 (Marostegui) Open→Invalid This was the raid being built after the install: ` root@pc2012:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name...
[04:44:47] SRE, Wikimedia-Mailing-lists: Please close the wmfkids@ mailing list - https://phabricator.wikimedia.org/T284683 (greg) Go ahead and fully remove/delete archives. Thanks!
[04:52:11] SRE, ops-codfw, DBA: Degraded RAID on db2148 - https://phabricator.wikimedia.org/T284852 (Marostegui) p:Triage→Medium a:Papaul This is not correct, the RAID is ok and so is the BBU: Something strange happened, as the RAID was degraded but then recovered: ` [8023918.895348] megaraid_sas...
[04:54:38] SRE, ops-codfw, DBA: Degraded RAID on pc2014 - https://phabricator.wikimedia.org/T284849 (Marostegui) Open→Invalid This was the raid being built after the install: ` root@pc2014:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name...
[04:59:31] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install pc2011-pc2014 - https://phabricator.wikimedia.org/T282482 (Marostegui) Thank you!
[05:02:46] (PS1) Marostegui: pc201[1-4]: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/699580 (https://phabricator.wikimedia.org/T284825)
[05:03:20] (CR) Marostegui: [C: +2] pc201[1-4]: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/699580 (https://phabricator.wikimedia.org/T284825) (owner: Marostegui)
[05:15:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3311 for schema change', diff saved to https://phabricator.wikimedia.org/P16437 and previous config saved to /var/cache/conftool/dbconfig/20210614-051522-marostegui.json
[05:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 25%: Repool db1099:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P16438 and previous config saved to /var/cache/conftool/dbconfig/20210614-051608-root.json
[05:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P16439 and previous config saved to /var/cache/conftool/dbconfig/20210614-051930-marostegui.json
[05:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 25%: Repool db1113:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16440 and previous config saved to /var/cache/conftool/dbconfig/20210614-052715-root.json
[05:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 50%: Repool db1099:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P16441 and previous config saved to /var/cache/conftool/dbconfig/20210614-053112-root.json
[05:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:02] SRE, DC-Ops, SRE-tools, netops, Patch-For-Review: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (Papaul) With the same lab environment, the same command will upgrade the BIOS just by changing the file. ` sud...
[05:38:31] SRE: Deploy Elia MT key in Production for ContentTranslation - https://phabricator.wikimedia.org/T284887 (Majavah)
[05:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 50%: Repool db1113:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16442 and previous config saved to /var/cache/conftool/dbconfig/20210614-054219-root.json
[05:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:40] SRE: Deploy Elia MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T284887 (KartikMistry)
[05:46:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 75%: Repool db1099:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P16443 and previous config saved to /var/cache/conftool/dbconfig/20210614-054615-root.json
[05:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 75%: Repool db1113:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16444 and previous config saved to /var/cache/conftool/dbconfig/20210614-055723-root.json
[05:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 100%: Repool db1099:3311 after schema change', diff saved to https://phabricator.wikimedia.org/P16445 and previous config saved to /var/cache/conftool/dbconfig/20210614-060119-root.json
[06:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:01] (CR) Ryan Kemper: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/698975 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[06:07:37] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:12:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 100%: Repool db1113:3316 after schema change', diff saved to https://phabricator.wikimedia.org/P16446 and previous config saved to /var/cache/conftool/dbconfig/20210614-061226-root.json
[06:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:58] (PS2) KartikMistry: Add support for Elia MT to cxserver [deployment-charts] - https://gerrit.wikimedia.org/r/699089 (https://phabricator.wikimedia.org/T276059)
[06:24:58] (CR) Elukey: [C: -1] "Sadly this chart seems to have a problem. All the pods come up correctly and knative looks healthy, but then if I try to deploy kfserving " [deployment-charts] - https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: Elukey)
[06:25:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1180 for schema change', diff saved to https://phabricator.wikimedia.org/P16447 and previous config saved to /var/cache/conftool/dbconfig/20210614-062554-marostegui.json
[06:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repool db1180 after schema change', diff saved to https://phabricator.wikimedia.org/P16448 and previous config saved to /var/cache/conftool/dbconfig/20210614-063231-root.json
[06:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:03] !log installing libwep security updates on buster
[06:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:07] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:18] (PS1) Volans: admin: add LDAP-only new accounts [puppet] - https://gerrit.wikimedia.org/r/699686 (https://phabricator.wikimedia.org/T284832)
[06:47:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repool db1180 after schema change', diff saved to https://phabricator.wikimedia.org/P16449 and previous config saved to /var/cache/conftool/dbconfig/20210614-064734-root.json
[06:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:41] (PS1) Muehlenhoff: Add library hint for libwebp [puppet] - https://gerrit.wikimedia.org/r/699687
[06:52:54] (CR) Muehlenhoff: [C: +2] Add library hint for libwebp [puppet] - https://gerrit.wikimedia.org/r/699687 (owner: Muehlenhoff)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210614T0700)
[07:01:19] !log restarting mw canaries to pick up libwebp security updates
[07:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:11] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/699686 (https://phabricator.wikimedia.org/T284832) (owner: Volans)
[07:02:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repool db1180 after schema change', diff saved to https://phabricator.wikimedia.org/P16450 and previous config saved to /var/cache/conftool/dbconfig/20210614-070238-root.json
[07:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:17] (CR) Volans: [C: +2] admin: add LDAP-only new accounts [puppet] - https://gerrit.wikimedia.org/r/699686 (https://phabricator.wikimedia.org/T284832) (owner: Volans)
[07:14:20] SRE: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (Volans) p:Triage→Medium
[07:15:45] !log restart blazegraph and depool wdqs1012
[07:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:13] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1012 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:17:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Repool db1180 after schema change', diff saved to https://phabricator.wikimedia.org/P16451 and previous config saved to /var/cache/conftool/dbconfig/20210614-071742-root.json
[07:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168 for schema change', diff saved to https://phabricator.wikimedia.org/P16452 and previous config saved to /var/cache/conftool/dbconfig/20210614-071839-marostegui.json
[07:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:57] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:25:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Repool db1168 after schema change', diff saved to https://phabricator.wikimedia.org/P16453 and previous config saved to /var/cache/conftool/dbconfig/20210614-072520-root.json
[07:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:39] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:26:27] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (valerio.bozzolan) >>! In T257066#7013930, @Jan.Kamenicek wrote: > While one would expect that such a crucial...
[07:27:09] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2148 T284852', diff saved to https://phabricator.wikimedia.org/P16454 and previous config saved to /var/cache/conftool/dbconfig/20210614-072930-marostegui.json
[07:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:39] T284852: Degraded RAID on db2148 - https://phabricator.wikimedia.org/T284852
[07:30:11] !log Reboot db2148 T284852
[07:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:33] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:34:53] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:36:07] SRE, DNS, Traffic, Wikimedia-GitHub, Release-Engineering-Team (Kanban): Github: add verified domain - https://phabricator.wikimedia.org/T207364 (hashar) That last change ( https://gerrit.wikimedia.org/r/661180 ) was required since we changed the GitHub organization page from https://www.wikim...
[07:37:23] SRE, ops-codfw, DBA: Degraded RAID on db2148 - https://phabricator.wikimedia.org/T284852 (Marostegui) I have rebooted the host and everything came as normal, all disks online, raid optimal... Leaving this open until @Papaul confirms he wasn't touching these disks while on-site.
[07:39:32] SRE, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (tstarling) We have a budget for it now, but nobody to actually do it. But the plan is to re-enable LilyPond...
[07:40:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Repool db1168 after schema change', diff saved to https://phabricator.wikimedia.org/P16455 and previous config saved to /var/cache/conftool/dbconfig/20210614-074024-root.json
[07:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:43] (CR) Amire80: [C: +1] "Looks good to me, but should also be verified by someone with good understanding of the intricacies of Wikidata language code handling." [mediawiki-config] - https://gerrit.wikimedia.org/r/699540 (https://phabricator.wikimedia.org/T283168) (owner: Mbch331)
[07:49:12] (PS1) Marostegui: dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/699691
[07:50:09] (CR) Marostegui: [C: +2] dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/699691 (owner: Marostegui)
[07:51:07] !log Depool clouddb1013 to upgrade mysql
[07:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Repool db1168 after schema change', diff saved to https://phabricator.wikimedia.org/P16456 and previous config saved to /var/cache/conftool/dbconfig/20210614-075528-root.json
[07:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:47] (PS1) Amire80: [WIP] Update autonyms for kea, ota, sjd in wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870)
[07:55:49] (PS1) Marostegui: Revert "dbproxy1019: Depool clouddb1013" [puppet] - https://gerrit.wikimedia.org/r/699554
[07:56:19] (PS1) Filippo Giunchedi: alertmanager: add libera.chat nickserv identify patterns [puppet] - https://gerrit.wikimedia.org/r/699693
[07:56:42] (CR) Marostegui: [C: +2] Revert "dbproxy1019: Depool clouddb1013" [puppet] - https://gerrit.wikimedia.org/r/699554 (owner: Marostegui)
[07:57:46] (PS2) Filippo Giunchedi: alertmanager: add libera.chat nickserv identify patterns [puppet] - https://gerrit.wikimedia.org/r/699693
[07:58:08] seeking reviewers for ^
[08:05:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2148', diff saved to https://phabricator.wikimedia.org/P16458 and previous config saved to /var/cache/conftool/dbconfig/20210614-080552-marostegui.json
[08:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:14] SRE: Deploy Elia MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T284887 (Volans) p:Triage→Medium @KartikMistry the #sre tag is the right one ;) I guess you can gpg-encrypt them with my key and then either send them via email or drop them on a host where you can a...
[08:09:11] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:10:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Repool db1168 after schema change', diff saved to https://phabricator.wikimedia.org/P16459 and previous config saved to /var/cache/conftool/dbconfig/20210614-081031-root.json
[08:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1165 for schema change', diff saved to https://phabricator.wikimedia.org/P16460 and previous config saved to /var/cache/conftool/dbconfig/20210614-081239-marostegui.json
[08:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:51] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:16:41] RECOVERY - etcd request latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:29:05] SRE, Language-Team (Language-2021-April-June): Deploy Elia MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T284887 (KartikMistry)
[08:29:33] (CR) Jbond: [C: +1] "LGTM but adding Arial just in-case" [puppet] - https://gerrit.wikimedia.org/r/699506 (owner: Paladox)
[08:30:45] (CR) Jbond: [C: +1] "LGTM" [software/homer] - https://gerrit.wikimedia.org/r/698463 (owner: Ayounsi)
[08:32:06] (CR) Volans: [C: +1] "Ship it as long as 3.9 tests passes locally as CI doesn't yet test them ;)" [software/homer] - https://gerrit.wikimedia.org/r/698463 (owner: Ayounsi)
[08:32:16] (CR) Jbond: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/698512 (https://phabricator.wikimedia.org/T167306) (owner: Ayounsi)
[08:36:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Repool db1165 after schema change', diff saved to https://phabricator.wikimedia.org/P16461 and previous config saved to /var/cache/conftool/dbconfig/20210614-083614-root.json
[08:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:41] (CR) Jbond: [C: +1] "lgtm but see nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/699255 (owner: Muehlenhoff)
[08:37:20] (CR) Filippo Giunchedi: [C: +2] alertmanager: add libera.chat nickserv identify patterns [puppet] - https://gerrit.wikimedia.org/r/699693 (owner: Filippo Giunchedi)
[08:42:50] SRE, DC-Ops, SRE-tools, netops, Patch-For-Review: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (jbond) > could be run from cumin servers which already have access to the mgmt network so that seems a good o...
[08:46:17] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Repool db1165 after schema change', diff saved to https://phabricator.wikimedia.org/P16462 and previous config saved to /var/cache/conftool/dbconfig/20210614-085118-root.json
[08:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Repool db1165 after schema change', diff saved to https://phabricator.wikimedia.org/P16463 and previous config saved to /var/cache/conftool/dbconfig/20210614-090622-root.json
[09:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:17] (CR) Muehlenhoff: Update sudo permission to use run-puppet-agent (1 comment) [puppet] - https://gerrit.wikimedia.org/r/699255 (owner: Muehlenhoff)
[09:08:22] (PS2) Muehlenhoff: Update sudo permission to use run-puppet-agent [puppet] - https://gerrit.wikimedia.org/r/699255
[09:10:51] (PS4) MMandere: prometheus: Add dependency between varnish exporter and varnish service [puppet] - https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660)
[09:12:39] (CR) Volans: [C: +1] "LGTM, thx" [puppet] - https://gerrit.wikimedia.org/r/699255 (owner: Muehlenhoff)
[09:14:34] (CR) Filippo Giunchedi: [C: +1] varnish: remove ats-be migration leftover from varnishttfb [puppet] - https://gerrit.wikimedia.org/r/699377 (https://phabricator.wikimedia.org/T241239) (owner: Ema)
[09:15:16] (CR) Jbond: [C: +1] Update sudo permission to use run-puppet-agent [puppet] - https://gerrit.wikimedia.org/r/699255 (owner: Muehlenhoff)
[09:17:07] (CR) Jbond: [C: +2] "I originally read this as been realted to wiki data hence adding Ariel, as this is wiki stats ill go ahead and merege, ping me if any issu" [puppet] - https://gerrit.wikimedia.org/r/699506 (owner: Paladox)
[09:19:57] (CR) Ayounsi: [C: +2] Add Python 3.9 support [software/homer] - https://gerrit.wikimedia.org/r/698463 (owner: Ayounsi)
[09:21:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Repool db1165 after schema change', diff saved to https://phabricator.wikimedia.org/P16464 and previous config saved to /var/cache/conftool/dbconfig/20210614-092125-root.json
[09:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:49] (CR) MMandere: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) (owner: MMandere)
[09:22:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 for schema change', diff saved to https://phabricator.wikimedia.org/P16465 and previous config saved to /var/cache/conftool/dbconfig/20210614-092234-marostegui.json
[09:22:36] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1008.eqiad.wmnet
[09:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:47] (CR) Arturo Borrero Gonzalez: [C: +1] Toolforge bastions: add a broken shell for disabled tools [puppet] - https://gerrit.wikimedia.org/r/699577 (https://phabricator.wikimedia.org/T170355) (owner: Andrew Bogott)
[09:27:33] (PS1) Jbond: O:netbox: switch netbox production serveres to CAS sso authentication [puppet] - https://gerrit.wikimedia.org/r/699712
[09:33:20] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/699712 (owner: Jbond)
[09:33:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16466 and previous config saved to /var/cache/conftool/dbconfig/20210614-093329-root.json
[09:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:30] SRE, serviceops-radar, Release Pipeline (Blubber), Release-Engineering-Team (Doing): build and import blubber package for buster and bullseye (which supports v4) - https://phabricator.wikimedia.org/T283891 (hashar) We do not use the Debian package anymore. The one published for Stretch is from O...
[09:42:51] (CR) Ayounsi: [C: +2] Add profile::contact to multiple roles/profiles [puppet] - https://gerrit.wikimedia.org/r/699209 (owner: Ayounsi)
[09:43:23] (PS4) Jbond: (DO NOT merge) P:sretest: test change to check hiera dot notation [puppet] - https://gerrit.wikimedia.org/r/697629 (https://phabricator.wikimedia.org/T256221)
[09:45:02] (CR) jerkins-bot: [V: -1] (DO NOT merge) P:sretest: test change to check hiera dot notation [puppet] - https://gerrit.wikimedia.org/r/697629 (https://phabricator.wikimedia.org/T256221) (owner: Jbond)
[09:47:47] (PS5) Jbond: (DO NOT merge) P:sretest: test change to check hiera dot notation [puppet] - https://gerrit.wikimedia.org/r/697629 (https://phabricator.wikimedia.org/T256221)
[09:48:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16467 and previous config saved to /var/cache/conftool/dbconfig/20210614-094832-root.json
[09:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:21] (CR) MMandere: [C: +2] prometheus: Add dependency between varnish exporter and varnish service [puppet] - https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) (owner: MMandere)
[09:50:37] (PS6) Jbond: (DO NOT merge) P:sretest: test change to check hiera dot notation [puppet] - https://gerrit.wikimedia.org/r/697629 (https://phabricator.wikimedia.org/T256221)
[09:51:35] SRE, serviceops-radar, Release Pipeline (Blubber), Release-Engineering-Team (Doing): build and import blubber package for buster and bullseye (which supports v4) - https://phabricator.wikimedia.org/T283891 (MoritzMuehlenhoff) >>! In T283891#7154574, @hashar wrote: > I believe the Debian package h...
[09:52:50] (PS7) Jbond: (DO NOT merge) P:sretest: test change to check hiera dot notation [puppet] - https://gerrit.wikimedia.org/r/697629 (https://phabricator.wikimedia.org/T256221)
[09:53:24] (CR) Dat Nguyen: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/698751 (https://phabricator.wikimedia.org/T274157) (owner: Dat Nguyen)
[09:53:28] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29878/console" [puppet] - https://gerrit.wikimedia.org/r/697629 (https://phabricator.wikimedia.org/T256221) (owner: Jbond)
[09:54:08] !log jbond@deploy1002 Started deploy [netbox/deploy@e9f2382]: deploy v2.10.4-wmf4
[09:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:19] (CR) jerkins-bot: [V: -1] (DO NOT merge) P:sretest: test change to check hiera dot notation [puppet] - https://gerrit.wikimedia.org/r/697629 (https://phabricator.wikimedia.org/T256221) (owner: Jbond)
[09:56:45] !log jbond@deploy1002 Finished deploy [netbox/deploy@e9f2382]: deploy v2.10.4-wmf4 (duration: 02m 37s)
[09:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:08] (PS3) Muehlenhoff: Update sudo permission to use run-puppet-agent [puppet] - https://gerrit.wikimedia.org/r/699255
[09:59:54] (PS1) Giuseppe Lavagetto: Cleanup: remove the extract method, now unused. [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699716
[09:59:56] (PS1) Giuseppe Lavagetto: Image module refactoring (step 1) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699717
[09:59:58] (PS1) Giuseppe Lavagetto: Slim down DockerDriver [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699718
[10:00:00] (PS1) Giuseppe Lavagetto: Introduce DriverInterface [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699719
[10:00:02] (PS1) Giuseppe Lavagetto: Add the Kaniko driver [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699720
[10:00:37] (CR) Muehlenhoff: [C: +2] Update sudo permission to use run-puppet-agent [puppet] - https://gerrit.wikimedia.org/r/699255 (owner: Muehlenhoff)
[10:02:26] (CR) Jbond: [V: +1 C: +2] puppetmaster: update the private repo pre-commit hook to check staged [puppet] - https://gerrit.wikimedia.org/r/699196 (https://phabricator.wikimedia.org/T278187) (owner: Jbond)
[10:02:28] (CR) jerkins-bot: [V: -1] Image module refactoring (step 1) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699717 (owner: Giuseppe Lavagetto)
[10:02:41] (CR) jerkins-bot: [V: -1] Slim down DockerDriver [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699718 (owner: Giuseppe Lavagetto)
[10:03:00] (CR) jerkins-bot: [V: -1] Introduce DriverInterface [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699719 (owner: Giuseppe Lavagetto)
[10:03:13] (CR) jerkins-bot: [V: -1] Cleanup: remove the extract method, now unused. [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699716 (owner: Giuseppe Lavagetto)
[10:03:15] (CR) jerkins-bot: [V: -1] Add the Kaniko driver [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699720 (owner: Giuseppe Lavagetto)
[10:03:20] SRE, Language-Team (Language-2021-April-June): Deploy Elia MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T284887 (Volans) @KartikMistry secrets received and added to the private puppet repository. Puppet run on the deployment server `deploy1002`. The keys: ` mt: E...
[10:03:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16469 and previous config saved to /var/cache/conftool/dbconfig/20210614-100336-root.json
[10:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:05] Puppet, SRE, SRE-tools, Patch-For-Review, User-jbond: Private puppet commit hook checks current state of folder, not what is staged - https://phabricator.wikimedia.org/T278187 (jbond) Open→Resolved a:jbond I have deployed a fix to this, closing but please re-open if you still see...
[10:13:29] 10Puppet, 10SRE, 10SRE-tools, 10User-jbond: Private puppet commit hook checks current state of folder, not what is staged - https://phabricator.wikimedia.org/T278187 (10jbond) 05Resolved→03Open reopening, the new hook prevents file deletions [10:18:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: Repool db1131 after schema change', diff saved to https://phabricator.wikimedia.org/P16471 and previous config saved to /var/cache/conftool/dbconfig/20210614-101839-root.json [10:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2001.codfw.wmnet [10:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:50] (03PS2) 10JMeybohm: Replace all consumers of docker-registry credentials with alias [labs/private] - 10https://gerrit.wikimedia.org/r/699414 [10:29:56] (03PS1) 10JMeybohm: Fix usernames for docker-registry httpbb test [labs/private] - 10https://gerrit.wikimedia.org/r/699724 [10:30:06] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix usernames for docker-registry httpbb test [labs/private] - 10https://gerrit.wikimedia.org/r/699724 (owner: 10JMeybohm) [10:31:56] (03PS1) 10JMeybohm: httpbb-test: Fix usernames for docker-registry httpbb test [puppet] - 10https://gerrit.wikimedia.org/r/699725 [10:33:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2001.codfw.wmnet [10:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:45] !log disable puppet on mc* hosts [10:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:14] (03CR) 10Effie Mouzeli: [C: 03+2] profile::memcached::instance: Add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/694465 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [10:41:16] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29879/console" [puppet] - 10https://gerrit.wikimedia.org/r/699725 (owner: 10JMeybohm) [10:41:31] (03PS16) 10Effie Mouzeli: profile::memcached::instance: Add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/694465 (https://phabricator.wikimedia.org/T271967) [10:45:03] (03PS2) 10Giuseppe Lavagetto: Cleanup: remove the extract method, now unused. [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699716 [10:45:05] (03PS2) 10Giuseppe Lavagetto: Image module refactoring (step 1) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699717 [10:45:07] (03PS2) 10Giuseppe Lavagetto: Slim down DockerDriver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699718 [10:45:09] (03PS2) 10Giuseppe Lavagetto: Introduce DriverInterface [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699719 [10:45:11] (03PS2) 10Giuseppe Lavagetto: Add the Kaniko driver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699720 [10:45:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2003.codfw.wmnet [10:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:07] 10SRE, 10Language-Team (Language-2021-April-June): Deploy Elia MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T284887 (10KartikMistry) >>! In T284887#7154670, @Volans wrote: > @KartikMistry secrets received and added to the private puppet repository. Puppet run on the deplo... 
[10:46:14] (03PS3) 10KartikMistry: Add support for Elia MT to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/699089 (https://phabricator.wikimedia.org/T276059) [10:46:34] (03PS1) 10Jbond: O:puppetmaster: only check added or modified files [puppet] - 10https://gerrit.wikimedia.org/r/699726 [10:46:43] !log enable puppet on mc* [10:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:51] (03CR) 10jerkins-bot: [V: 04-1] Introduce DriverInterface [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699719 (owner: 10Giuseppe Lavagetto) [10:46:53] (03CR) 10jerkins-bot: [V: 04-1] Slim down DockerDriver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699718 (owner: 10Giuseppe Lavagetto) [10:46:58] (03CR) 10jerkins-bot: [V: 04-1] Add the Kaniko driver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699720 (owner: 10Giuseppe Lavagetto) [10:47:25] (03CR) 10Jbond: [C: 03+2] "Will merge this now as it's causing issues; however, post-review is most welcome" [puppet] - 10https://gerrit.wikimedia.org/r/699726 (owner: 10Jbond) [10:51:57] (03PS1) 10Effie Mouzeli: hieradata: enable tls on codfw gutter pool [puppet] - 10https://gerrit.wikimedia.org/r/699727 (https://phabricator.wikimedia.org/T271967) [10:52:00] !log T283163: Adding "metric-out minimum-igp" to all internal/Confed BGP groups on CR routers. [10:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:04] T283163: BGP Policy on aggregate routes prevents them being created in some circumstances. 
- https://phabricator.wikimedia.org/T283163 [10:52:26] (03CR) 10jerkins-bot: [V: 04-1] hieradata: enable tls on codfw gutter pool [puppet] - 10https://gerrit.wikimedia.org/r/699727 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [10:52:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2003.codfw.wmnet [10:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:06] (03PS2) 10Effie Mouzeli: hieradata: enable tls on codfw gutter pool [puppet] - 10https://gerrit.wikimedia.org/r/699727 (https://phabricator.wikimedia.org/T271967) [10:59:57] 10Puppet, 10SRE, 10SRE-tools, 10User-jbond: Private puppet commit hook checks current state of folder, not what is staged - https://phabricator.wikimedia.org/T278187 (10jbond) 05Open→03Resolved Have updated the hook to only work on added or modified files [11:00:03] 10SRE, 10docker-pkg, 10serviceops: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (10Jelto) @Joe I think a //periodic job rebuilding the images// is implemented already. See [modules/docker/manifests/baseimages.pp#65](https://gerrit.wikimedia.org/r/plugins/gitiles/operat... [11:04:23] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1002/29881/mc-gp2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/699727 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [11:05:24] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable tls on codfw gutter pool [puppet] - 10https://gerrit.wikimedia.org/r/699727 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [11:07:36] 10SRE, 10docker-pkg, 10serviceops: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (10Joe) Hi @Jelto this task is about `production-images` that, in insider jargon, means the images built on top of that base layer that is already being rebuilt every sunday. Those can be f... 
[11:09:40] !log restart memcached on codfw memcached gutter pool (mc-gp2* hosts) [11:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:25] PROBLEM - MariaDB Replica Lag: s2 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1092.78 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:16:17] RECOVERY - MariaDB Replica Lag: s2 on db2097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:17:58] (03PS1) 10Effie Mouzeli: hieradata: enable TLS for memcached on mc2019 [puppet] - 10https://gerrit.wikimedia.org/r/699730 [11:18:42] (03CR) 10jerkins-bot: [V: 04-1] hieradata: enable TLS for memcached on mc2019 [puppet] - 10https://gerrit.wikimedia.org/r/699730 (owner: 10Effie Mouzeli) [11:19:26] (03PS2) 10Effie Mouzeli: hieradata: enable TLS for memcached on mc2019 [puppet] - 10https://gerrit.wikimedia.org/r/699730 (https://phabricator.wikimedia.org/T694484) [11:25:15] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable TLS for memcached on mc2019 [puppet] - 10https://gerrit.wikimedia.org/r/699730 (https://phabricator.wikimedia.org/T694484) (owner: 10Effie Mouzeli) [11:25:36] (03PS3) 10Effie Mouzeli: hieradata: enable TLS for memcached on mc2019 [puppet] - 10https://gerrit.wikimedia.org/r/699730 (https://phabricator.wikimedia.org/T694484) [11:27:51] (03CR) 10Effie Mouzeli: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/29883/" [puppet] - 10https://gerrit.wikimedia.org/r/699730 (https://phabricator.wikimedia.org/T694484) (owner: 10Effie Mouzeli) [11:28:54] !log restart memcached on mc2019 [11:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:12] 10SRE, 10Traffic, 10netops: BGP Policy on aggregate routes prevents them being created in some circumstances. 
- https://phabricator.wikimedia.org/T283163 (10cmooney) This configuration has been rolled out now across all CR routers. All looks ok, some slight increase in traffic in via eqord, and slight decre... [11:42:43] 10SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (10Volans) p:05Triage→03Medium [11:44:45] 10SRE, 10Language-Team (Language-2021-April-June): Deploy Elia MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T284887 (10Volans) 05Open→03Resolved a:03Volans Resolving as the secrets are available. Kartik will take care of the deployment when it's the right time. [11:46:34] 10SRE, 10SRE-Access-Requests: Add jgianellos and mbsantos to maps-root group - https://phabricator.wikimedia.org/T284135 (10MSantos) 05Open→03Resolved a:03Volans Thanks, @Volans! It's working just fine. [12:01:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1148 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16472 and previous config saved to /var/cache/conftool/dbconfig/20210614-120112-marostegui.json [12:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:53] RECOVERY - MariaDB memory on db1148 is OK: OK Memory 74% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:03:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2004.codfw.wmnet [12:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=redis_maps site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:08:19] PROBLEM - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 3 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:09:33] RECOVERY - Prometheus jobs reduced 
availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:09:45] RECOVERY - Juniper alarms on cr2-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:10:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2004.codfw.wmnet [12:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1174 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16473 and previous config saved to /var/cache/conftool/dbconfig/20210614-121031-marostegui.json [12:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1148', diff saved to https://phabricator.wikimedia.org/P16474 and previous config saved to /var/cache/conftool/dbconfig/20210614-121101-marostegui.json [12:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:15] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:13:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2005.codfw.wmnet [12:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:17] PROBLEM - Host cp3050.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:14:17] PROBLEM - Host cp3051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:14:17] PROBLEM - Host cp3054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:14:17] PROBLEM - Host cp3053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:14:23] PROBLEM - Host ganeti3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% 
[12:14:35] PROBLEM - Host lvs3005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:16:09] expected, power maintenance in esams [12:16:13] RECOVERY - Host lvs3005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 107.57 ms [12:17:38] !log configure OSPF link-protection on cr3-ulsfo:xe-0/1/1 - T167306 [12:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:43] T167306: ospf link-protection - https://phabricator.wikimedia.org/T167306 [12:19:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2005.codfw.wmnet [12:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:09] RECOVERY - Host cp3050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 107.58 ms [12:20:09] RECOVERY - Host cp3051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 107.53 ms [12:20:09] RECOVERY - Host cp3054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 108.13 ms [12:20:09] RECOVERY - Host cp3053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 107.55 ms [12:20:15] RECOVERY - Host ganeti3001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 107.61 ms [12:20:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P16475 and previous config saved to /var/cache/conftool/dbconfig/20210614-122036-root.json [12:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1034 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16476 and previous config saved to /var/cache/conftool/dbconfig/20210614-122212-marostegui.json [12:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:25] !log re-pooling wdqs1012 [12:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give some 
weight to es1028 while es1034 gets upgraded', diff saved to https://phabricator.wikimedia.org/P16477 and previous config saved to /var/cache/conftool/dbconfig/20210614-122242-marostegui.json [12:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Restore es1028 original weight', diff saved to https://phabricator.wikimedia.org/P16478 and previous config saved to /var/cache/conftool/dbconfig/20210614-122322-marostegui.json [12:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:32] PROBLEM - Host cp3062.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:23:48] PROBLEM - Host re0.cr3-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:24:38] PROBLEM - Host ganeti3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:26:30] PROBLEM - Host scs-oe16-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:28:42] RECOVERY - Host cp3062.mgmt is UP: PING OK - Packet loss = 0%, RTA = 107.59 ms [12:28:58] RECOVERY - Host re0.cr3-esams is UP: PING OK - Packet loss = 0%, RTA = 118.68 ms [12:29:50] RECOVERY - Host ganeti3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 107.67 ms [12:31:44] RECOVERY - Host scs-oe16-esams is UP: PING OK - Packet loss = 0%, RTA = 107.43 ms [12:34:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 10%: Repool es1034 after upgrade', diff saved to https://phabricator.wikimedia.org/P16479 and previous config saved to /var/cache/conftool/dbconfig/20210614-123427-root.json [12:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1033 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16480 and previous config saved to /var/cache/conftool/dbconfig/20210614-123512-marostegui.json [12:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:40] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P16481 and previous config saved to /var/cache/conftool/dbconfig/20210614-123539-root.json [12:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:02] (03PS1) 10Effie Mouzeli: hieradata: enable TLS for memcached on all codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/699738 [12:36:38] (03CR) 10jerkins-bot: [V: 04-1] hieradata: enable TLS for memcached on all codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/699738 (owner: 10Effie Mouzeli) [12:37:14] (03PS2) 10Effie Mouzeli: hieradata: enable TLS for memcached on all codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/699738 (https://phabricator.wikimedia.org/T694484) [12:37:16] !log configure OSPF link-protection on cr3/4-ulsfo - T167306 [12:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:21] T167306: ospf link-protection - https://phabricator.wikimedia.org/T167306 [12:37:50] (03CR) 10jerkins-bot: [V: 04-1] hieradata: enable TLS for memcached on all codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/699738 (https://phabricator.wikimedia.org/T694484) (owner: 10Effie Mouzeli) [12:39:51] (03PS3) 10Effie Mouzeli: hieradata: enable TLS for memcached on all codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/699738 (https://phabricator.wikimedia.org/T271967) [12:39:59] (03PS4) 10Effie Mouzeli: hieradata: enable TLS for memcached on all codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/699738 (https://phabricator.wikimedia.org/T271967) [12:40:46] (03Abandoned) 10Effie Mouzeli: (WIP) hieradata: enable tls on mc2019 (3) [puppet] - 10https://gerrit.wikimedia.org/r/694484 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [12:45:01] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/compiler1003/29884/" [puppet] - 
10https://gerrit.wikimedia.org/r/699738 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [12:46:47] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hieradata: enable TLS for memcached on all codfw hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699738 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [12:49:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 25%: Repool es1034 after upgrade', diff saved to https://phabricator.wikimedia.org/P16482 and previous config saved to /var/cache/conftool/dbconfig/20210614-124931-root.json [12:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P16483 and previous config saved to /var/cache/conftool/dbconfig/20210614-125043-root.json [12:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:34] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2148 - https://phabricator.wikimedia.org/T284852 (10Papaul) @Marostegui I haven't been on rack B8 for the past 2 weeks, so no, I was not touching these disks while on-site [12:51:42] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable TLS for memcached on all codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/699738 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [12:53:07] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable TLS for memcached on all codfw hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/699738 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [12:53:43] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2148 - https://phabricator.wikimedia.org/T284852 (10Marostegui) 05Open→03Resolved Thanks - closing this. It might have been a glitch. 
[12:54:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: Repool es1033 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16484 and previous config saved to /var/cache/conftool/dbconfig/20210614-125442-root.json [12:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:22] 10SRE, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): ospf link-protection - https://phabricator.wikimedia.org/T167306 (10ayounsi) Deployed to cr3 and cr4-ulsfo. Some interesting (and expected) finding: No brainer, backup from cr3-ulsfo to mr1-ulsfo is cr4-ulsfo `name=mr1-ulsfo cr3-ul... [12:55:45] jbond: thanks for the update [12:59:24] np :) [12:59:40] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (10Papaul) [12:59:54] (03CR) 10DCausse: [C: 03+1] Add pool counter for automated search requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699257 (https://phabricator.wikimedia.org/T284479) (owner: 10Ebernhardson) [13:04:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 50%: Repool es1034 after upgrade', diff saved to https://phabricator.wikimedia.org/P16485 and previous config saved to /var/cache/conftool/dbconfig/20210614-130435-root.json [13:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Repool db1174 after schema change', diff saved to https://phabricator.wikimedia.org/P16486 and previous config saved to /var/cache/conftool/dbconfig/20210614-130547-root.json [13:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1032 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16487 and previous config saved to 
/var/cache/conftool/dbconfig/20210614-130723-marostegui.json [13:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:06] 10SRE, 10DC-Ops, 10SRE-tools, 10netops, 10Patch-For-Review: Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution) - https://phabricator.wikimedia.org/T283771 (10Papaul) @jbond Yes uploading to IDRAC interfaces is very slow. [13:09:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: Repool es1033 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16488 and previous config saved to /var/cache/conftool/dbconfig/20210614-130946-root.json [13:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:06] (03CR) 10Ottomata: [C: 03+1] archiva: Switch to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/699379 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:18:32] (03CR) 10Ottomata: [C: 03+1] archiva: Switch to profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/699378 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:19:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 75%: Repool es1034 after upgrade', diff saved to https://phabricator.wikimedia.org/P16489 and previous config saved to /var/cache/conftool/dbconfig/20210614-131938-root.json [13:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: Repool es1032 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16490 and previous config saved to /var/cache/conftool/dbconfig/20210614-132000-root.json [13:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:12] (03PS2) 10Ottomata: Finalize backend EP migration of 4 EL schemas [puppet] - 10https://gerrit.wikimedia.org/r/699002 
(https://phabricator.wikimedia.org/T282855) [13:22:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1170:3312 db1170:3317 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16491 and previous config saved to /var/cache/conftool/dbconfig/20210614-132235-marostegui.json [13:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:39] PROBLEM - Memcached on mc2026 is CRITICAL: connect to address 10.192.16.194 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:23:43] PROBLEM - Memcached on mc2037 is CRITICAL: connect to address 10.192.32.40 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:23:44] PROBLEM - Memcached on mc2031 is CRITICAL: connect to address 10.192.32.163 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:23:49] PROBLEM - Memcached on mc2034 is CRITICAL: connect to address 10.192.48.78 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:24:01] PROBLEM - Memcached on mc2020 is CRITICAL: connect to address 10.192.0.84 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:24:07] PROBLEM - Memcached on mc2021 is CRITICAL: connect to address 10.192.0.85 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:24:13] (03CR) 10Ottomata: [C: 03+2] Finalize backend EP migration of 4 EL schemas [puppet] - 10https://gerrit.wikimedia.org/r/699002 (https://phabricator.wikimedia.org/T282855) (owner: 10Ottomata) [13:24:14] PROBLEM - Memcached on mc2029 is CRITICAL: connect to address 10.192.32.161 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:24:21] PROBLEM - Memcached on mc2024 is CRITICAL: connect to address 10.192.16.61 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:24:21] PROBLEM - Memcached on mc2027 is CRITICAL: connect to address 
10.192.32.159 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:24:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: Repool es1033 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16492 and previous config saved to /var/cache/conftool/dbconfig/20210614-132449-root.json [13:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:37] PROBLEM - Memcached on mc2035 is CRITICAL: connect to address 10.192.48.79 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:25:37] PROBLEM - Memcached on mc2022 is CRITICAL: connect to address 10.192.0.86 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [13:28:51] effie: this is the new TLS port right? --^ [13:29:06] ah [13:29:11] I need to restart the servers [13:29:12] ok ok [13:29:25] please ignore, it is all me [13:29:34] !log restart memcached on codfw [13:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:00] perfect [13:30:06] 10SRE, 10docker-pkg, 10serviceops: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (10Joe) So, getting into more details: - we usually build those images on deneb, using a script called `build-production-images`, which basically just runs docker-pkg from a virtualenv (see... 
[13:31:44] ACKNOWLEDGEMENT - Memcached on mc2020 is CRITICAL: connect to address 10.192.0.84 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:44] ACKNOWLEDGEMENT - Memcached on mc2021 is CRITICAL: connect to address 10.192.0.85 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:44] ACKNOWLEDGEMENT - Memcached on mc2022 is CRITICAL: connect to address 10.192.0.86 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:44] ACKNOWLEDGEMENT - Memcached on mc2023 is CRITICAL: connect to address 10.192.16.60 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:44] ACKNOWLEDGEMENT - Memcached on mc2024 is CRITICAL: connect to address 10.192.16.61 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:44] ACKNOWLEDGEMENT - Memcached on mc2026 is CRITICAL: connect to address 10.192.16.194 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:44] ACKNOWLEDGEMENT - Memcached on mc2027 is CRITICAL: connect to address 10.192.32.159 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:45] ACKNOWLEDGEMENT - Memcached on mc2029 is CRITICAL: connect to address 10.192.32.161 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:45] ACKNOWLEDGEMENT - Memcached on mc2031 is CRITICAL: connect to address 10.192.32.163 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:46] ACKNOWLEDGEMENT - Memcached on mc2034 is CRITICAL: connect to address 10.192.48.78 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:46] ACKNOWLEDGEMENT - 
Memcached on mc2035 is CRITICAL: connect to address 10.192.48.79 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:47] ACKNOWLEDGEMENT - Memcached on mc2037 is CRITICAL: connect to address 10.192.32.40 and port 11214: Connection refused Effie Mouzeli T271967 https://wikitech.wikimedia.org/wiki/Memcached [13:31:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 10%: Repool db1170:3312 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16493 and previous config saved to /var/cache/conftool/dbconfig/20210614-133156-root.json [13:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:05] RECOVERY - Memcached on mc2020 is OK: TCP OK - 0.034 second response time on 10.192.0.84 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [13:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 10%: Repool db1170:3317 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16494 and previous config saved to /var/cache/conftool/dbconfig/20210614-133210-root.json [13:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:14] RECOVERY - Memcached on mc2021 is OK: TCP OK - 0.032 second response time on 10.192.0.85 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [13:32:33] RECOVERY - Memcached on mc2024 is OK: TCP OK - 0.032 second response time on 10.192.16.61 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [13:32:34] RECOVERY - Memcached on mc2027 is OK: TCP OK - 0.033 second response time on 10.192.32.159 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [13:32:49] RECOVERY - Memcached on mc2022 is OK: TCP OK - 0.032 second response time on 10.192.0.86 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [13:33:01] RECOVERY - Memcached on mc2026 is OK: TCP OK - 0.032 second response time on 10.192.16.194 port 11214 
https://wikitech.wikimedia.org/wiki/Memcached [13:33:39] RECOVERY - Memcached on mc2029 is OK: TCP OK - 0.034 second response time on 10.192.32.161 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [13:34:23] RECOVERY - Memcached on mc2031 is OK: TCP OK - 0.033 second response time on 10.192.32.163 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [13:34:29] RECOVERY - Memcached on mc2034 is OK: TCP OK - 0.033 second response time on 10.192.48.78 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [13:34:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 100%: Repool es1034 after upgrade', diff saved to https://phabricator.wikimedia.org/P16495 and previous config saved to /var/cache/conftool/dbconfig/20210614-133442-root.json [13:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: Repool es1032 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16496 and previous config saved to /var/cache/conftool/dbconfig/20210614-133503-root.json [13:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:27] RECOVERY - Memcached on mc2035 is OK: TCP OK - 0.033 second response time on 10.192.48.79 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [13:35:45] RECOVERY - Memcached on mc2037 is OK: TCP OK - 0.033 second response time on 10.192.32.40 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [13:35:47] (03CR) 10Ottomata: [C: 03+2] Make kafka cumin aliases consistent and complete [puppet] - 10https://gerrit.wikimedia.org/r/699415 (owner: 10Ottomata) [13:36:40] (03CR) 10Ottomata: [C: 03+2] sre/kafka/* update kafka cluster choices [cookbooks] - 10https://gerrit.wikimedia.org/r/699418 (https://phabricator.wikimedia.org/T279342) (owner: 10Ottomata) [13:37:55] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:38:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1147 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16497 and previous config saved to /var/cache/conftool/dbconfig/20210614-133801-marostegui.json [13:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: Repool es1033 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16498 and previous config saved to /var/cache/conftool/dbconfig/20210614-133953-root.json [13:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:51] (03PS2) 10Bartosz Dziewoński: Remove redundant wgDiscussionToolsEnable overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674300 (owner: 10Esanders) [13:41:59] (03PS3) 10Bartosz Dziewoński: Enable DiscussionTools' topicsubscription as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698622 (https://phabricator.wikimedia.org/T274280) [13:43:53] !log otto@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [13:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:38] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] httpbb-test: Fix usernames for docker-registry httpbb test [puppet] - 10https://gerrit.wikimedia.org/r/699725 (owner: 10JMeybohm) [13:46:07] (03CR) 10Klausman: "On top of the minor bits here, there seem to be a whole bunch of files with no NL at EOF. I dunno how WMF feels about that in general. 
I s" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [13:47:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 25%: Repool db1170:3312 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16499 and previous config saved to /var/cache/conftool/dbconfig/20210614-134700-root.json [13:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 25%: Repool db1170:3317 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16500 and previous config saved to /var/cache/conftool/dbconfig/20210614-134713-root.json [13:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: Repool es1032 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16501 and previous config saved to /var/cache/conftool/dbconfig/20210614-135007-root.json [13:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 10%: Repool db1147 after upgrade', diff saved to https://phabricator.wikimedia.org/P16502 and previous config saved to /var/cache/conftool/dbconfig/20210614-135025-root.json [13:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:07] (03PS1) 10JMeybohm: httpbb-test: docker-registry no WWW-Authenticate on login [puppet] - 10https://gerrit.wikimedia.org/r/699751 [13:54:36] (03CR) 10JMeybohm: [C: 03+2] httpbb-test: docker-registry no WWW-Authenticate on login [puppet] - 10https://gerrit.wikimedia.org/r/699751 (owner: 10JMeybohm) [13:54:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: Repool es1033 after kernel upgrade', diff saved 
to https://phabricator.wikimedia.org/P16503 and previous config saved to /var/cache/conftool/dbconfig/20210614-135456-root.json [13:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:53] (03PS1) 10Jelto: add job to weekly rebuild production-images [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) [13:58:25] (03CR) 10jerkins-bot: [V: 04-1] add job to weekly rebuild production-images [puppet] - 10https://gerrit.wikimedia.org/r/699752 (https://phabricator.wikimedia.org/T284431) (owner: 10Jelto) [14:01:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2006.codfw.wmnet [14:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 50%: Repool db1170:3312 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16504 and previous config saved to /var/cache/conftool/dbconfig/20210614-140203-root.json [14:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 50%: Repool db1170:3317 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16505 and previous config saved to /var/cache/conftool/dbconfig/20210614-140217-root.json [14:02:19] 10SRE, 10docker-pkg, 10serviceops, 10Patch-For-Review: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (10JMeybohm) [14:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:29] (03PS1) 10Gerrit maintenance bot: Add shi to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/699754 (https://phabricator.wikimedia.org/T284885) [14:05:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: Repool es1032 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16506 and 
previous config saved to /var/cache/conftool/dbconfig/20210614-140511-root.json [14:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 25%: Repool db1147 after upgrade', diff saved to https://phabricator.wikimedia.org/P16507 and previous config saved to /var/cache/conftool/dbconfig/20210614-140529-root.json [14:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2006.codfw.wmnet [14:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps2007.codfw.wmnet [14:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:14] (03CR) 10JMeybohm: [C: 03+1] "Looks pretty reasonable!" [puppet] - 10https://gerrit.wikimedia.org/r/699726 (owner: 10Jbond) [14:10:44] (03PS1) 10Ottomata: Migrate CentralNotice{BannerHistory,Impression} to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699759 (https://phabricator.wikimedia.org/T271168) [14:11:41] 10SRE, 10SRE-Access-Requests: Requesting access to mwmaint1002 for mepps - https://phabricator.wikimedia.org/T284773 (10mepps) @Volans Signed! [14:12:43] (03PS2) 10Ottomata: Migrate CentralNotice{BannerHistory,Impression} to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699759 (https://phabricator.wikimedia.org/T271168) [14:13:36] 10SRE, 10SRE-Access-Requests: Requesting access to mwmaint1002 for mepps - https://phabricator.wikimedia.org/T284773 (10marcella) Approved as @mepps's manager. Thank you! 
[14:15:32] (03CR) 10Ottomata: [C: 03+2] Migrate CentralNotice{BannerHistory,Impression} to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699759 (https://phabricator.wikimedia.org/T271168) (owner: 10Ottomata) [14:17:05] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate CentralNotice{BannerHistory,Impression} to EventGate on testwiki - T271168 (duration: 00m 57s) [14:17:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 75%: Repool db1170:3312 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16508 and previous config saved to /var/cache/conftool/dbconfig/20210614-141707-root.json [14:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:10] T271168: CentralNoticeBannerHistory and CentralNoticeImpression Event Platform Migration - https://phabricator.wikimedia.org/T271168 [14:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 75%: Repool db1170:3317 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16509 and previous config saved to /var/cache/conftool/dbconfig/20210614-141720-root.json [14:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: Repool es1032 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16510 and previous config saved to /var/cache/conftool/dbconfig/20210614-142014-root.json [14:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 50%: Repool db1147 after upgrade', diff saved to https://phabricator.wikimedia.org/P16511 and previous config saved to /var/cache/conftool/dbconfig/20210614-142032-root.json [14:20:36] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2007.codfw.wmnet [14:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:19] (03PS1) 10Ottomata: Migrate CentralNotice{BannerHistory,Impression} to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699762 (https://phabricator.wikimedia.org/T271168) [14:25:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1001.eqiad.wmnet [14:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:50] (03PS4) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [14:26:22] (03CR) 10Ottomata: [C: 03+2] Migrate CentralNotice{BannerHistory,Impression} to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699762 (https://phabricator.wikimedia.org/T271168) (owner: 10Ottomata) [14:27:47] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate CentralNotice{BannerHistory,Impression} to EventGate on all wikis - T271168 (duration: 00m 57s) [14:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:53] T271168: CentralNoticeBannerHistory and CentralNoticeImpression Event Platform Migration - https://phabricator.wikimedia.org/T271168 [14:30:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1001.eqiad.wmnet [14:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:40] (03PS5) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:32:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 100%: Repool db1170:3312 
after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16512 and previous config saved to /var/cache/conftool/dbconfig/20210614-143211-root.json [14:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 100%: Repool db1170:3317 after kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16513 and previous config saved to /var/cache/conftool/dbconfig/20210614-143224-root.json [14:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1002.eqiad.wmnet [14:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:02] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:35:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 75%: Repool db1147 after upgrade', diff saved to https://phabricator.wikimedia.org/P16514 and previous config saved to /var/cache/conftool/dbconfig/20210614-143536-root.json [14:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:36] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:39:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1002.eqiad.wmnet [14:39:49] (03PS6) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [14:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:13] (03PS8) 10Jbond: O:base::resolver: unify resolv.con templates [puppet] - 
10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [14:40:25] (03PS1) 10Effie Mouzeli: hieradata: Replace mcrouter proxies with codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/699764 (https://phabricator.wikimedia.org/T271967) [14:41:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1142 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P16515 and previous config saved to /var/cache/conftool/dbconfig/20210614-144130-marostegui.json [14:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:13] (03PS2) 10Effie Mouzeli: hieradata: Replace mcrouter proxies with codfw hosts on mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/699764 (https://phabricator.wikimedia.org/T271967) [14:42:19] (03PS4) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [14:42:33] (03CR) 10Filippo Giunchedi: "Dave: are there updates on this and/or things we can assist with?" 
[alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [14:43:28] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.con templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:43:44] 10SRE, 10vm-requests: Site: 2 VM request for an-airflow100{2,3} - https://phabricator.wikimedia.org/T284934 (10razzi) [14:44:59] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [14:45:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1003.eqiad.wmnet [14:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:57] (03Abandoned) 10Filippo Giunchedi: elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [14:46:31] (03Abandoned) 10Filippo Giunchedi: elastalert: new module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [14:46:58] (03Abandoned) 10Filippo Giunchedi: raid: report PDs from get-raid-status-hpssacli [puppet] - 10https://gerrit.wikimedia.org/r/407447 (https://phabricator.wikimedia.org/T185216) (owner: 10Filippo Giunchedi) [14:47:01] 10SRE, 10vm-requests: Site: 2 VM request for an-airflow100{2,3} - https://phabricator.wikimedia.org/T284934 (10MoritzMuehlenhoff) Looks fine. Please create one of them in row B and the other one in row D to better balance out our resource usage. [14:48:17] 10SRE, 10vm-requests: Site: 2 VM request for an-airflow100{2,3} - https://phabricator.wikimedia.org/T284934 (10razzi) a:03razzi Sounds good @MoritzMuehlenhoff, will do. 
[14:50:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 100%: Repool db1147 after upgrade', diff saved to https://phabricator.wikimedia.org/P16516 and previous config saved to /var/cache/conftool/dbconfig/20210614-145039-root.json [14:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1003.eqiad.wmnet [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:44] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/compiler1003/29885/" [puppet] - 10https://gerrit.wikimedia.org/r/699764 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [14:52:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 10%: Repool db1142 after upgrade', diff saved to https://phabricator.wikimedia.org/P16517 and previous config saved to /var/cache/conftool/dbconfig/20210614-145243-root.json [14:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hieradata: Replace mcrouter proxies with codfw hosts on mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/699764 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [14:55:41] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: Replace mcrouter proxies with codfw hosts on mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/699764 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [14:55:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1004.eqiad.wmnet [14:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={redis_maps,swagger_check_citoid_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:03:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:03:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1004.eqiad.wmnet [15:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:25] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-airflow1002.eqiad.wmnet [15:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps1005.eqiad.wmnet [15:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 25%: Repool db1142 after upgrade', diff saved to https://phabricator.wikimedia.org/P16518 and previous config saved to /var/cache/conftool/dbconfig/20210614-150747-root.json [15:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps1005.eqiad.wmnet [15:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 50%: Repool db1142 after upgrade', diff saved to https://phabricator.wikimedia.org/P16519 and previous config saved to /var/cache/conftool/dbconfig/20210614-152250-root.json [15:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:34] !log otto@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [15:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:43] 10Puppet, 
10observability, 10User-fgiunchedi: Puppet failing on the alert hosts should alert - https://phabricator.wikimedia.org/T283151 (10fgiunchedi) [15:24:49] 10Puppet, 10observability, 10User-fgiunchedi: Puppet failing on the alert hosts should alert - https://phabricator.wikimedia.org/T283151 (10lmata) a:03fgiunchedi [15:25:48] 10Puppet, 10observability, 10User-jbond: Add additional prometheus metrics to puppet runs - https://phabricator.wikimedia.org/T283585 (10lmata) @jbond moving to radar for visibility, let me know if you would like our help. [15:29:07] 10Puppet, 10observability, 10User-jbond: Add additional prometheus metrics to puppet runs - https://phabricator.wikimedia.org/T283585 (10jbond) @lmata, happy to help but definitely good to get review from someone in observability. some of this is already in place so may not be needed [15:34:24] ^ thank you [15:34:41] (03CR) 10Ema: [C: 03+2] varnish: remove ats-be migration leftover from varnishttfb [puppet] - 10https://gerrit.wikimedia.org/r/699377 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [15:35:16] (03PS1) 10Joal: [WIP] Add Gobblin modules [puppet] - 10https://gerrit.wikimedia.org/r/699770 [15:36:13] ottomata: --^ my very little progress - Mostly for comments in templates :) [15:37:00] (03CR) 10Elukey: [C: 04-1] Add support for knative serving (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) (owner: 10Elukey) [15:37:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 75%: Repool db1142 after upgrade', diff saved to https://phabricator.wikimedia.org/P16520 and previous config saved to /var/cache/conftool/dbconfig/20210614-153754-root.json [15:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:52] (03PS5) 10Elukey: Add support for knative serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/699380 (https://phabricator.wikimedia.org/T278194) [15:46:27] 
(03CR) 1020after4: [C: 03+1] "Looks good, Shall I merge this?" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: 10Sahilgrewalhere) [15:47:18] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10lmata) [15:47:46] 10SRE, 10observability, 10Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10lmata) [15:48:02] 10SRE, 10observability, 10Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10lmata) [15:48:51] 10SRE, 10observability, 10Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10lmata) 05Open→03Resolved [15:48:54] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10lmata) [15:49:18] (03PS5) 10Dave Pifke: Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [15:49:27] (03CR) 10Dzahn: "Thanks, I received the emails about puppet failures as well but did not get around yet to look at the reason and wouldn't have understood " [puppet] - 10https://gerrit.wikimedia.org/r/699506 (owner: 10Paladox) [15:51:10] (03CR) 10jerkins-bot: [V: 04-1] Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [15:51:21] 10SRE, 10observability: Making centrallog syslog easier and faster to work with - https://phabricator.wikimedia.org/T254605 (10lmata) 05Open→03Resolved [15:52:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 100%: Repool db1142 after upgrade', diff saved to https://phabricator.wikimedia.org/P16521 and previous config saved to /var/cache/conftool/dbconfig/20210614-155258-root.json [15:53:01] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:45] (03CR) 10Dzahn: [C: 04-1] Phabricator: Disable setting lowest priority on tasks [puppet] - 10https://gerrit.wikimedia.org/r/699493 (https://phabricator.wikimedia.org/T228759) (owner: 10Aklapper) [15:54:15] (03PS6) 10Dave Pifke: Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [15:54:59] (03CR) 10jerkins-bot: [V: 04-1] Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [15:55:25] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10herron) [15:56:16] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726" - https://phabricator.wikimedia.org/T247014 (10herron) 05Open→03Resolved a:03herron [15:56:36] !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1002.eqiad.wmnet [15:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:19] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10herron) [15:58:31] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10herron) 05Open→03Resolved a:03herron [16:00:50] 10SRE, 10SRE-Access-Requests: Requesting access to mwmaint1002 for mepps - https://phabricator.wikimedia.org/T284773 (10ssingh) [16:02:04] (03PS7) 10Dave Pifke: Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 
10Filippo Giunchedi) [16:02:47] (03CR) 10jerkins-bot: [V: 04-1] Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [16:08:40] (03PS8) 10Dave Pifke: Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [16:13:15] 10SRE, 10SRE-Access-Requests: Requesting access to mwmaint1002 for mepps - https://phabricator.wikimedia.org/T284773 (10Volans) @wkandek given that this access refers to hosts managed by SRE ServiceOps what of the existing groups mentioned in T284773#7152739 should be used for this use case? [16:33:01] 10SRE, 10SRE-Access-Requests: Requesting access to mwmaint1002 for mepps - https://phabricator.wikimedia.org/T284773 (10Urbanecm) >>! In T284773#7155785, @Volans wrote: > @wkandek given that this access refers to hosts managed by SRE ServiceOps what of the existing groups mentioned in T284773#7152739 should be... 
[16:34:42] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:38:18] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:39:07] (03PS2) 10Urbanecm: Add dag to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/698521 (https://phabricator.wikimedia.org/T284450) (owner: 10Gerrit maintenance bot) [16:39:22] (03CR) 10Urbanecm: [C: 03+1] Add shi to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/699754 (https://phabricator.wikimedia.org/T284885) (owner: 10Gerrit maintenance bot) [16:42:37] (03PS9) 10Dave Pifke: Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [16:43:19] (03CR) 10jerkins-bot: [V: 04-1] Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [16:46:32] !log jforrester@deploy1002 Started deploy [integration/docroot@ca7af97]: Add mediawiki/tools/api-testing JSDoc to doc.wikimedia for T236915 [16:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:36] T236915: Expose mediawiki/tools/api-testing doc on doc.wikimedia.org - https://phabricator.wikimedia.org/T236915 [16:46:39] !log jforrester@deploy1002 Finished deploy [integration/docroot@ca7af97]: Add mediawiki/tools/api-testing JSDoc to doc.wikimedia for T236915 (duration: 00m 07s) [16:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:17] (03PS3) 10STran: Enable $wgSecurePollSingleTransferableVoteEnabled on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699218 (https://phabricator.wikimedia.org/T283711) (owner: 10Wikitrent) [16:50:18] PROBLEM - Host elastic2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:54:30] RECOVERY - Host elastic2043 
is UP: PING OK - Packet loss = 0%, RTA = 32.26 ms [16:56:08] RECOVERY - Host elastic2043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.47 ms [16:57:45] (03CR) 10STran: [C: 03+2] Enable $wgSecurePollSingleTransferableVoteEnabled on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699218 (https://phabricator.wikimedia.org/T283711) (owner: 10Wikitrent) [16:58:31] (03Merged) 10jenkins-bot: Enable $wgSecurePollSingleTransferableVoteEnabled on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699218 (https://phabricator.wikimedia.org/T283711) (owner: 10Wikitrent) [17:02:55] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10herron) [17:05:02] !log jforrester@deploy1002 Started deploy [integration/docroot@22061b6]: Actually add mediawiki/tools/api-testing JSDoc to doc.wikimedia for T236915 [17:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:06] T236915: Expose mediawiki/tools/api-testing doc on doc.wikimedia.org - https://phabricator.wikimedia.org/T236915 [17:05:09] !log jforrester@deploy1002 Finished deploy [integration/docroot@22061b6]: Actually add mediawiki/tools/api-testing JSDoc to doc.wikimedia for T236915 (duration: 00m 07s) [17:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:43] (03PS1) 10Hnowlan: maps: make maps1007 a buster replica of the new imposm cluster [puppet] - 10https://gerrit.wikimedia.org/r/699782 (https://phabricator.wikimedia.org/T269582) [17:08:34] (03PS10) 10Dave Pifke: Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [17:09:38] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29886/console" [puppet] - 10https://gerrit.wikimedia.org/r/699782 
(https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [17:14:48] PROBLEM - Stale file for node-exporter textfile in codfw on alert1001 is CRITICAL: cluster=elasticsearch file=device_smart.prom instance=elastic2043 job=node site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [17:19:30] (03PS11) 10Dave Pifke: Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [17:20:53] (03CR) 10Dave Pifke: [C: 03+1] "I think this is ready to go. Thanks for your patience." [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [17:22:54] (03PS1) 10STran: Revert "Enable $wgSecurePollSingleTransferableVoteEnabled on beta sites" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699557 [17:25:35] (03CR) 10STran: [C: 03+2] Revert "Enable $wgSecurePollSingleTransferableVoteEnabled on beta sites" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699557 (owner: 10STran) [17:26:27] (03Merged) 10jenkins-bot: Revert "Enable $wgSecurePollSingleTransferableVoteEnabled on beta sites" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699557 (owner: 10STran) [17:29:25] (03PS1) 10STran: Revert "Revert "Enable $wgSecurePollSingleTransferableVoteEnabled on beta sites"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699558 [17:41:59] (03CR) 10MSantos: [C: 03+1] maps: make maps1007 a buster replica of the new imposm cluster [puppet] - 10https://gerrit.wikimedia.org/r/699782 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [17:56:24] RECOVERY - Stale file for node-exporter textfile in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [18:04:28] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is CRITICAL: 105.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37 [18:05:30] (03PS1) 10Ottomata: Finalize backend migration of CentralNotice EL schemas [puppet] - 10https://gerrit.wikimedia.org/r/699786 (https://phabricator.wikimedia.org/T259163) [18:06:56] (03CR) 10jerkins-bot: [V: 04-1] Finalize backend migration of CentralNotice EL schemas [puppet] - 10https://gerrit.wikimedia.org/r/699786 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata) [18:10:18] (03PS2) 10Ottomata: Finalize backend migration of CentralNotice EL schemas [puppet] - 10https://gerrit.wikimedia.org/r/699786 (https://phabricator.wikimedia.org/T259163) [18:15:41] (03PS1) 10Ottomata: Enable canary events for NavigationTiming ext streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/699789 (https://phabricator.wikimedia.org/T271208) [18:16:12] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:16:14] (03CR) 10Ottomata: "Hi Gilles! I should have enabled this long ago, eh?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/699789 (https://phabricator.wikimedia.org/T271208) (owner: 10Ottomata) [18:22:24] (03PS5) 10Jforrester: scap: Drop never-used 'sqldump' tool [puppet] - 10https://gerrit.wikimedia.org/r/692370 (owner: 10Zabe) [18:22:33] (03Abandoned) 10Jforrester: sqldump: Don't use wfGetLB(), we're killing it off [puppet] - 10https://gerrit.wikimedia.org/r/698649 (owner: 10Jforrester) [18:22:55] (03CR) 10Jforrester: "OK, I've high-jacked this commit to drop the feature entirely." [puppet] - 10https://gerrit.wikimedia.org/r/692370 (owner: 10Zabe) [18:23:15] (03PS1) 10Razzi: airflow: Add host configuration for an-airflow1002 [puppet] - 10https://gerrit.wikimedia.org/r/699790 (https://phabricator.wikimedia.org/T284934) [18:30:41] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-airflow1003.eqiad.wmnet [18:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:37] (03PS1) 10Bstorm: wikireplicas: re-enable notifications for clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/699791 [18:38:22] (03PS1) 10Ottomata: analytics cluster - Remove more deb packages that sbould not be needed [puppet] - 10https://gerrit.wikimedia.org/r/699792 (https://phabricator.wikimedia.org/T275786) [18:42:31] (03CR) 10Ottomata: [C: 03+2] analytics cluster - Remove more deb packages that sbould not be needed [puppet] - 10https://gerrit.wikimedia.org/r/699792 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [18:45:42] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) >>! In T281344#7140212, @ssingh wrote: >>>! In T281344#7139967, @Volans wrote: >> Anything still pending here on the #sre-access-requests side? > > Thanks for checking! I updated pwstore as tha... 
[18:53:42] 10SRE, 10SRE-Access-Requests: Requesting access to mwmaint1002 for mepps - https://phabricator.wikimedia.org/T284773 (10wkandek) Let's use the `restricted` group [19:12:26] (03CR) 10Bstorm: [C: 03+2] "This works fine in my tests. We'll need to do a deb package release and then rebuild all the k8s images. I'll set it to merge at least for" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586) (owner: 10Majavah) [19:13:43] (03Merged) 10jenkins-bot: Use Python 3 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586) (owner: 10Majavah) [19:21:01] !log applying hotfix for T284397 and restarting php7.3-fpm on phab1001 [19:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:06] T284397: Unhandled Exception ("RuntimeException"): Undefined offset: 5 when trying to access T16235 - https://phabricator.wikimedia.org/T284397 [19:27:43] !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1003.eqiad.wmnet [19:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:28] (03PS3) 10Majavah: Replace os.execv with subprocess.check_call [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 [19:30:30] (03PS9) 10Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) [19:43:27] (03PS1) 10Ssingh: admin: add mepps to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/699799 (https://phabricator.wikimedia.org/T284773) [19:47:23] (03CR) 10Ssingh: [C: 03+2] admin: add mepps to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/699799 (https://phabricator.wikimedia.org/T284773) (owner: 10Ssingh) [19:49:40] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002 for mepps - 
https://phabricator.wikimedia.org/T284773 (10ssingh) 05Open→03Resolved a:03ssingh @mepps: You have been added to the `restricted` group. Please let us know if there are any questions, thanks! [20:10:06] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [20:31:48] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 110.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [20:36:09] 10SRE, 10SRE-Access-Requests: Requesting access to mwmaint1002 for mepps - https://phabricator.wikimedia.org/T284773 (10mepps) Thank you! I checked and I can connect :). [20:57:25] (03CR) 10Bstorm: "The only way I can think of to properly test this is to release it to toolsbeta. 
It needs to be tested on the grid (by being installed on " [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah) [20:59:18] (03PS2) 10Razzi: airflow: Add host configuration for an-airflow100{2,3} [puppet] - 10https://gerrit.wikimedia.org/r/699790 (https://phabricator.wikimedia.org/T284934) [21:03:02] (03CR) 10Ottomata: [C: 03+1] airflow: Add host configuration for an-airflow100{2,3} [puppet] - 10https://gerrit.wikimedia.org/r/699790 (https://phabricator.wikimedia.org/T284934) (owner: 10Razzi) [21:03:14] (03CR) 10Razzi: [C: 03+2] airflow: Add host configuration for an-airflow100{2,3} [puppet] - 10https://gerrit.wikimedia.org/r/699790 (https://phabricator.wikimedia.org/T284934) (owner: 10Razzi) [21:21:48] (03CR) 10Bstorm: "Built a package to test. Uploading it to toolsbeta aptly..." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah) [21:22:16] (03CR) 10Bstorm: "> Patch Set 3:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah) [21:34:02] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [21:35:50] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [21:40:01] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@baeee47]: T261407 bulk_daemon: Deploy prioritized topics [21:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:06] T261407: Add a link engineering: Create event for event gate to update search index after obtaining link recommendations - https://phabricator.wikimedia.org/T261407 [21:40:50] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@baeee47]: T261407 bulk_daemon: Deploy prioritized topics (duration: 00m 49s) [21:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:20] (03CR) 10Ebernhardson: [C: 03+1] "No longer blocked, this is good to go" [puppet] - 10https://gerrit.wikimedia.org/r/697836 (https://phabricator.wikimedia.org/T261407) (owner: 10Ebernhardson) [22:04:18] (03PS1) 10Brennen Bearnes: disable issues & wikis by default on new projects [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699812 (https://phabricator.wikimedia.org/T264231) [22:06:26] (03CR) 10Jforrester: "Do you want to kill gitlab_default_projects_features_snippets as well?" 
[gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699812 (https://phabricator.wikimedia.org/T264231) (owner: 10Brennen Bearnes) [22:08:33] (03PS2) 10Brennen Bearnes: disable issues & wikis by default on new projects [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699812 (https://phabricator.wikimedia.org/T264231) [22:08:54] (03CR) 10Brennen Bearnes: "> Patch Set 1:" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699812 (https://phabricator.wikimedia.org/T264231) (owner: 10Brennen Bearnes) [22:15:40] (03PS3) 10Brennen Bearnes: disable issues & wikis by default on new projects [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699812 (https://phabricator.wikimedia.org/T264231) [22:26:51] (03CR) 10Bstorm: "This version (with python3) runs webservice-runner fine for a python3.7 web service in toolsbeta Kubernetes. Now I've got to try running o" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah) [22:41:35] (03CR) 10Bstorm: "There's a problem. It is likely not this patch, but the one from python3 setup." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah) [22:44:07] (03PS1) 10Ebernhardson: mjolnir: Provide prioritized topics to bulk daemon [puppet] - 10https://gerrit.wikimedia.org/r/699814 (https://phabricator.wikimedia.org/T261407) [22:49:19] (03PS1) 10Bstorm: python3: fix encoding in grid output [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699815 [22:50:22] (03PS2) 10Ebernhardson: mjolnir: Provide prioritized topics to bulk daemon [puppet] - 10https://gerrit.wikimedia.org/r/699814 (https://phabricator.wikimedia.org/T261407) [22:50:46] (03CR) 10Bstorm: "The next pending release (with 100% more python 3) will break for grid engine because any grid commands always return bytes. 
This was test" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/699815 (owner: 10Bstorm) [22:52:35] (03CR) 10Bstorm: "Ok, with Id9ed766b63764ef72843dda9ac, this tests out well for grid and k8s. I only have python test cases, so I hope and presume it will n" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697102 (owner: 10Majavah) [22:53:42] (03CR) 10Ebernhardson: "pcc looks reasonable: https://puppet-compiler.wmflabs.org/compiler1002/29888/" [puppet] - 10https://gerrit.wikimedia.org/r/699814 (https://phabricator.wikimedia.org/T261407) (owner: 10Ebernhardson) [23:29:05] (03PS4) 10Brennen Bearnes: disable issues & wikis by default on new projects [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699812 (https://phabricator.wikimedia.org/T264231) [23:33:08] (03PS1) 10Brennen Bearnes: CAS: stop marking users as external [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699819 (https://phabricator.wikimedia.org/T274461) [23:36:30] PROBLEM - snapshot of x1 in eqiad on alert1001 is CRITICAL: snapshot for x1 at eqiad taken more than 3 days ago: Most recent backup 2021-06-11 23:18:38 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [23:42:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:44:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:45:11] PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1263734 MB (15% inode=79%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [23:48:00] (03PS2) 10Brennen Bearnes: CAS: stop marking users as external [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/699819 (https://phabricator.wikimedia.org/T274461)