[00:01:23] SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator, Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9570656 (Bugreporter) >>! In T358044#9569964, @Peachey88 wrote: >>>! In T358044#9569601, @bvibber wrote: >> That's probably the way...
[00:03:05] thx
[00:03:20] !log zabe@deploy2002 Started scap: Backport for [[gerrit:1005701|block: Pass wikiId to DatabaseBlock::getId in DatabaseBlockStore (T358208)]]
[00:03:25] (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:28] T358208: Wikimedia\Assert\PreconditionException: Expected MediaWiki\Block\AbstractBlock to belong to the local wiki, but it belongs to 'huwiki' - https://phabricator.wikimedia.org/T358208
[00:04:43] !log zabe@deploy2002 zabe: Backport for [[gerrit:1005701|block: Pass wikiId to DatabaseBlock::getId in DatabaseBlockStore (T358208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:05:41] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:16] !log zabe@deploy2002 zabe: Continuing with sync
[00:08:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T357189)', diff saved to https://phabricator.wikimedia.org/P57790 and previous config saved to /var/cache/conftool/dbconfig/20240223-000858-arnaudb.json
[00:09:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[00:09:05] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[00:09:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[00:09:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T357189)', diff saved to https://phabricator.wikimedia.org/P57791 and previous config saved to /var/cache/conftool/dbconfig/20240223-000920-arnaudb.json
[00:12:39] !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user="Grandmaster Huon" . # T358022
[00:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:45] T358022: Server side upload for Grandmaster Huon - https://phabricator.wikimedia.org/T358022
[00:14:22] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:1005701|block: Pass wikiId to DatabaseBlock::getId in DatabaseBlockStore (T358208)]] (duration: 11m 02s)
[00:14:28] T358208: Wikimedia\Assert\PreconditionException: Expected MediaWiki\Block\AbstractBlock to belong to the local wiki, but it belongs to 'huwiki' - https://phabricator.wikimedia.org/T358208
[00:25:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T357189)', diff saved to https://phabricator.wikimedia.org/P57793 and previous config saved to /var/cache/conftool/dbconfig/20240223-002547-arnaudb.json
[00:25:54] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[00:39:13] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005537
[00:39:15] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005537 (owner: TrainBranchBot)
[00:40:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P57794 and previous config saved to /var/cache/conftool/dbconfig/20240223-004054-arnaudb.json
[00:47:24] SRE-swift-storage, Commons, UploadWizard: Incomplete files uploaded - chunked upload drops last chunk. - https://phabricator.wikimedia.org/T350917#9570760 (RoyZuo) your analysis makes sense. indeed there're also files cut off at 15 mb https://commons.wikimedia.org/w/index.php?sort=create_timestamp_de...
[00:50:34] SRE-swift-storage, Commons, UploadWizard: Incomplete files uploaded - chunked upload drops last chunk. - https://phabricator.wikimedia.org/T350917#9570761 (RoyZuo) someone should do something about the buggy uploads. # find a way to check files within these boundaries of multiples of 5 mb for corru...
[00:52:49] RECOVERY - cassandra-a SSL 10.64.0.130:7000 on restbase1035 is OK: SSL OK - Certificate restbase1035-a valid until 2026-02-20 21:33:45 +0000 (expires in 728 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[00:55:31] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:56:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P57795 and previous config saved to /var/cache/conftool/dbconfig/20240223-005601-arnaudb.json
[00:59:23] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:02:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005537 (owner: TrainBranchBot)
[01:03:25] (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:11:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T357189)', diff saved to https://phabricator.wikimedia.org/P57796 and previous config saved to /var/cache/conftool/dbconfig/20240223-011107-arnaudb.json
[01:11:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance
[01:11:18] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[01:11:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance
[01:11:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T357189)', diff saved to https://phabricator.wikimedia.org/P57797 and previous config saved to /var/cache/conftool/dbconfig/20240223-011128-arnaudb.json
[01:13:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T357189)', diff saved to https://phabricator.wikimedia.org/P57798 and previous config saved to /var/cache/conftool/dbconfig/20240223-011347-arnaudb.json
[01:15:43] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:16:30] SRE-swift-storage, Commons, UploadWizard: Incomplete files uploaded - chunked upload drops last chunk. - https://phabricator.wikimedia.org/T350917#9570816 (Bawolff) As an aside some historical discussion at https://static-codereview.wikimedia.org/MediaWiki/104687.html (for context, back in the day we...
[01:28:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P57799 and previous config saved to /var/cache/conftool/dbconfig/20240223-012853-arnaudb.json
[01:44:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P57800 and previous config saved to /var/cache/conftool/dbconfig/20240223-014400-arnaudb.json
[01:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:59:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T357189)', diff saved to https://phabricator.wikimedia.org/P57801 and previous config saved to /var/cache/conftool/dbconfig/20240223-015907-arnaudb.json
[01:59:14] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[02:09:05] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 110, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:03:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:05:41] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:15:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:20:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:02:52] SRE-swift-storage, UploadWizard: Internal error: The server could not save the temporary file - https://phabricator.wikimedia.org/T353068#9571017 (Bawolff) Doesn't help that AssembleUploadChunksJob.php takes the specific (and even localized!) error and replaces it with a super generic one. Maybe we shoul...
[05:03:41] (SystemdUnitFailed) firing: (4) ferm.service on mw1457:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:34] (PS1) Giuseppe Lavagetto: Revert "admin: temporarily revoke legoktm's ssh key" [puppet] - https://gerrit.wikimedia.org/r/1005702
[05:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:16:20] (CR) Giuseppe Lavagetto: [C: +2] Revert "admin: temporarily revoke legoktm's ssh key" [puppet] - https://gerrit.wikimedia.org/r/1005702 (owner: Giuseppe Lavagetto)
[05:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:21:46] SRE-swift-storage, MediaWiki-Uploading, MW-1.42-notes (1.42.0-wmf.19; 2024-02-20), User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9571042 (Bawolff) Ok, so new logs do...
[06:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240223T0700)
[07:19:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1031 T358180', diff saved to https://phabricator.wikimedia.org/P57802 and previous config saved to /var/cache/conftool/dbconfig/20240223-071952-root.json
[07:19:59] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[07:23:39] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:42] (PS1) Marostegui: es1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005869 (https://phabricator.wikimedia.org/T358180)
[07:27:59] (CR) Marostegui: [C: +2] es1031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005869 (https://phabricator.wikimedia.org/T358180) (owner: Marostegui)
[07:28:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1031.eqiad.wmnet with OS bookworm
[07:40:41] !log Install 10.6.17 on pc1014 T357089
[07:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:47] T357089: Compile and package MariaDB 10.6.17 - https://phabricator.wikimedia.org/T357089
[07:42:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1031.eqiad.wmnet with reason: host reimage
[07:44:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1031.eqiad.wmnet with reason: host reimage
[07:46:24] (CR) Filippo Giunchedi: [C: +2] sre: fix WidespreadPuppetFailure logic for no resources [alerts] - https://gerrit.wikimedia.org/r/1005752 (https://phabricator.wikimedia.org/T357893) (owner: Filippo Giunchedi)
[07:48:39] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:49:18] (PS1) Marostegui: Revert "es1031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005703
[07:56:11] SRE, Continuous-Integration-Infrastructure, collaboration-services, vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9571093 (LSobanski)
[07:57:43] (CR) Slyngshede: [V: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1005743 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240223T0800)
[08:00:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1031.eqiad.wmnet with OS bookworm
[08:05:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 1%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57803 and previous config saved to /var/cache/conftool/dbconfig/20240223-080528-root.json
[08:05:41] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[08:19:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[08:20:24] !log rollout prometheus-rsyslog-exporter new version to remaining hosts, caching sites - T357616
[08:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:30] T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616
[08:20:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57804 and previous config saved to /var/cache/conftool/dbconfig/20240223-082033-root.json
[08:35:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57805 and previous config saved to /var/cache/conftool/dbconfig/20240223-083538-root.json
[08:50:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57806 and previous config saved to /var/cache/conftool/dbconfig/20240223-085043-root.json
[08:53:56] !log root@cumin2002 START - Cookbook sre.idm.logout Logging GoranSMilovanovic out of all services on: 8 hosts
[08:54:02] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging GoranSMilovanovic out of all services on: 8 hosts
[09:03:41] (SystemdUnitFailed) firing: (4) ferm.service on mw1457:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:06] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review, User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9571245 (MoritzMuehlenhoff) Open→Resolved @AndrewTavis_WMDE @Manuel I've removed the access al...
[09:05:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57807 and previous config saved to /var/cache/conftool/dbconfig/20240223-090549-root.json
[09:08:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[09:09:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[09:09:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T357189)', diff saved to https://phabricator.wikimedia.org/P57808 and previous config saved to /var/cache/conftool/dbconfig/20240223-090913-arnaudb.json
[09:09:22] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[09:14:22] SRE, Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9571256 (MoritzMuehlenhoff)
[09:20:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57809 and previous config saved to /var/cache/conftool/dbconfig/20240223-092053-root.json
[09:35:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1031 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57810 and previous config saved to /var/cache/conftool/dbconfig/20240223-093559-root.json
[09:38:38] SRE-swift-storage, MediaWiki-Uploading, MW-1.42-notes (1.42.0-wmf.19; 2024-02-20), User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9571312 (Bawolff) I wonder if the sl...
[09:52:43] SRE, User-aborrero: reimage cookbook: failure when updating netbox data from puppetdb on cloudvirt1033 - https://phabricator.wikimedia.org/T358099#9571337 (aborrero) Open→Invalid Maybe lets mark as invalid until we are able to reproduce it.
[10:02:47] (PS2) Filippo Giunchedi: rsyslog: add maxmessagesize to -receiver [puppet] - https://gerrit.wikimedia.org/r/1005949 (https://phabricator.wikimedia.org/T358317)
[10:03:51] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 94 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:05:01] SRE-swift-storage, MediaWiki-Uploading, MW-1.42-notes (1.42.0-wmf.19; 2024-02-20), User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9571398 (MatthewVernon) >>! In T2008...
[10:05:15] SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator, Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9571399 (ayounsi) a:ayounsi
[10:07:09] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968#9571403 (ayounsi) Can you update the key present on your mediawiki page as well ? Thanks
[10:08:20] (CR) Clément Goubert: [C: +1] kubernetes: move all remaining eligible codfw jobrunners to k8s [puppet] - https://gerrit.wikimedia.org/r/1005786 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[10:08:51] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 73 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:09:43] (CR) Andrea Denisse: [C: +1] "LGTM, thank you." [puppet] - https://gerrit.wikimedia.org/r/1005949 (https://phabricator.wikimedia.org/T358317) (owner: Filippo Giunchedi)
[10:13:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T357189)', diff saved to https://phabricator.wikimedia.org/P57811 and previous config saved to /var/cache/conftool/dbconfig/20240223-101348-arnaudb.json
[10:13:55] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[10:15:40] (CR) David Caro: openstack: nova: compute: drop version matrix split (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1005763 (owner: Arturo Borrero Gonzalez)
[10:16:48] (PS1) Majavah: Convert remaining images to shell webservice-runner [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1005952 (https://phabricator.wikimedia.org/T293552)
[10:16:58] (CR) Ayounsi: [C: +2] Enable forwarding more broadly and fix nftables bug (1 comment) [puppet] - https://gerrit.wikimedia.org/r/994663 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[10:17:15] (CR) Ayounsi: [C: +2] Netbox report, reduce alerting spam (1 comment) [puppet] - https://gerrit.wikimedia.org/r/984217 (https://phabricator.wikimedia.org/T321704) (owner: Ayounsi)
[10:18:05] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:18:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:18:57] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:19:05] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:19:08] here we go again
[10:19:11] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:19:13] hmmm
[10:19:35] sigh, I'll take a look too
[10:19:40] I just thought "why is my grafana dashboard not loading"
[10:19:44] acked
[10:19:52] jelto: Oh so *you* broke it
[10:19:54] :p
[10:20:08] tut tut!
[10:20:22] probably :)
[10:20:43] well, same as last time, CPU pinned at 100%
[10:21:09] seriously though, grafana works for me, and still looking
[10:21:31] (CR) MVernon: "Inclined to leave the warning as-is "this server might have a sad filesystem", wheras the multiple-load critical one is "something is badl" [alerts] - https://gerrit.wikimedia.org/r/1004619 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:21:37] godog: should I leave it to you then?
[10:21:39] (PS3) Arturo Borrero Gonzalez: openstack: nova: compute: drop version matrix split [puppet] - https://gerrit.wikimedia.org/r/1005763
[10:21:41] (PS3) Arturo Borrero Gonzalez: openstack: nova: compute: extend dependency on ceph.conf [puppet] - https://gerrit.wikimedia.org/r/1005764 (https://phabricator.wikimedia.org/T358101)
[10:22:11] claime: yes I'll look into it
[10:22:13] ack
[10:22:56] (PS3) Ayounsi: Add SameSite=Strict attribute to NetworkProbeLimit cookie [puppet] - https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624)
[10:23:32] slightly better than last time I'd say, as in the OOM kicked in and hosts themselves remained available
[10:23:35] (CR) Ayounsi: Add SameSite=Strict attribute to NetworkProbeLimit cookie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624) (owner: Ayounsi)
[10:23:57] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:24:25] ^ page resolved
[10:24:50] SRE, MW-on-K8s, Scap, serviceops, Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9571471 (Clement_Goubert) >>! In T358117#9570020, @thcipriani wrote: > ... > Running httpbb against an mwdebug server before roll...
[10:26:20] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[10:26:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[10:28:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P57813 and previous config saved to /var/cache/conftool/dbconfig/20240223-102854-arnaudb.json
[10:29:12] \/28
[10:29:23] (CR) Ayounsi: [V: +2 C: +2] "Merging for future deployment." [software/netbox-deploy] - https://gerrit.wikimedia.org/r/980824 (https://phabricator.wikimedia.org/T308002) (owner: Ayounsi)
[10:30:39] (CR) Filippo Giunchedi: [C: +2] rsyslog: add maxmessagesize to -receiver [puppet] - https://gerrit.wikimedia.org/r/1005949 (https://phabricator.wikimedia.org/T358317) (owner: Filippo Giunchedi)
[10:32:24] (PS4) Fabfur: haproxy: configure extended logging (preparatory for Benthos) [puppet] - https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105)
[10:33:49] (CR) Fabfur: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1434/console" [puppet] - https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: Fabfur)
[10:35:24] (CR) Fabfur: [V: +1] "Removed test hiera for PCC, it should have no diff now with production hosts" [puppet] - https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: Fabfur)
[10:41:16] !log running homer 'cr*eqiad*' commit 'T351074' && homer 'lsw1-f2-eqiad*' commit 'T351074' for jobrunners being migrated to k8s workers
[10:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:22] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[10:42:21] (CR) Ayounsi: Adjust network prepare-upgrade cookbook to use TCP 8080 (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) (owner: Cathal Mooney)
[10:44:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P57814 and previous config saved to /var/cache/conftool/dbconfig/20240223-104401-arnaudb.json
[10:47:48] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review, User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9571538 (AndrewTavis_WMDE) Thank you, @MoritzMuehlenhoff! Really grateful to have this finalized. I'll...
[10:49:02] !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1458.eqiad.wmnet|mw1467.eqiad.wmnet|mw1468.eqiad.wmnet|mw1483.eqiad.wmnet|mw1484.eqiad.wmnet|mw1485.eqiad.wmnet|mw1494.eqiad.wmnet),cluster=kubernetes,service=kubesvc
[10:52:47] !log running homer 'cr*codfw*' commit 'T351074' for new appservers being migrated to k8s workers
[10:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:53] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[10:54:24] SRE, Infrastructure-Foundations: Decommission cumin1001 - https://phabricator.wikimedia.org/T358328#9571567 (MoritzMuehlenhoff)
[10:58:16] SRE, observability, Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788#9571583 (fgiunchedi) This happened again today, recovery was better in the sense that titan hosts themselves remained available, the OOM...
[10:59:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T357189)', diff saved to https://phabricator.wikimedia.org/P57815 and previous config saved to /var/cache/conftool/dbconfig/20240223-105907-arnaudb.json
[10:59:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[10:59:13] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[10:59:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[10:59:28] SRE, Infrastructure-Foundations: Decommission cumin1001 - https://phabricator.wikimedia.org/T358328#9571585 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff
[10:59:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T357189)', diff saved to https://phabricator.wikimedia.org/P57816 and previous config saved to /var/cache/conftool/dbconfig/20240223-105929-arnaudb.json
[11:01:56] !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2369.codfw.wmnet|mw2367.codfw.wmnet),cluster=kubernetes,service=kubesvc
[11:05:34] SRE, Wikimedia-Etherpad, collaboration-services, Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9571604 (Jelto)
[11:07:23] !log running `homer 'cr*codfw*' commit 'T351074'` for two more appservers becoming k8s workers
[11:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:29] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[11:07:31] (PS1) Stevemunene: superset: add availability monitor [alerts] - https://gerrit.wikimedia.org/r/1005540 (https://phabricator.wikimedia.org/T356484)
[11:14:52] (CR) Cathal Mooney: [C: +2] Change name of dhcp_relay var and use it to control CR IPv6 RAs also [homer/public] - https://gerrit.wikimedia.org/r/1005772 (https://phabricator.wikimedia.org/T358220) (owner: Cathal Mooney)
[11:15:36] (Merged) jenkins-bot: Change name of dhcp_relay var and use it to control CR IPv6 RAs also [homer/public] - https://gerrit.wikimedia.org/r/1005772 (https://phabricator.wikimedia.org/T358220) (owner: Cathal Mooney)
[11:16:38] (PS1) Muehlenhoff: dumps::nfs: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1005960
[11:20:25] (PS1) Jelto: etherpad: stop etherpad service on etherpad1003 [puppet] - https://gerrit.wikimedia.org/r/1005961 (https://phabricator.wikimedia.org/T316421)
[11:20:27] (PS1) Jelto: etherpad: start etherpad service on etherpad1004 [puppet] - https://gerrit.wikimedia.org/r/1005962 (https://phabricator.wikimedia.org/T316421)
[11:21:12] SRE, Wikimedia-Etherpad, collaboration-services, Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9571665 (Jelto)
[11:22:31] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1005961 (https://phabricator.wikimedia.org/T316421) (owner: Jelto)
[11:23:52] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1005962 (https://phabricator.wikimedia.org/T316421) (owner: Jelto)
[11:27:07] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1005960 (owner: Muehlenhoff)
[11:29:26] (PS1) Jelto: wmnet: switch etherpad to etherpad1004 [dns] - https://gerrit.wikimedia.org/r/1005963 (https://phabricator.wikimedia.org/T316421)
[11:31:15] SRE, Wikimedia-Etherpad, collaboration-services, Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9571672 (Jelto)
[11:32:16] !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2384.codfw.wmnet|mw2385.codfw.wmnet),cluster=kubernetes,service=kubesvc
[11:37:10] (Abandoned) Arturo Borrero Gonzalez: openstack: nova: compute: drop version matrix split [puppet] - https://gerrit.wikimedia.org/r/1005763 (owner: Arturo Borrero Gonzalez)
[11:37:21] (Abandoned) Arturo Borrero Gonzalez: openstack: nova: compute: extend dependency on ceph.conf [puppet] - https://gerrit.wikimedia.org/r/1005764 (https://phabricator.wikimedia.org/T358101) (owner: Arturo Borrero Gonzalez)
[11:42:02] (CR) Cathal Mooney: [C: +2] Change name of dhcp_relay var and use it to control CR IPv6 RAs also (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1005772 (https://phabricator.wikimedia.org/T358220) (owner: Cathal Mooney)
[11:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:49:12] !log STOP persistRevisionThreadItems on viwiki for T315510, had been throwing tons of errors since at least Wednesday
[11:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:18] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[11:51:02] SRE, Wikimedia-Etherpad, collaboration-services, Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9571713 (Jelto)
[11:51:59] (CR) Jelto: "preparation for switch to new hardware on Monday" [dns] - https://gerrit.wikimedia.org/r/1005963 (https://phabricator.wikimedia.org/T316421) (owner: Jelto)
[11:52:08] !log START
lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki viwiki --current --all --touched-after=20230613000000 --start '["7939741"]' 2>&1 | tee ~/T315510-viwiki # in tmux [11:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:19] (03CR) 10Jelto: [V: 03+1] "preparation for switch to new hardware on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/1005961 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [11:52:30] (03CR) 10Hnowlan: [C: 03+2] kubernetes: move all remaining eligible codfw jobrunners to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005786 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [11:52:36] (03CR) 10Jelto: [V: 03+1] "preparation for switch to new hardware on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/1005962 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [11:56:53] 10SRE-swift-storage, 10Commons, 10UploadWizard, 10Patch-For-Review: Incomplete files uploaded - chunked upload drops last chunk. - https://phabricator.wikimedia.org/T350917#9571726 (10Bawolff) >>! In T350917#9571715, @gerritbot wrote: > Change 1005965 had a related patch set uploaded (by Brian Wolff; autho... 
[11:58:25] !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --start '["75194261"]' | tee -a ~/T315510-enwiki-2 # in tmux [11:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:01:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T357189)', diff saved to https://phabricator.wikimedia.org/P57818 and previous config saved to /var/cache/conftool/dbconfig/20240223-120129-arnaudb.json [12:01:49] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:02:18] (03PS1) 10Majavah: Remove toolschecker grid engine checks [puppet] - 10https://gerrit.wikimedia.org/r/1005967 (https://phabricator.wikimedia.org/T358333) [12:03:25] (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:03:26] (03CR) 10CI reject: [V: 04-1] Remove toolschecker grid engine checks [puppet] - 10https://gerrit.wikimedia.org/r/1005967 (https://phabricator.wikimedia.org/T358333) (owner: 10Majavah) [12:03:44] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1005967 (https://phabricator.wikimedia.org/T358333) (owner: 10Majavah) [12:04:06] (03PS2) 10Majavah: Remove toolschecker grid engine checks [puppet] - 10https://gerrit.wikimedia.org/r/1005967 
(https://phabricator.wikimedia.org/T358333) [12:04:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:05:41] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:37] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2040 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:07:27] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:08:25] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:10:19] RECOVERY - Check whether ferm is active by checking the default input chain on mw2385 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:13:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki 
errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:13:25] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:13:25] (03PS1) 10Muehlenhoff: pontoon: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1005970 [12:15:33] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2351.codfw.wmnet with OS bullseye [12:15:38] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2353.codfw.wmnet with OS bullseye [12:15:46] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2351.codfw.wmnet with OS bullseye [12:15:47] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2382.codfw.wmnet with OS bullseye [12:15:48] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2394.codfw.wmnet with OS bullseye [12:15:50] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2419.codfw.wmnet with OS bullseye [12:15:51] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2426.codfw.wmnet with OS bullseye [12:15:52] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2428.codfw.wmnet with OS bullseye [12:15:54] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571766 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2353.codfw.wmnet with OS bullseye [12:15:57] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2444.codfw.wmnet with OS bullseye [12:16:00] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2382.codfw.wmnet with OS bullseye [12:16:03] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571769 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2419.codfw.wmnet with OS bullseye [12:16:07] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571770 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2426.codfw.wmnet with OS bullseye [12:16:11] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571772 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2444.codfw.wmnet with OS bullseye [12:16:13] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571771 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2428.codfw.wmnet with OS bullseye [12:16:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P57819 and previous config saved to /var/cache/conftool/dbconfig/20240223-121635-arnaudb.json [12:17:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1005970 (owner: 10Muehlenhoff) [12:17:55] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 96 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:18:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:18:25] (SystemdUnitFailed) firing: (7) httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:23] (03PS1) 10Muehlenhoff: Remove obsolete entry [puppet] - 10https://gerrit.wikimedia.org/r/1005971 (https://phabricator.wikimedia.org/T349619) [12:21:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:23:18] ^ that's a reimage side-effect, not a problem [12:23:25] (SystemdUnitFailed) firing: (7) httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:23:27] however, those mw-jobrunner errors are real - eventgate is having issues again [12:23:30] https://grafana.wikimedia.org/goto/tZV3EtoIk?orgId=1 [12:25:37] PROBLEM - Check whether ferm is active by checking the default input chain on 
kubernetes2007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:27:55] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 85 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:28:25] (SystemdUnitFailed) firing: (8) httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:30:45] idk wth is happening with eventgate [12:31:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:31:28] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2353.codfw.wmnet with reason: host reimage [12:31:37] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2351.codfw.wmnet with reason: host reimage [12:31:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P57820 and previous config saved to /var/cache/conftool/dbconfig/20240223-123141-arnaudb.json [12:32:06] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2382.codfw.wmnet with reason: host reimage [12:32:11] Oooh I think I know [12:32:14] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2426.codfw.wmnet with reason: host reimage [12:32:17] !log hnowlan@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on mw2419.codfw.wmnet with reason: host reimage [12:32:29] Some pods get oomkilled [12:32:30] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:32:30] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2444.codfw.wmnet with reason: host reimage [12:32:47] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2394.codfw.wmnet with reason: host reimage [12:32:48] claime: I was just noting that the memory limit for eventgate is 300mb [12:33:30] hnowlan: It's 600 for eventgate-main [12:34:08] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2428.codfw.wmnet with reason: host reimage [12:34:11] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2353.codfw.wmnet with reason: host reimage [12:36:18] I don't see many OOMKills in the pods themselves [12:36:25] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2444.codfw.wmnet with reason: host reimage [12:38:38] None of them got killed today though [12:38:59] at least in codfw [12:39:01] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2351.codfw.wmnet with reason: host reimage [12:41:03] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2382.codfw.wmnet with reason: host reimage [12:41:17] RECOVERY - Check whether ferm is active by checking the default input chain on mw2384 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:41:44] eventgate itself sees a constant stream of 500s even when this isn't happening (a few hundred a minute), which is a little concerning [12:41:50] hnowlan: 
there's some heap memory limit exceeded messages in the logs, as well as BadRequest / aborted request errors [12:42:50] I wonder if we have just raised the memory limit for the container, but not for the actual processes [12:43:38] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2426.codfw.wmnet with reason: host reimage [12:43:47] Opensearch still killing my firefox smh [12:44:57] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 94 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:46:16] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2419.codfw.wmnet with reason: host reimage [12:46:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T357189)', diff saved to https://phabricator.wikimedia.org/P57821 and previous config saved to /var/cache/conftool/dbconfig/20240223-124648-arnaudb.json [12:46:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [12:46:54] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:47:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [12:47:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T357189)', diff saved to https://phabricator.wikimedia.org/P57822 and previous config saved to /var/cache/conftool/dbconfig/20240223-124710-arnaudb.json [12:49:36] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2428.codfw.wmnet with reason: host reimage [12:49:57] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 73 probes of 736 
(alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:50:36] heap memory limit is 200MB/worker, num_workers is 0 which means one single process [12:50:54] So we raised the memory limit to 600MB for the container, but never for the actual process [12:51:21] I'd try and raise worker_heap_limit_mb to 500 and see what happens [12:51:43] Anyone who knows node.js able to weigh in on that? [12:52:18] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2353.codfw.wmnet with OS bullseye [12:52:28] sounds reasonable to me, might be worth throwing a CR at data engineering? [12:52:32] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571861 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2353.codfw.wmnet with OS bullseye completed: - mw2353 (**PASS**)... [12:52:52] hnowlan: in progress [12:53:07] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2394.codfw.wmnet with reason: host reimage [12:55:01] (03PS1) 10Clément Goubert: eventgate-main: Raise worker_heap_limit_mb to 500 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005974 (https://phabricator.wikimedia.org/T249745) [12:55:07] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2444.codfw.wmnet with OS bullseye [12:55:25] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2444.codfw.wmnet with OS bullseye completed: - mw2444 (**PASS**)... 
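[note] The arithmetic behind the eventgate-main discussion (12:50:36 to 12:55:01) can be sketched roughly as below. This is only an illustration under assumptions: the figures (600MB container limit, 200MB/worker heap limit, num_workers 0 meaning one single process, proposed value 500) come from the messages in this log, while the class and function names are hypothetical and are not part of eventgate, service-runner, or the deployment-charts repository. It also assumes the heap limit applies per process and ignores non-heap memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    """Hypothetical container for the three values quoted in the log."""
    container_limit_mb: int    # memory limit on the container
    worker_heap_limit_mb: int  # heap limit applied to each worker process
    num_workers: int           # 0 is taken to mean "one single process"


def process_count(cfg: MemoryConfig) -> int:
    # Per the log, num_workers = 0 still runs one process.
    return max(cfg.num_workers, 1)


def heap_ceiling_mb(cfg: MemoryConfig) -> int:
    # Total heap the processes may use before hitting their own limit.
    return process_count(cfg) * cfg.worker_heap_limit_mb


def unused_headroom_mb(cfg: MemoryConfig) -> int:
    # Container memory the heap limit never lets the service reach.
    return cfg.container_limit_mb - heap_ceiling_mb(cfg)


before = MemoryConfig(container_limit_mb=600, worker_heap_limit_mb=200, num_workers=0)
after = MemoryConfig(container_limit_mb=600, worker_heap_limit_mb=500, num_workers=0)

print(unused_headroom_mb(before))  # 400
print(unused_headroom_mb(after))   # 100
```

Under these assumptions, the process would hit its 200MB heap limit well before the 600MB container limit, which would be consistent with seeing "heap memory limit exceeded" messages but few OOMKills; raising the heap limit to 500 leaves about 100MB for non-heap memory.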
[12:55:49] (HelmReleaseBadStatus) firing: Helm release superset-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=superset-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:56:59] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 110 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:57:31] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2351.codfw.wmnet with OS bullseye [12:57:43] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571867 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2351.codfw.wmnet with OS bullseye completed: - mw2351 (**PASS**)... 
[13:00:49] (HelmReleaseBadStatus) resolved: Helm release superset-next/staging on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=superset-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:01:59] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 87 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:03:25] (SystemdUnitFailed) firing: (6) httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:40] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2382.codfw.wmnet with OS bullseye [13:04:52] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2382.codfw.wmnet with OS bullseye completed: - mw2382 (**WARN**)... 
[13:05:13] (03CR) 10Hnowlan: "lgtm, but would appreciate review from someone who knows eventgate's internals better" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005974 (https://phabricator.wikimedia.org/T249745) (owner: 10Clément Goubert) [13:07:27] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:07:58] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2426.codfw.wmnet with OS bullseye [13:08:08] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2426.codfw.wmnet with OS bullseye completed: - mw2426 (**WARN**)... [13:08:33] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2428.codfw.wmnet with OS bullseye [13:08:44] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2428.codfw.wmnet with OS bullseye completed: - mw2428 (**PASS**)... 
[13:09:01] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 98 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:09:51] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2419.codfw.wmnet with OS bullseye [13:10:04] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571879 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2419.codfw.wmnet with OS bullseye completed: - mw2419 (**WARN**)... [13:11:29] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts cumin1001.eqiad.wmnet [13:16:36] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:19:30] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2394.codfw.wmnet with OS bullseye [13:19:43] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#9571908 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2394.codfw.wmnet with OS bullseye completed: - mw2394 (**WARN**)... 
[13:20:08] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cumin1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:22:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cumin1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:22:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:22:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cumin1001.eqiad.wmnet [13:22:54] 10SRE, 10Infrastructure-Foundations: Decommission cumin1001 - https://phabricator.wikimedia.org/T358328#9571910 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `cumin1001.eqiad.wmnet` - cumin1001.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - F... 
[13:23:21] 10SRE, 10Infrastructure-Foundations: Decommission cumin1001 - https://phabricator.wikimedia.org/T358328#9571911 (10MoritzMuehlenhoff) [13:23:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove cumin1001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1005755 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [13:24:01] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 89 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:25:00] 10SRE, 10Infrastructure-Foundations: Decommission cumin1001 - https://phabricator.wikimedia.org/T358328#9571567 (10MoritzMuehlenhoff) [13:26:02] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Decommission cumin1001 - https://phabricator.wikimedia.org/T358328#9571914 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Jclark-ctr [13:27:58] 10SRE, 10Infrastructure-Foundations: Remove cumin1001 from router ACLs - https://phabricator.wikimedia.org/T353525#9571917 (10MoritzMuehlenhoff) It appears to me that the necessary steps are: - https://netbox.wikimedia.org/extras/scripts/capirca.GetHosts/ - Commit to in Homer via: ` homer "cr*" commit "decom... [13:29:44] (03PS2) 10Clément Goubert: ferm: Check ferm.service status in ferm_status.py [puppet] - 10https://gerrit.wikimedia.org/r/1005978 (https://phabricator.wikimedia.org/T354855) [13:30:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1005967 (https://phabricator.wikimedia.org/T358333) (owner: 10Majavah) [13:31:05] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 112 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:31:46] (03CR) 10Majavah: [C: 03+2] Remove toolschecker grid engine checks [puppet] - 10https://gerrit.wikimedia.org/r/1005967 (https://phabricator.wikimedia.org/T358333) (owner: 10Majavah) [13:33:39] 10SRE, 10Infrastructure-Foundations: Remove cumin1001 from router ACLs - https://phabricator.wikimedia.org/T353525#9571941 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:38:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1005970 (owner: 10Muehlenhoff) [13:39:11] (03CR) 10Filippo Giunchedi: "\o/ very nice" [puppet] - 10https://gerrit.wikimedia.org/r/1005967 (https://phabricator.wikimedia.org/T358333) (owner: 10Majavah) [13:41:00] (03PS1) 10Ayounsi: Add elinewmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1005983 (https://phabricator.wikimedia.org/T357097) [13:47:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T357189)', diff saved to https://phabricator.wikimedia.org/P57823 and previous config saved to /var/cache/conftool/dbconfig/20240223-134727-arnaudb.json [13:47:33] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:48:51] (03PS1) 10Marostegui: control-mariadb-10.6-bullseye: Change version [software] - 10https://gerrit.wikimedia.org/r/1005986 (https://phabricator.wikimedia.org/T357089) [13:49:27] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bullseye: Change version [software] - 10https://gerrit.wikimedia.org/r/1005986 (https://phabricator.wikimedia.org/T357089) (owner: 
10Marostegui) [13:50:06] (03CR) 10CI reject: [V: 04-1] control-mariadb-10.6-bullseye: Change version [software] - 10https://gerrit.wikimedia.org/r/1005986 (https://phabricator.wikimedia.org/T357089) (owner: 10Marostegui) [13:50:43] That's weird [13:51:05] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 90 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:51:43] (03PS1) 10Marostegui: control-mariadb-10.6-bullseye: Change version [software] - 10https://gerrit.wikimedia.org/r/1005987 (https://phabricator.wikimedia.org/T357089) [13:52:36] (03CR) 10CI reject: [V: 04-1] control-mariadb-10.6-bullseye: Change version [software] - 10https://gerrit.wikimedia.org/r/1005987 (https://phabricator.wikimedia.org/T357089) (owner: 10Marostegui) [13:53:44] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097#9572006 (10ayounsi) a:03ayounsi [13:54:05] (03PS1) 10Marostegui: control-mariadb-10.6-bullseye: New version [software] - 10https://gerrit.wikimedia.org/r/1005989 [13:54:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097#9572005 (10ayounsi) User added to the NDA LDAP group. Only thing left is the patch above once reviewed. 
[13:54:55] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/1005987 (https://phabricator.wikimedia.org/T357089) (owner: 10Marostegui) [13:55:15] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bullseye: New version [software] - 10https://gerrit.wikimedia.org/r/1005989 (owner: 10Marostegui) [13:55:46] (03Merged) 10jenkins-bot: control-mariadb-10.6-bullseye: New version [software] - 10https://gerrit.wikimedia.org/r/1005989 (owner: 10Marostegui) [13:56:47] (03PS1) 10Ayounsi: Add arthurtaylor to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1005991 (https://phabricator.wikimedia.org/T357147) [13:56:49] (03Abandoned) 10Marostegui: control-mariadb-10.6-bullseye: Change version [software] - 10https://gerrit.wikimedia.org/r/1005987 (https://phabricator.wikimedia.org/T357089) (owner: 10Marostegui) [13:56:54] (03Abandoned) 10Marostegui: control-mariadb-10.6-bullseye: Change version [software] - 10https://gerrit.wikimedia.org/r/1005986 (https://phabricator.wikimedia.org/T357089) (owner: 10Marostegui) [13:58:39] (JobUnavailable) firing: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:58:43] (03PS24) 10Slyngshede: P:mirrors::debian Export mirror age to textfile exporter [puppet] - 10https://gerrit.wikimedia.org/r/1003442 [14:02:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P57824 and previous config saved to /var/cache/conftool/dbconfig/20240223-140233-arnaudb.json [14:04:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147#9572026 (10ayounsi) a:03ayounsi [14:07:35] 10SRE, 
10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9572029 (10hashar) > What do you want to use as the host name, something like zuul1001? I'd go with `contint1003`. Daniel mentioned using the... [14:11:53] (03CR) 10Cathal Mooney: Adjust reimage cookbook to clear switch caches for vms too (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [14:12:57] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2002.codfw.wmnet with OS bullseye [14:13:03] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9572032 (10ayounsi) For testing hosts I'd prefer running on private IPs as those tend to have puppet disabled for longer period of time and "e... [14:17:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P57825 and previous config saved to /var/cache/conftool/dbconfig/20240223-141740-arnaudb.json [14:19:44] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Decommission cumin1001 - https://phabricator.wikimedia.org/T358328#9572042 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [14:21:47] (03PS1) 10Majavah: P:puppetdb: microservice: add ensure to auto_restarts [puppet] - 10https://gerrit.wikimedia.org/r/1005997 [14:22:10] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1005997 (owner: 10Majavah) [14:22:28] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [14:22:57] 10SRE, 10Infrastructure-Foundations, 10netops: Control IPv6 RA generation on core routers - https://phabricator.wikimedia.org/T358220#9572046 (10cmooney) 05Open→03Resolved [14:23:03] 10SRE, 10ops-codfw, 
10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9572047 (10cmooney) [14:23:16] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005997 (owner: 10Majavah) [14:23:45] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9476920 (10cmooney) [14:24:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005983 (https://phabricator.wikimedia.org/T357097) (owner: 10Ayounsi) [14:24:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [14:25:07] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868#9572048 (10cmooney) 05Open→03Resolved a:03cmooney closing, thanks for the help! 
[14:25:49] (03CR) 10Majavah: [C: 03+2] P:puppetdb: microservice: add ensure to auto_restarts [puppet] - 10https://gerrit.wikimedia.org/r/1005997 (owner: 10Majavah) [14:26:17] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005991 (https://phabricator.wikimedia.org/T357147) (owner: 10Ayounsi) [14:28:39] (JobUnavailable) resolved: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:31:08] (03CR) 10Ayounsi: [C: 03+2] Add elinewmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1005983 (https://phabricator.wikimedia.org/T357097) (owner: 10Ayounsi) [14:32:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T357189)', diff saved to https://phabricator.wikimedia.org/P57826 and previous config saved to /var/cache/conftool/dbconfig/20240223-143246-arnaudb.json [14:32:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [14:32:52] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:33:03] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team: wmf_auto_restart_cron.service failing in Cloud VPS bookworm instances - https://phabricator.wikimedia.org/T358343#9572080 (10taavi) [14:33:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [14:33:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:33:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:33:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T357189)', diff saved to https://phabricator.wikimedia.org/P57827 and previous config saved to /var/cache/conftool/dbconfig/20240223-143337-arnaudb.json [14:34:32] (03PS2) 10Ayounsi: Add arthurtaylor to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1005991 (https://phabricator.wikimedia.org/T357147) [14:34:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097#9572092 (10ayounsi) 05Stalled→03Resolved Give it ~30min for the change propagate and you should be good to go. Please let us know if there is any is... [14:36:21] (03CR) 10Ayounsi: [C: 03+2] Add arthurtaylor to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1005991 (https://phabricator.wikimedia.org/T357147) (owner: 10Ayounsi) [14:37:19] (03PS1) 10Majavah: P:docker::prune: fix hiera key [puppet] - 10https://gerrit.wikimedia.org/r/1005999 [14:37:56] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye [14:38:39] (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:46] (03PS1) 10Filippo Giunchedi: thanos: ship tool to analyze query apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) [14:40:04] (03CR) 10CI reject: [V: 04-1] thanos: ship tool to analyze query apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [14:40:53] 10SRE, 10SRE-Access-Requests, 
10Patch-For-Review: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147#9572103 (10ayounsi) 05Stalled→03Resolved Give it ~30min for the change propagate and you should be good to go. Please let us know if there is an... [14:42:04] !log running `homer 'cr*codfw*' commit 'T354791'` for reclaimed codfw jobrunners moving to k8s workers [14:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:10] T354791: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 [14:43:25] (SystemdUnitFailed) firing: (6) ferm.service on kubernetes2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:30] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@b115452]: Deploy Refine job POC on test cluster - update [14:43:43] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@b115452]: Deploy Refine job POC on test cluster - update (duration: 00m 12s) [14:48:25] (SystemdUnitFailed) firing: (8) ferm.service on kubernetes2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:48:29] !log hnowlan@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=(mw2351.codfw.wmnet|mw2353.codfw.wmnet|mw2382.codfw.wmnet|mw2394.codfw.wmnet|mw2419.codfw.wmnet|mw2426.codfw.wmnet|mw2428.codfw.wmnet|mw2444.codfw.wmnet),cluster=kubernetes,service=kubesvc [14:49:00] (03PS1) 10Ayounsi: Set ENFORCE_GLOBAL_UNIQUE to True [puppet] - 10https://gerrit.wikimedia.org/r/1006001 (https://phabricator.wikimedia.org/T336275) [14:49:39] (03CR) 10Majavah: "We don't write the file on every maintain-dbusers run. 
The write endpoint is only called when maintain-dbusers thinks the password has cha" [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [14:50:22] (03PS4) 10Majavah: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) [14:50:55] (03PS2) 10Ayounsi: Netbox: set ENFORCE_GLOBAL_UNIQUE to True [puppet] - 10https://gerrit.wikimedia.org/r/1006001 (https://phabricator.wikimedia.org/T336275) [14:53:25] (SystemdUnitFailed) firing: (9) ferm.service on kubernetes2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:53:49] 10SRE, 10Infrastructure-Foundations: Remove cumin1001 from router ACLs - https://phabricator.wikimedia.org/T353525#9572159 (10MoritzMuehlenhoff) 05Open→03Resolved The Homer config change was generated and deployed (and along with it, the rules for the new apt2002 server were also distributed) [14:53:51] 10SRE, 10Infrastructure-Foundations: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419#9572161 (10MoritzMuehlenhoff) [14:53:54] (03PS1) 10Jelto: etherpad: disable auto restart when etherpad is stopped [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) [14:54:05] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [14:54:37] 10SRE, 10Infrastructure-Foundations: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419#9572163 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All completed, the only missing part if the physical decommission steps, which have a separate task. 
[14:55:39] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [14:56:57] (03CR) 10Ssingh: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: 10Fabfur) [14:57:34] (03CR) 10Muehlenhoff: etherpad: disable auto restart when etherpad is stopped (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [14:58:39] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:59] PROBLEM - Check whether ferm is active by checking the default input chain on mw2382 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:59:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2203.mgmt.codfw.wmnet with reboot policy FORCED [15:01:12] (03CR) 10Eevans: [C: 03+1] Remove obsolete entry [puppet] - 10https://gerrit.wikimedia.org/r/1005971 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:01:49] (03PS2) 10Filippo Giunchedi: thanos: ship tool to analyze query apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) [15:01:51] (03PS1) 10Filippo Giunchedi: thanos: limit query length in frontend [puppet] - 10https://gerrit.wikimedia.org/r/1006006 (https://phabricator.wikimedia.org/T356788) [15:02:42] (stashbot was gone for a few minutes, jhancock might want to repeat that !log message if it was 
important) [15:03:01] PROBLEM - statsv process on webperf2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [15:03:03] (03CR) 10CI reject: [V: 04-1] thanos: ship tool to analyze query apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [15:03:07] (03PS2) 10Jelto: etherpad: disable auto restart when etherpad is stopped [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) [15:04:03] RECOVERY - statsv process on webperf2003 is OK: PROCS OK: 2 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [15:04:13] PROBLEM - Check whether ferm is active by checking the default input chain on mw2426 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:04:14] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete entry [puppet] - 10https://gerrit.wikimedia.org/r/1005971 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:05:49] (03PS3) 10Filippo Giunchedi: thanos: ship tool to analyze query apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) [15:05:51] (03PS2) 10Filippo Giunchedi: thanos: limit query length in frontend [puppet] - 10https://gerrit.wikimedia.org/r/1006006 (https://phabricator.wikimedia.org/T356788) [15:06:06] (03PS1) 10Kamila Součková: shellbox: bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006007 [15:06:11] PROBLEM - Check whether ferm is active by checking the default input chain on mw2419 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:08:01] (03CR) 10Jelto: 
etherpad: disable auto restart when etherpad is stopped (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:08:04] (03PS2) 10Eevans: restbase: provision restbase1036-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005592 (https://phabricator.wikimedia.org/T354560) [15:08:05] (03PS2) 10Eevans: restbase: provision restbase1037-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005593 (https://phabricator.wikimedia.org/T354560) [15:08:07] (03PS2) 10Eevans: restbase: provision restbase1038-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005594 (https://phabricator.wikimedia.org/T354560) [15:08:10] (03PS2) 10Eevans: restbase: provision restbase1039-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560) [15:08:12] (03PS2) 10Eevans: restbase: provision restbase1040-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005596 (https://phabricator.wikimedia.org/T354560) [15:08:14] (03PS2) 10Eevans: restbase: provision restbase1041-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560) [15:08:16] (03PS2) 10Eevans: restbase: provision restbase1042-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005598 (https://phabricator.wikimedia.org/T354560) [15:08:40] (03CR) 10Muehlenhoff: etherpad: disable auto restart when etherpad is stopped (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:10:03] (03PS1) 10Btullis: Improve the superset nginx reverse proxy configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005542 (https://phabricator.wikimedia.org/T357890) [15:10:48] (03PS2) 10Btullis: Improve the superset nginx reverse proxy configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005542 (https://phabricator.wikimedia.org/T357890) [15:12:00] 
(03PS3) 10Btullis: Improve the superset nginx reverse proxy configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005542 (https://phabricator.wikimedia.org/T357890) [15:12:24] (03PS3) 10Jelto: etherpad: disable auto restart when etherpad is stopped [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) [15:12:35] (03PS4) 10Btullis: Improve the superset nginx reverse proxy configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005542 (https://phabricator.wikimedia.org/T357890) [15:13:47] PROBLEM - Check whether ferm is active by checking the default input chain on mw2394 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:13:59] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:14:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2203.mgmt.codfw.wmnet with reboot policy FORCED [15:14:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:16:49] (03CR) 10Jelto: [V: 03+1 C: 03+2] etherpad: disable auto restart when etherpad is stopped [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:17:05] (03CR) 10Jelto: [V: 03+1 C: 03+2] etherpad: disable auto restart when etherpad is stopped (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006004 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:23:25] (SystemdUnitFailed) firing: (9) ferm.service on kubernetes2006:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:24:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1005743 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [15:27:23] (03CR) 10Btullis: [C: 03+1] "This looks good to me. I'm also happy to keep an eye on this when we deploy." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005974 (https://phabricator.wikimedia.org/T249745) (owner: 10Clément Goubert) [15:27:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2196.mgmt.codfw.wmnet with reboot policy FORCED [15:27:46] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9572268 (10Jelto) [15:27:58] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#8953468 (10Jelto) [15:28:59] RECOVERY - Check whether ferm is active by checking the default input chain on mw2382 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:33:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2196.mgmt.codfw.wmnet with reboot policy FORCED [15:34:13] RECOVERY - Check whether ferm is active by checking the default input chain on mw2426 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:36:07] (03Abandoned) 10Gehel: ApiFeatureUsage logstash servers are owned by Observability. 
[puppet] - 10https://gerrit.wikimedia.org/r/869582 (https://phabricator.wikimedia.org/T325880) (owner: 10Gehel) [15:36:11] RECOVERY - Check whether ferm is active by checking the default input chain on mw2419 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:36:19] (03CR) 10Clément Goubert: [C: 03+2] eventgate-main: Raise worker_heap_limit_mb to 500 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005974 (https://phabricator.wikimedia.org/T249745) (owner: 10Clément Goubert) [15:37:05] (03Abandoned) 10Gehel: to illustrate comment on parent CR [puppet] - 10https://gerrit.wikimedia.org/r/963949 (owner: 10Gehel) [15:37:13] (03Merged) 10jenkins-bot: eventgate-main: Raise worker_heap_limit_mb to 500 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005974 (https://phabricator.wikimedia.org/T249745) (owner: 10Clément Goubert) [15:37:42] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM! I like the solution you implemented, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1006006 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [15:38:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T357189)', diff saved to https://phabricator.wikimedia.org/P57828 and previous config saved to /var/cache/conftool/dbconfig/20240223-153832-arnaudb.json [15:38:33] !log Deploying 1005974 to eventgate-main - T249745 [15:38:40] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [15:38:42] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:51] T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 [15:38:54] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [15:39:30] !log 
cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [15:40:04] (03CR) 10Andrea Denisse: thanos: ship tool to analyze query apache access logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006000 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [15:40:23] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [15:41:38] (03Abandoned) 10Gehel: tlsproxy: manage ssl_ecdhe_curve internally [puppet] - 10https://gerrit.wikimedia.org/r/789559 (https://phabricator.wikimedia.org/T307510) (owner: 10Gehel) [15:43:14] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [15:44:03] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [15:46:36] (03CR) 10Herron: [C: 03+1] "good idea, LGTM! one minor question inline" [puppet] - 10https://gerrit.wikimedia.org/r/1006006 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [15:48:25] (SystemdUnitFailed) firing: (6) ferm.service on kubernetes2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:27] 10SRE, 10Infrastructure-Foundations, 10Mail: Integrations tests - https://phabricator.wikimedia.org/T358355#9572381 (10jhathaway) [15:48:39] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:03] RECOVERY - Check whether ferm is active by checking the default input chain on mw1457 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:51:11] RECOVERY - Check whether ferm is active by checking the default input chain on mw1494 
is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:53:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P57829 and previous config saved to /var/cache/conftool/dbconfig/20240223-155338-arnaudb.json [16:05:41] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P57830 and previous config saved to /var/cache/conftool/dbconfig/20240223-160845-arnaudb.json [16:09:42] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1100.eqiad.wmnet,service=(cdn|ats-be) [16:13:46] (03Abandoned) 10Ssingh: conftool: update schema for dnsbox for anycast authdns setups [puppet] - 10https://gerrit.wikimedia.org/r/1004205 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:13:47] RECOVERY - Check whether ferm is active by checking the default input chain on mw2394 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:22:14] (03CR) 10David Caro: [C: 03+1] "LGTM \o/" [puppet] - 10https://gerrit.wikimedia.org/r/1004069 (owner: 10Majavah) [16:23:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T357189)', diff saved to https://phabricator.wikimedia.org/P57831 and previous config saved to /var/cache/conftool/dbconfig/20240223-162351-arnaudb.json [16:23:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [16:23:59] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 
[16:24:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [16:24:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T357189)', diff saved to https://phabricator.wikimedia.org/P57832 and previous config saved to /var/cache/conftool/dbconfig/20240223-162426-arnaudb.json [16:24:31] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1004068 (owner: 10Majavah) [16:26:46] (03CR) 10Majavah: [C: 03+2] openstack: designate: use root@wmcloud.org in SOA records [puppet] - 10https://gerrit.wikimedia.org/r/1004068 (owner: 10Majavah) [16:26:55] (03CR) 10Majavah: [C: 03+2] openstack: stop creating PROJECT.wmflabs.org zones [puppet] - 10https://gerrit.wikimedia.org/r/1004069 (owner: 10Majavah) [16:28:16] (03PS1) 10Ssingh: conftool-data: add dnsbox hosts data [puppet] - 10https://gerrit.wikimedia.org/r/1006021 (https://phabricator.wikimedia.org/T347054) [16:29:31] (03CR) 10Btullis: [C: 03+1] "Now setting this to +1." [dns] - 10https://gerrit.wikimedia.org/r/998440 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [16:39:05] (03PS1) 10Btullis: Rename victorops-analytics to victorops-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1006047 (https://phabricator.wikimedia.org/T344202) [16:45:20] (03CR) 10Btullis: "Shall we move this check to the new data-platform team in Alertmanager?" 
[alerts] - 10https://gerrit.wikimedia.org/r/1005540 (https://phabricator.wikimedia.org/T356484) (owner: 10Stevemunene) [16:59:41] PROBLEM - superset.wikimedia.org requires authentication on an-tool1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:00:31] RECOVERY - superset.wikimedia.org requires authentication on an-tool1010 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 548 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:03:42] (03CR) 10BBlack: [C: 03+1] conftool-data: add dnsbox hosts data [puppet] - 10https://gerrit.wikimedia.org/r/1006021 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:13:25] (SystemdUnitFailed) resolved: (3) ferm.service on kubernetes2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:28] 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9572625 (10thcipriani) so ` httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug.discovery.wmnet --https_port=4444` from... [17:16:20] 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9572630 (10dancy) >>! In T358117#9572625, @thcipriani wrote: > so ` httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug... 
[17:25:37] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:28:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T357189)', diff saved to https://phabricator.wikimedia.org/P57833 and previous config saved to /var/cache/conftool/dbconfig/20240223-172856-arnaudb.json [17:29:03] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:32:31] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:36:41] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2040 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:42:47] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review, 10User-notice: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9572727 (10Trizek-WMF) [17:44:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P57834 and previous config saved to /var/cache/conftool/dbconfig/20240223-174403-arnaudb.json [17:49:17] (03PS5) 10RLazarus: mediawiki: Support one-off jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/988849 (https://phabricator.wikimedia.org/T341553) [17:49:19] (03PS8) 10RLazarus: Add helmfile for running MediaWiki one-off jobs. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) [17:51:22] (03CR) 10RLazarus: [C: 03+2] mediawiki: Support one-off jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/988849 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [17:51:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253#9572739 (10wiki_willy) Hi @ssingh - the hardware should still be around, and we should be able to reallocate one of them for testing purposes. Can you shoot open a new Phab... [17:52:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9572740 (10BTullis) [17:53:26] (03Merged) 10jenkins-bot: mediawiki: Support one-off jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/988849 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [17:53:49] (03CR) 10RLazarus: [C: 03+2] Add helmfile for running MediaWiki one-off jobs. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [17:55:02] (03Merged) 10jenkins-bot: Add helmfile for running MediaWiki one-off jobs. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/988850 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [17:55:14] (03CR) 10RLazarus: [C: 03+2] deployment_server: Add mwscript_k8s [puppet] - 10https://gerrit.wikimedia.org/r/988851 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [17:55:44] !log T357007 Running mwscript CampaignEvents:GenerateInvitationList --wiki=metawiki --listfile=/home/daimona/list.txt [17:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:50] T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007 [17:59:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P57835 and previous config saved to /var/cache/conftool/dbconfig/20240223-175909-arnaudb.json [18:01:37] (03PS1) 10RLazarus: admin: Add kappalaya to sre-admins in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1006051 [18:03:06] (03CR) 10CI reject: [V: 04-1] admin: Add kappalaya to sre-admins in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1006051 (owner: 10RLazarus) [18:03:08] (03CR) 10BryanDavis: [C: 03+1] Convert remaining images to shell webservice-runner [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1005952 (https://phabricator.wikimedia.org/T293552) (owner: 10Majavah) [18:04:59] (03CR) 10RLazarus: "Oh right of course, because this refers to the sre-admins posix group, not the sre-admins ldap group. 
I guess it's a separate question whe" [puppet] - 10https://gerrit.wikimedia.org/r/1006051 (owner: 10RLazarus) [18:05:07] (03Abandoned) 10RLazarus: admin: Add kappalaya to sre-admins in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1006051 (owner: 10RLazarus) [18:05:24] (03PS1) 10Btullis: Add a partition recipe for an-redacteddb [puppet] - 10https://gerrit.wikimedia.org/r/1006052 (https://phabricator.wikimedia.org/T355571) [18:11:56] 10ops-codfw, 10serviceops: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380#9572779 (10Jhancock.wm) drive has been replaced. Physically I don't have any alarms, but let me know if you are still having issues with the RAID. [18:14:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T357189)', diff saved to https://phabricator.wikimedia.org/P57838 and previous config saved to /var/cache/conftool/dbconfig/20240223-181416-arnaudb.json [18:14:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [18:14:22] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:14:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [18:14:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T357189)', diff saved to https://phabricator.wikimedia.org/P57839 and previous config saved to /var/cache/conftool/dbconfig/20240223-181437-arnaudb.json [18:29:41] (03PS1) 10JHathaway: dev: explicitly don't use SRV records for dcl [puppet] - 10https://gerrit.wikimedia.org/r/1006056 [18:35:35] (03PS1) 10Ssingh: dns6001: set confd_enabled to false [puppet] - 10https://gerrit.wikimedia.org/r/1006057 (https://phabricator.wikimedia.org/T347054) [18:37:00] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1454/co" [puppet] - 10https://gerrit.wikimedia.org/r/1006057 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:37:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2203.mgmt.codfw.wmnet with reboot policy FORCED [18:39:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2203.mgmt.codfw.wmnet with reboot policy FORCED [18:41:06] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [18:42:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:43:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2203.mgmt.codfw.wmnet with reboot policy FORCED [18:43:31] (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1036-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005592 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans) [18:44:17] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1006056 (owner: 10JHathaway) [18:45:54] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2196'] [18:46:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2196'] [18:46:21] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2196'] [18:47:24] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2197'] [18:47:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2197'] [18:47:48] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2198'] [18:48:04] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2198'] [18:48:15] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2199'] [18:48:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2199'] [18:49:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2200'] [18:49:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2200'] [18:49:51] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2201'] [18:50:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2201'] [18:50:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2202'] [18:50:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts 
['db2202'] [18:50:44] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2204'] [18:50:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2204'] [18:51:24] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2205'] [18:51:35] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2205'] [18:51:50] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2206'] [18:51:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2196'] [18:52:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2206'] [18:52:21] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2207'] [18:52:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2207'] [18:52:48] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2207'] [18:53:05] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2200'] [18:53:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2207'] [18:54:05] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2208'] [18:54:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2208'] [18:55:05] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2209'] [18:55:15] !log jhancock@cumin2002 END (FAIL) - 
Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2209'] [18:55:34] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2210'] [18:55:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2210'] [18:55:59] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2211'] [18:56:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2211'] [18:56:27] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2212'] [18:56:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2212'] [18:56:55] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2213'] [18:57:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2213'] [18:57:25] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2214'] [18:57:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2214'] [18:58:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2200'] [19:01:37] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2215'] [19:01:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2215'] [19:01:58] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2216'] [19:02:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=1) upgrade firmware for hosts ['db2216'] [19:02:15] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2217'] [19:02:26] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2217'] [19:02:36] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2218'] [19:02:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2218'] [19:02:51] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2219'] [19:03:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2219'] [19:03:09] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2220'] [19:03:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2220'] [19:03:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2203.mgmt.codfw.wmnet with reboot policy FORCED [19:03:47] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2220'] [19:03:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2220'] [19:04:37] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1036.eqiad.wmnet with reason: Bootstrapping — T354560 [19:04:43] T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560 [19:04:51] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1036.eqiad.wmnet with reason: Bootstrapping — T354560 [19:12:43] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1242 (T357189)', diff saved to https://phabricator.wikimedia.org/P57840 and previous config saved to /var/cache/conftool/dbconfig/20240223-191243-arnaudb.json [19:12:49] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:27:19] (03PS1) 10Cathal Mooney: LVS: Only allow IPv6 default route from RAs on primary interface [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) [19:27:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P57841 and previous config saved to /var/cache/conftool/dbconfig/20240223-192749-arnaudb.json [19:28:31] (03CR) 10CI reject: [V: 04-1] LVS: Only allow IPv6 default route from RAs on primary interface [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney) [19:30:59] (03PS2) 10Cathal Mooney: LVS: Only allow IPv6 default route from RAs on primary interface [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) [19:42:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P57842 and previous config saved to /var/cache/conftool/dbconfig/20240223-194255-arnaudb.json [19:43:27] arnaudb@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[19:47:17] (03PS1) 10FNegri: P:wmcs::backup_cinder_volumes: avoid race condition [puppet] - 10https://gerrit.wikimedia.org/r/1006066 (https://phabricator.wikimedia.org/T356904) [19:48:40] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:48:42] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1006066 (https://phabricator.wikimedia.org/T356904) (owner: 10FNegri) [19:49:59] (03PS2) 10FNegri: P:wmcs::backup_cinder_volumes: avoid race condition [puppet] - 10https://gerrit.wikimedia.org/r/1006066 (https://phabricator.wikimedia.org/T356904) [19:50:38] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1006066 (https://phabricator.wikimedia.org/T356904) (owner: 10FNegri) [19:52:32] (03CR) 10JHathaway: [C: 03+2] dev: explicitly don't use SRV records for dcl [puppet] - 10https://gerrit.wikimedia.org/r/1006056 (owner: 10JHathaway) [19:58:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T357189)', diff saved to https://phabricator.wikimedia.org/P57843 and previous config saved to /var/cache/conftool/dbconfig/20240223-195802-arnaudb.json [19:58:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance [19:58:08] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:58:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance [19:58:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T357189)', diff saved to https://phabricator.wikimedia.org/P57844 and previous config saved to /var/cache/conftool/dbconfig/20240223-195835-arnaudb.json 
[19:59:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2197'] [19:59:27] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2198'] [19:59:30] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2199'] [19:59:34] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2201'] [19:59:37] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2202'] [19:59:43] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2203'] [20:00:36] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2204'] [20:00:41] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2205'] [20:05:42] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:06:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2199'] [20:06:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2201'] [20:06:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2204'] [20:07:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2197'] [20:07:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2198'] [20:07:12] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2202'] [20:07:13] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [20:07:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2203'] [20:07:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2205'] [20:07:44] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [20:08:14] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [20:08:45] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [20:09:16] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [20:09:46] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [20:10:16] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [20:10:48] jhancock@cumin2002: Failed to log message to wiki. Somebody should check the error logs. 
[20:18:52] (03PS1) 10Ilias Sarantopoulos: ml-services: update article descriptions image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006074 (https://phabricator.wikimedia.org/T358195) [20:23:46] !log [relog due to stashbot errors] jhancock@cumin2002 ran cookbook SRE.hardware.upgrade-firmware for hosts db2201/db2204/db2197/db2198/db2202/db2203/db2205 and all END PASS [20:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:16] PROBLEM - SSH on an-worker1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:34:14] RECOVERY - SSH on an-worker1136 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:34:30] PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:36:53] (03PS2) 10Ilias Sarantopoulos: ml-services: update article descriptions image and add GPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006074 (https://phabricator.wikimedia.org/T358195) [20:41:59] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update article descriptions image and add GPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006074 (https://phabricator.wikimedia.org/T358195) (owner: 10Ilias Sarantopoulos) [20:42:49] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[20:48:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2196.codfw.wmnet with OS bookworm [20:48:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9573343 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2196.codfw.wmnet with OS bookworm [20:52:34] RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:56:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T357189)', diff saved to https://phabricator.wikimedia.org/P57845 and previous config saved to /var/cache/conftool/dbconfig/20240223-205630-arnaudb.json [20:56:37] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:03:49] Hey all - I know it’s Friday, but I was going to try to get an updated security mitigation out for T336027. It’s a PS.php deploy, targeted to a very specific set of users on a specific (small) set of wikis. 
[21:04:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2196.codfw.wmnet with reason: host reimage [21:07:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2196.codfw.wmnet with reason: host reimage [21:09:25] (SystemdUnitFailed) firing: prometheus-dpkg-success-textfile.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:11:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P57846 and previous config saved to /var/cache/conftool/dbconfig/20240223-211136-arnaudb.json [21:26:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P57847 and previous config saved to /var/cache/conftool/dbconfig/20240223-212643-arnaudb.json [21:41:00] sbassett: seems reasonable as long as there's an SRE around in case something goes awry [21:41:22] sbassett, thcipriani: i can scap if needed [21:41:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T357189)', diff saved to https://phabricator.wikimedia.org/P57848 and previous config saved to /var/cache/conftool/dbconfig/20240223-214149-arnaudb.json [21:41:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [21:41:57] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:42:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [21:42:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T357189)', diff saved to https://phabricator.wikimedia.org/P57850 and previous 
config saved to /var/cache/conftool/dbconfig/20240223-214211-arnaudb.json [21:43:30] thci [21:43:45] thcipriani brennen: tx, scapping out change right now [21:43:55] cool. [21:44:25] (SystemdUnitFailed) resolved: prometheus-dpkg-success-textfile.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:25] (SystemdUnitFailed) firing: prometheus-dpkg-success-textfile.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:49:53] !log Deployed updated security mitigation for T336027 [21:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:21] Deploy seems stable so far [21:57:25] (SystemdUnitFailed) resolved: prometheus-dpkg-success-textfile.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T357189)', diff saved to https://phabricator.wikimedia.org/P57852 and previous config saved to /var/cache/conftool/dbconfig/20240223-222920-arnaudb.json [22:29:27] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [22:44:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P57853 and previous config saved to /var/cache/conftool/dbconfig/20240223-224427-arnaudb.json [22:59:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P57854 and 
previous config saved to /var/cache/conftool/dbconfig/20240223-225933-arnaudb.json [23:14:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T357189)', diff saved to https://phabricator.wikimedia.org/P57855 and previous config saved to /var/cache/conftool/dbconfig/20240223-231440-arnaudb.json [23:14:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [23:14:49] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [23:14:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [23:48:41] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:59:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [23:59:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [23:59:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T357189)', diff saved to https://phabricator.wikimedia.org/P57856 and previous config saved to /var/cache/conftool/dbconfig/20240223-235919-arnaudb.json [23:59:25] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189