[00:00:20] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:50] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:14:43] 10SRE, 10Infrastructure-Foundations: investigate making 'notrack' the default on our ferm rules - https://phabricator.wikimedia.org/T240495 (10Dzahn) [00:15:39] 10SRE, 10serviceops: Request to block ActionApi client (based on a specific user agent header) - https://phabricator.wikimedia.org/T243858 (10Dzahn) [00:17:50] 10SRE, 10Infrastructure-Foundations, 10Security: Access requests process: Consideration of 'indirect' sudo rules via e.g. keyholder - https://phabricator.wikimedia.org/T207739 (10Dzahn) [00:19:30] 10SRE, 10Data-Engineering, 10Security: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532 (10Dzahn) [00:21:46] 10SRE, 10Infrastructure-Foundations: Improve management of users/groups on servers in production - https://phabricator.wikimedia.org/T235161 (10Dzahn) [00:25:16] 10SRE: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132 (10Dzahn) I am going to close this as resolved as the one who created it back in 2015. Since we haven't had updates here since 2016 and people are using the search console all the time it seems safe to assume there is no... 
[00:25:30] 10SRE: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132 (10Dzahn) 05Open→03Resolved a:03Dzahn [00:26:07] 10SRE, 10SRE-Access-Requests: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283 (10Dzahn) [00:26:14] 10SRE, 10Infrastructure-Foundations: Look into feasibility of disabling sha-1 host keys on our ssh daemons - https://phabricator.wikimedia.org/T167966 (10Dzahn) [00:29:05] (03PS3) 10Dwisehaupt: Add monitoring for new fr-tech hosts [puppet] - 10https://gerrit.wikimedia.org/r/916617 (https://phabricator.wikimedia.org/T334505) [00:29:42] (03CR) 10jenkins-bot: Add monitoring for new fr-tech hosts [puppet] - 10https://gerrit.wikimedia.org/r/916617 (https://phabricator.wikimedia.org/T334505) (owner: 10Dwisehaupt) [00:30:13] (03PS4) 10Dwisehaupt: Add monitoring for new fr-tech hosts [puppet] - 10https://gerrit.wikimedia.org/r/916617 (https://phabricator.wikimedia.org/T334505) [00:30:55] 10SRE, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10Dzahn) @Muehlenhoff Grepping through the puppet repo I dont see any "mod_access_compat" nowadays. The httpd class still has the legacy_compat option but nobody uses it anymore: ` modules/httpd/spec/c... 
[00:31:22] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10Dzahn) [00:36:26] RECOVERY - Check systemd state on dbstore1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:55] 10SRE, 10Infrastructure-Foundations: Redefine privileges and access for perf-roots group - https://phabricator.wikimedia.org/T207666 (10Dzahn) [00:36:57] 10SRE, 10Infrastructure-Foundations: Redefine privileges and access for perf-roots group - https://phabricator.wikimedia.org/T207666 (10Dzahn) Looking at this again today I would say the difference is that membership in "deployment" group just gives you a limited set of commands you can run as root while perf-... [00:38:28] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:39:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/916924 [00:39:26] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/916924 (owner: 10TrainBranchBot) [00:39:49] 10SRE, 10Pywikibot: WMFTimeoutException on non-existent files - https://phabricator.wikimedia.org/T245374 (10Dzahn) Tempting to just merge this ticket into T89971 then. 
[00:50:28] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (2) wdqs2006:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:57:53] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/916924 (owner: 10TrainBranchBot) [01:04:03] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804 (10Dzahn) [01:04:08] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804 (10Dzahn) ferm services without srange: category A, those that seem obviously public but are not made explicit as this ticket suggests: ` modules/role/manifests... [01:05:41] 10SRE, 10observability, 10Goal: Handle HBA controllers in get-raid-status-hpssacli - https://phabricator.wikimedia.org/T185216 (10Dzahn) [01:34:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:34:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [01:34:20] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye completed:... 
[01:37:52] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10Papaul) @aborrero i went ahead and setup the node in site.pp with role::insetup::wmcs and re-image it so you can put the node in service. Thanks [01:44:12] 10SRE, 10Keyholder, 10Release-Engineering-Team (Seen): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10Dzahn) The repo on Phabricator has been closed meanwhile. I wonder if there are things left to do here now in 2023. [01:46:48] 10SRE, 10serviceops: Nutcracker stats monitoring should only listen on localhost - https://phabricator.wikimedia.org/T111934 (10Dzahn) [01:48:32] 10SRE, 10observability, 10serviceops: Nutcracker stats monitoring should only listen on localhost - https://phabricator.wikimedia.org/T111934 (10Dzahn) [01:50:52] 10SRE, 10Infrastructure Security: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750 (10Dzahn) [01:51:06] 10SRE, 10Infrastructure Security: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (10Dzahn) [01:52:19] 10SRE, 10Infrastructure Security: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (10Dzahn) > Only two people (mutante and moritzm) have permissions to add users (via signing the users file) This will be fixed by T333212. 
[01:52:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wcqs1002:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:07:54] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:16] (03PS1) 10Andrew Bogott: clouds.yaml: specify system_scope for special system sections [puppet] - 10https://gerrit.wikimedia.org/r/917986 (https://phabricator.wikimedia.org/T330759) [02:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:14:08] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:18:48] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:54] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:23:32] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - 
https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [02:27:54] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:16] (03PS1) 10Andrew Bogott: mwopenstackclients.py: support system-scoped sessions from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/917988 [02:28:36] (03PS2) 10Andrew Bogott: mwopenstackclients.py: support system-scoped sessions from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/917988 [02:30:21] (03CR) 10Andrew Bogott: [C: 03+2] clouds.yaml: specify system_scope for special system sections [puppet] - 10https://gerrit.wikimedia.org/r/917986 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [02:31:45] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients.py: support system-scoped sessions from clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/917988 (owner: 10Andrew Bogott) [02:33:08] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: use clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/917992 [02:33:48] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks.py: use clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/917992 (owner: 10Andrew Bogott) [02:37:17] (03PS1) 10Andrew Bogott: clouds.yaml: further fixes to support system scope sections [puppet] - 10https://gerrit.wikimedia.org/r/917993 (https://phabricator.wikimedia.org/T330759) [02:41:37] (03CR) 10Andrew Bogott: [C: 03+2] clouds.yaml: further fixes to support system scope sections [puppet] - 10https://gerrit.wikimedia.org/r/917993 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [02:46:59] (03PS1) 10Andrew Bogott: clouds.yaml: yet further fixes to support system scope sections [puppet] - 10https://gerrit.wikimedia.org/r/917994 
(https://phabricator.wikimedia.org/T330759) [02:47:58] (03CR) 10Andrew Bogott: [C: 03+2] clouds.yaml: yet further fixes to support system scope sections [puppet] - 10https://gerrit.wikimedia.org/r/917994 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [02:49:52] (03PS1) 10Andrew Bogott: clouds.yaml: and yet still further fixes to support system scope sections [puppet] - 10https://gerrit.wikimedia.org/r/917995 (https://phabricator.wikimedia.org/T330759) [02:50:39] (03CR) 10Andrew Bogott: [C: 03+2] clouds.yaml: and yet still further fixes to support system scope sections [puppet] - 10https://gerrit.wikimedia.org/r/917995 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [03:02:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wcqs1002:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:38:17] (03PS4) 10Dzahn: gerrit: switch service name, turn new into current and current into old [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T326368) [03:39:03] (03CR) 10CI reject: [V: 04-1] gerrit: switch service name, turn new into current and current into old [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [03:40:15] (03PS5) 10Dzahn: gerrit: switch service name, turn new into current and current into old [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T326368) [03:43:58] (03PS6) 10Dzahn: gerrit: switch service IP, turn new into current and current into old [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T326368) [04:10:15] !log gerrit1001 - rsyncing data over to gerrit1003, as root in a screen, but slowly with bwlimit 
5m [04:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:08] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:18:46] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:38:40] PROBLEM - Host db1225 is DOWN: PING CRITICAL - Packet loss = 100% [04:50:36] (03PS1) 10TChin: Add flink-app default log config and use it in page_content_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) [05:03:22] (03PS1) 10RLazarus: remote: Clarify wait_reboot_since output [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 [05:05:48] (03CR) 10RLazarus: "Riccardo: I'm open to discussion about the details of how this should look, including whether or not it should still be an exception -- I " [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 (owner: 10RLazarus) [05:06:49] I am checking db1225 [05:08:45] (03PS1) 10KartikMistry: Update MinT to 2023-05-10-045734-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/918002 (https://phabricator.wikimedia.org/T331505) [05:08:55] 10SRE, 10ops-codfw, 10DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Marostegui) Thank you Papaul, just got in! 
[05:11:03] * kart_ updating MinT [05:11:28] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-05-10-045734-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/918002 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [05:12:14] (03Merged) 10jenkins-bot: Update MinT to 2023-05-10-045734-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/918002 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [05:17:08] RECOVERY - Host db1225 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [05:18:44] PROBLEM - MariaDB Replica IO: s2 on db1225 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:20:50] (03PS1) 10Marostegui: db1225: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/918004 (https://phabricator.wikimedia.org/T336326) [05:21:15] (03CR) 10Marostegui: "jcrespo remember to revert this once you are ready to get the host back in production" [puppet] - 10https://gerrit.wikimedia.org/r/918004 (https://phabricator.wikimedia.org/T336326) (owner: 10Marostegui) [05:21:22] (03CR) 10Marostegui: [C: 03+2] db1225: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/918004 (https://phabricator.wikimedia.org/T336326) (owner: 10Marostegui) [05:26:27] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [05:28:08] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [05:32:27] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [05:35:37] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [05:37:07] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [05:42:03] 10SRE, 10ops-codfw, 10DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Marostegui) 05Open→03Resolved [05:42:05] 10SRE, 
10Infrastructure-Foundations, 10observability, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10Marostegui) [05:42:21] (03PS1) 10Marostegui: Revert "db2180: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/917737 [05:42:24] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [05:42:42] !log Updated MinT to 2023-05-10-045734-production (T331505) [05:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:46] T331505: Self hosted machine translation service - https://phabricator.wikimedia.org/T331505 [05:43:45] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: add new LVS host lvs2012 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/917924 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [05:44:31] (03PS2) 10Muehlenhoff: Failover idp.w.o for reboot [dns] - 10https://gerrit.wikimedia.org/r/917852 [05:47:49] (03PS1) 10Marostegui: db2151: Migrat to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/918240 (https://phabricator.wikimedia.org/T334650) [05:48:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2151', diff saved to https://phabricator.wikimedia.org/P48060 and previous config saved to /var/cache/conftool/dbconfig/20230510-054833-root.json [05:49:48] 10SRE, 10Wikimedia-Mailing-lists: Create English Wikiquote admin mailing list - https://phabricator.wikimedia.org/T336293 (10Ferien) [05:50:17] (03CR) 10Marostegui: [C: 03+2] db2151: Migrat to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/918240 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [05:52:18] 10SRE, 10Wikimedia-Mailing-lists: Create English Wikiquote admin mailing list - https://phabricator.wikimedia.org/T336293 (10Ferien) >>! In T336293#8838955, @Ladsgroup wrote: > According to https://meta.wikimedia.org/wiki/Mailing_lists/Standardization the address of that mailing list will be wikiquote-en-admin... 
[05:53:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48061 and previous config saved to /var/cache/conftool/dbconfig/20230510-055300-root.json [05:53:48] (03PS1) 10Marostegui: install_server: Do not reimage db1221 [puppet] - 10https://gerrit.wikimedia.org/r/918241 [05:54:22] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1221 [puppet] - 10https://gerrit.wikimedia.org/r/918241 (owner: 10Marostegui) [05:58:16] (03CR) 10Marostegui: [C: 03+2] Revert "db2180: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/917737 (owner: 10Marostegui) [05:59:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48062 and previous config saved to /var/cache/conftool/dbconfig/20230510-055929-root.json [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T0600) [06:06:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2180', diff saved to https://phabricator.wikimedia.org/P48063 and previous config saved to /var/cache/conftool/dbconfig/20230510-060656-root.json [06:08:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48064 and previous config saved to /var/cache/conftool/dbconfig/20230510-060805-root.json [06:10:56] (03CR) 10Muehlenhoff: "Few comments inline, looks good in general" [software/bitu] - 10https://gerrit.wikimedia.org/r/908769 (owner: 10Slyngshede) [06:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale 
[06:15:24] (03PS4) 10Alexandros Kosiaris: services_proxy: Add machinetranslation [puppet] - 10https://gerrit.wikimedia.org/r/911887 (https://phabricator.wikimedia.org/T331505) [06:15:26] (03PS1) 10Alexandros Kosiaris: machinetranslation: Switch service::catalog to production [puppet] - 10https://gerrit.wikimedia.org/r/918243 (https://phabricator.wikimedia.org/T331505) [06:16:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/915592 (https://phabricator.wikimedia.org/T320806) (owner: 10Slyngshede) [06:18:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Switch service::catalog to production [puppet] - 10https://gerrit.wikimedia.org/r/918243 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [06:18:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] services_proxy: Add machinetranslation [puppet] - 10https://gerrit.wikimedia.org/r/911887 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [06:19:28] (03PS4) 10Alexandros Kosiaris: docker_registry_ha: remove unused cache::nodes ref [puppet] - 10https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: 10BBlack) [06:19:57] (03PS14) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 [06:20:24] (03PS1) 10Slyngshede: mgmt module [software/bitu] - 10https://gerrit.wikimedia.org/r/918245 [06:23:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48065 and previous config saved to /var/cache/conftool/dbconfig/20230510-062309-root.json [06:23:32] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [06:26:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] docker_registry_ha: remove unused 
cache::nodes ref [puppet] - 10https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: 10BBlack) [06:38:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48066 and previous config saved to /var/cache/conftool/dbconfig/20230510-063814-root.json [06:40:12] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp.w.o for reboot [dns] - 10https://gerrit.wikimedia.org/r/917852 (owner: 10Muehlenhoff) [06:41:21] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depool db1112 db1212 T336252', diff saved to https://phabricator.wikimedia.org/P48067 and previous config saved to /var/cache/conftool/dbconfig/20230510-064119-marostegui.json [06:41:24] T336252: Failover s3 sanitarium master - https://phabricator.wikimedia.org/T336252 [06:44:04] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:12] !log dbmaint eqiad failover s3 sanitarium master T336252 [06:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48068 and previous config saved to /var/cache/conftool/dbconfig/20230510-064433-root.json [06:44:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48069 and previous config saved to /var/cache/conftool/dbconfig/20230510-064439-root.json [06:45:56] (03PS1) 10Marostegui: mariadb: Promote db1212 to s3 sanitarium master [puppet] - 10https://gerrit.wikimedia.org/r/918347 (https://phabricator.wikimedia.org/T336252) [06:46:33] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1212 to s3 sanitarium master [puppet] - 
10https://gerrit.wikimedia.org/r/918347 (https://phabricator.wikimedia.org/T336252) (owner: 10Marostegui) [06:48:42] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:53] (03PS1) 10Phedenskog: Enable First Input Delay events. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918348 (https://phabricator.wikimedia.org/T332012) [06:50:48] (03CR) 10Phedenskog: "I think this is what's missing right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918348 (https://phabricator.wikimedia.org/T332012) (owner: 10Phedenskog) [06:52:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2005.codfw.wmnet with OS bookworm [06:52:52] (03CR) 10Volans: "Thanks for the patch, I've made some minor improvement suggestions inline. LMK what do you think." [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 (owner: 10RLazarus) [06:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48070 and previous config saved to /var/cache/conftool/dbconfig/20230510-065319-root.json [06:59:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48071 and previous config saved to /var/cache/conftool/dbconfig/20230510-065938-root.json [06:59:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48072 and previous config saved to /var/cache/conftool/dbconfig/20230510-065944-root.json [07:00:06] Amir1, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T0700). 
[07:00:06] No Gerrit patches in the queue for this window AFAICS. [07:08:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48073 and previous config saved to /var/cache/conftool/dbconfig/20230510-070824-root.json [07:09:00] jouncebot: now [07:09:00] For the next 0 hour(s) and 50 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T0700) [07:14:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48074 and previous config saved to /var/cache/conftool/dbconfig/20230510-071443-root.json [07:14:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48075 and previous config saved to /var/cache/conftool/dbconfig/20230510-071449-root.json [07:23:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48076 and previous config saved to /var/cache/conftool/dbconfig/20230510-072329-root.json [07:23:31] 10SRE, 10Keyholder, 10Release-Engineering-Team (Seen): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10hashar) One can compare the content of https://phabricator.wikimedia.org/source/keyholder.git with https://gerrit.wikimedia.org/r/operations/software/keyholder Also per Fa... 
[07:23:35] (03PS8) 10Slyngshede: Django 3.2 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/905158 [07:23:49] (03CR) 10Slyngshede: Django 3.2 support (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/905158 (owner: 10Slyngshede) [07:24:14] (03PS1) 10Hashar: Merge tag 'v3.5.6' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918352 [07:26:20] (03PS2) 10Hashar: Merge tag 'v3.5.6' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918352 (https://phabricator.wikimedia.org/T336339) [07:28:24] (03PS1) 10Ayounsi: Netbox 3.5: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) [07:28:48] (03CR) 10CI reject: [V: 04-1] Netbox 3.5: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:29:05] (03PS3) 10Hashar: Merge tag 'v3.5.6' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918352 (https://phabricator.wikimedia.org/T336339) [07:29:16] (03PS2) 10Slyngshede: signup: allow blocking of username with regex [software/bitu] - 10https://gerrit.wikimedia.org/r/915592 (https://phabricator.wikimedia.org/T320806) [07:29:27] (03CR) 10Slyngshede: signup: allow blocking of username with regex (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/915592 (https://phabricator.wikimedia.org/T320806) (owner: 10Slyngshede) [07:29:29] (03PS2) 10Ayounsi: Netbox 3.5: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) [07:29:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48077 and previous config saved to /var/cache/conftool/dbconfig/20230510-072948-root.json 
[07:29:52] (03CR) 10CI reject: [V: 04-1] Netbox 3.5: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:29:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48078 and previous config saved to /var/cache/conftool/dbconfig/20230510-072954-root.json [07:31:31] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:31:50] (03CR) 10Ayounsi: "The CI error seems unrelated to this change." [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:34:25] (03PS1) 10Muehlenhoff: Remove check for duplicated ops permissions [puppet] - 10https://gerrit.wikimedia.org/r/918354 [07:38:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48079 and previous config saved to /var/cache/conftool/dbconfig/20230510-073833-root.json [07:39:44] (03CR) 10Jelto: [C: 03+2] miscweb annualreport: update redirect for 2022 report [puppet] - 10https://gerrit.wikimedia.org/r/917814 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto) [07:41:47] (03PS1) 10Marostegui: db2117: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/918355 (https://phabricator.wikimedia.org/T334650) [07:42:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2117 T334650', diff saved to https://phabricator.wikimedia.org/P48080 and previous config saved to /var/cache/conftool/dbconfig/20230510-074237-root.json [07:42:41] T334650: Migrate s6 to MariaDB 10.6 - https://phabricator.wikimedia.org/T334650 [07:43:10] (03CR) 10Marostegui: [C: 03+2] db2117: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/918355 
(https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [07:44:38] (03CR) 10Muehlenhoff: "Mostly a proposal, let me know what you think. Alternatively I can also fix up the check to account for fr-tech-admins" [puppet] - 10https://gerrit.wikimedia.org/r/918354 (owner: 10Muehlenhoff) [07:44:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48081 and previous config saved to /var/cache/conftool/dbconfig/20230510-074452-root.json [07:44:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48082 and previous config saved to /var/cache/conftool/dbconfig/20230510-074458-root.json [07:45:05] (03CR) 10Elukey: [C: 03+1] Failover the kadminserver to krb2002 [puppet] - 10https://gerrit.wikimedia.org/r/917359 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [07:45:08] !log jmm@cumin2002 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox [07:45:22] (03CR) 10Elukey: [C: 03+1] ml-cache: upgrade Cassandra to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/917407 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [07:45:26] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors [07:45:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [07:45:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48083 and previous config saved to /var/cache/conftool/dbconfig/20230510-074555-root.json [07:48:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host testvm2005.codfw.wmnet with OS bookworm [07:49:57] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. 
on all recursors [07:50:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [07:52:03] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:52:35] (03PS1) 10Filippo Giunchedi: hieradata: add eqsin to remote syslog [puppet] - 10https://gerrit.wikimedia.org/r/918358 (https://phabricator.wikimedia.org/T336345) [07:57:38] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/bitu] - 10https://gerrit.wikimedia.org/r/915592 (https://phabricator.wikimedia.org/T320806) (owner: 10Slyngshede) [07:59:17] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/918358 (https://phabricator.wikimedia.org/T336345) (owner: 10Filippo Giunchedi) [07:59:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48084 and previous config saved to /var/cache/conftool/dbconfig/20230510-075957-root.json [08:00:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48085 and previous config saved to /var/cache/conftool/dbconfig/20230510-080003-root.json [08:00:04] hashar and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T0800). 
[08:01:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48086 and previous config saved to /var/cache/conftool/dbconfig/20230510-080100-root.json [08:01:02] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add eqsin to remote syslog [puppet] - 10https://gerrit.wikimedia.org/r/918358 (https://phabricator.wikimedia.org/T336345) (owner: 10Filippo Giunchedi) [08:03:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox [08:04:21] (03PS1) 10Ayounsi: Netbox 3.5: getstats.py [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918359 (https://phabricator.wikimedia.org/T336275) [08:04:36] o/ [08:04:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/918366 [08:04:41] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/918366 (owner: 10TrainBranchBot) [08:05:31] I don't know what this wmf/branch_cut_pretest is for :) [08:07:02] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918360 (https://phabricator.wikimedia.org/T330214) [08:07:04] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918360 (https://phabricator.wikimedia.org/T330214) (owner: 10TrainBranchBot) [08:07:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48087 and previous config saved to /var/cache/conftool/dbconfig/20230510-080736-root.json [08:07:47] (03PS1) 10Hashar: Update Gerrit to v3.5.6 [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918361 (https://phabricator.wikimedia.org/T336339) [08:07:57] 
(03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918360 (https://phabricator.wikimedia.org/T330214) (owner: 10TrainBranchBot) [08:14:54] !log re-enable eqsin remote syslog towards centrallog - T336345 [08:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:58] T336345: No logs are sent to centrallog from eqsin - https://phabricator.wikimedia.org/T336345 [08:15:02] (03PS2) 10Hashar: Update Gerrit to v3.5.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918361 (https://phabricator.wikimedia.org/T336339) [08:15:05] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.8 refs T330214 [08:15:09] T330214: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 [08:15:46] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.5.6' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918352 (https://phabricator.wikimedia.org/T336339) (owner: 10Hashar) [08:16:01] (03CR) 10Hashar: [C: 03+2] Update Gerrit to v3.5.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918361 (https://phabricator.wikimedia.org/T336339) (owner: 10Hashar) [08:16:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48088 and previous config saved to /var/cache/conftool/dbconfig/20230510-081605-root.json [08:18:17] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:01] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.8 refs T330214 (duration: 05m 55s) [08:21:05] T330214: 1.41.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T330214 
[08:21:13] (03Merged) 10jenkins-bot: Merge tag 'v3.5.6' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918352 (https://phabricator.wikimedia.org/T336339) (owner: 10Hashar) [08:21:23] (03Merged) 10jenkins-bot: Update Gerrit to v3.5.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918361 (https://phabricator.wikimedia.org/T336339) (owner: 10Hashar) [08:22:21] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/918366 (owner: 10TrainBranchBot) [08:22:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48089 and previous config saved to /var/cache/conftool/dbconfig/20230510-082240-root.json [08:27:22] (03PS1) 10Marostegui: db2187: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/918363 (https://phabricator.wikimedia.org/T334650) [08:29:23] (03CR) 10Marostegui: [C: 03+2] db2187: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/918363 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [08:30:20] (03CR) 10Jbond: [C: 03+2] sre.hardware.sel: add simple cookbook for querying the SEL (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [08:31:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48090 and previous config saved to /var/cache/conftool/dbconfig/20230510-083109-root.json [08:31:12] (03PS1) 10Volans: reports: accounting fix callers to parent methods [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918365 [08:32:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1137.eqiad.wmnet with reason: Maintenance [08:32:47] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1137.eqiad.wmnet with reason: Maintenance [08:32:52] (03Merged) 10jenkins-bot: sre.hardware.sel: add simple cookbook for querying the SEL [cookbooks] - 10https://gerrit.wikimedia.org/r/902135 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [08:32:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1137 (T335845)', diff saved to https://phabricator.wikimedia.org/P48091 and previous config saved to /var/cache/conftool/dbconfig/20230510-083253-ladsgroup.json [08:35:51] (03CR) 10Volans: [C: 03+1] Remove check for duplicated ops permissions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918354 (owner: 10Muehlenhoff) [08:36:13] (03CR) 10Ayounsi: [C: 03+1] "Can't be more broken than right now :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918365 (owner: 10Volans) [08:37:11] MediaWiki train on group 1 looks good so far [08:37:27] (03PS1) 10Filippo Giunchedi: site: add thanos-fe[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/918387 (https://phabricator.wikimedia.org/T336348) [08:37:30] (03PS1) 10Daniel Kinzler: Enable parser cache warming jobs for parsoid on medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918388 (https://phabricator.wikimedia.org/T329366) [08:37:32] (03CR) 10Volans: [C: 03+2] reports: accounting fix callers to parent methods [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918365 (owner: 10Volans) [08:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48092 and previous config saved to /var/cache/conftool/dbconfig/20230510-083745-root.json [08:38:04] (03Merged) 10jenkins-bot: reports: accounting fix callers to parent methods [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918365 (owner: 10Volans) [08:38:38] (KubernetesAPILatency) firing: High Kubernetes 
API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:39:11] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41100/console" [puppet] - 10https://gerrit.wikimedia.org/r/918387 (https://phabricator.wikimedia.org/T336348) (owner: 10Filippo Giunchedi) [08:39:47] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox-canary [08:39:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T335845)', diff saved to https://phabricator.wikimedia.org/P48093 and previous config saved to /var/cache/conftool/dbconfig/20230510-083948-ladsgroup.json [08:39:51] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox-canary [08:40:09] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox [08:40:15] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [08:40:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol: allow services to be contacted by all cloudlb HAproxy [puppet] - 10https://gerrit.wikimedia.org/r/917904 (https://phabricator.wikimedia.org/T332153) (owner: 10Arturo Borrero Gonzalez) [08:42:30] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41101/console" [puppet] - 10https://gerrit.wikimedia.org/r/918387 (https://phabricator.wikimedia.org/T336348) (owner: 10Filippo Giunchedi) [08:42:47] I am going to upgrade Gerrit from 3.5.5 to 3.5.6 [08:43:01] to address a potential denial of service (and add a few fixes) [08:43:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on 
k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:43:56] 10SRE, 10SRE Observability, 10User-fgiunchedi: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (10elukey) All right the upstream issue has been resolved! Next steps: 1) Upgrade our benthos debian package to the new 4.15.0 u... [08:44:41] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:16] .6 [08:45:18] err :) [08:46:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48094 and previous config saved to /var/cache/conftool/dbconfig/20230510-084614-root.json [08:46:51] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10aborrero) a:05Papaul→03aborrero >>! In T336236#8839386, @Papaul wrote: > @aborrero i went ahead and setup the node in site.pp with role::insetup::wmcs... 
[08:47:11] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: give it proper role [puppet] - 10https://gerrit.wikimedia.org/r/918390 (https://phabricator.wikimedia.org/T336236) [08:48:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [08:48:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [08:48:34] !log deploy1002: git reset `/srv/deployment/gerrit/gerrit` which had bunch of locally modified files for some reason # T336339 [08:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:38] T336339: Upgrade Gerrit from 3.5.5 to 3.5.6 - https://phabricator.wikimedia.org/T336339 [08:49:49] !log hashar@deploy1002 Started deploy [gerrit/gerrit@67ba7ab]: Gerrit to 3.5.6 on gerrit2002 | T336339 [08:49:56] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@67ba7ab]: Gerrit to 3.5.6 on gerrit2002 | T336339 (duration: 00m 07s) [08:50:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: give it proper role [puppet] - 10https://gerrit.wikimedia.org/r/918390 (https://phabricator.wikimedia.org/T336236) (owner: 10Arturo Borrero Gonzalez) [08:51:08] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [08:51:19] 10SRE, 10ops-codfw, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet... 
[08:52:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48095 and previous config saved to /var/cache/conftool/dbconfig/20230510-085250-root.json [08:53:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance [08:53:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance [08:53:29] (03PS5) 10Elukey: Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T335756) [08:53:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T335845)', diff saved to https://phabricator.wikimedia.org/P48096 and previous config saved to /var/cache/conftool/dbconfig/20230510-085330-ladsgroup.json [08:54:32] I am stopping and upgrading Gerrit right now [08:54:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P48097 and previous config saved to /var/cache/conftool/dbconfig/20230510-085455-ladsgroup.json [08:55:52] !log Stopping Gerrit for 3.5.5 > 3.5.6 upgrade T336339 [08:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:55] T336339: Upgrade Gerrit from 3.5.5 to 3.5.6 - https://phabricator.wikimedia.org/T336339 [08:56:21] !log hashar@deploy1002 Started deploy [gerrit/gerrit@67ba7ab]: Gerrit to 3.5.6 on gerrit1001 | T336339 [08:56:30] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@67ba7ab]: Gerrit to 3.5.6 on gerrit1001 | T336339 (duration: 00m 09s) [08:56:53] forgot to deploy first bah [08:57:16] !log hashar@deploy1002 Started deploy [gerrit/gerrit@67ba7ab]: Gerrit to 3.5.6 on gerrit1001 | T336339 [08:57:21] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@67ba7ab]: Gerrit to 
3.5.6 on gerrit1001 | T336339 (duration: 00m 05s) [08:59:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T335845)', diff saved to https://phabricator.wikimedia.org/P48098 and previous config saved to /var/cache/conftool/dbconfig/20230510-085910-ladsgroup.json [08:59:17] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:36] (ProbeDown) firing: (4) Service gerrit1001:443 has failed probes (http_gerrit_tls_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:59:38] (ProbeDown) firing: Service gerrit1001:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1001:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:59:45] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:27] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:55] (03PS1) 10Alexandros Kosiaris: service::proxy: Correctly name machinetranslation [puppet] - 10https://gerrit.wikimedia.org/r/918406 [09:01:17] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 50%: Repooling after upgrade', diff saved to 
https://phabricator.wikimedia.org/P48099 and previous config saved to /var/cache/conftool/dbconfig/20230510-090119-root.json [09:01:31] !log Gerrit restarted at version 3.5.6 | T336339 [09:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:35] T336339: Upgrade Gerrit from 3.5.5 to 3.5.6 - https://phabricator.wikimedia.org/T336339 [09:01:44] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet [09:01:47] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:49] (03PS1) 10Alexandros Kosiaris: cxserver: Enable machintranslation proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918407 (https://phabricator.wikimedia.org/T331505) [09:01:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] service::proxy: Correctly name machinetranslation [puppet] - 10https://gerrit.wikimedia.org/r/918406 (owner: 10Alexandros Kosiaris) [09:02:14] (03CR) 10Stevemunene: Create scap deployment source for product analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912834 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [09:02:29] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:32] (03CR) 10CI reject: [V: 04-1] cxserver: Enable machintranslation proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918407 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [09:02:59] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:03:25] I think this is because of the gerrit maintenance [09:04:18] 10SRE, 10Wikimedia-Mailing-lists: Create English Wikiquote admin mailing list - https://phabricator.wikimedia.org/T336293 (10Ladsgroup) Responded there. [09:04:36] (ProbeDown) resolved: (4) Service gerrit1001:443 has failed probes (http_gerrit_tls_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:04:38] (ProbeDown) resolved: Service gerrit1001:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1001:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:05:20] (03PS1) 10Majavah: P:base::firewall: add non-etcd way to reject traffic [puppet] - 10https://gerrit.wikimedia.org/r/918408 [09:05:23] (03PS2) 10Alexandros Kosiaris: cxserver: Enable machintranslation proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918407 (https://phabricator.wikimedia.org/T331505) [09:05:53] (03CR) 10Elukey: [C: 03+2] Add the VIP settings for the K8s ingress for ml-staging [dns] - 10https://gerrit.wikimedia.org/r/914730 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [09:06:16] (03CR) 10CI reject: [V: 04-1] cxserver: Enable machintranslation proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918407 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [09:06:23] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage [09:07:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48100 and previous config saved to 
/var/cache/conftool/dbconfig/20230510-090755-root.json [09:09:35] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage [09:10:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P48101 and previous config saved to /var/cache/conftool/dbconfig/20230510-091001-ladsgroup.json [09:10:10] (03PS1) 10Elukey: service::catalog: set lvs_setup for k8s-ingress-ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756) [09:10:35] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:12:09] (03PS3) 10Alexandros Kosiaris: cxserver: Enable machintranslation proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918407 (https://phabricator.wikimedia.org/T331505) [09:12:48] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet [09:13:37] (03CR) 10DCausse: [C: 04-1] rdf-streaming-updater: Increase task manager memory alloc (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917935 (https://phabricator.wikimedia.org/T336134) (owner: 10Bking) [09:14:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/918387 (https://phabricator.wikimedia.org/T336348) (owner: 10Filippo Giunchedi) [09:14:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P48102 and previous config saved to /var/cache/conftool/dbconfig/20230510-091417-ladsgroup.json [09:16:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48103 and previous config saved to 
/var/cache/conftool/dbconfig/20230510-091624-root.json [09:17:07] (03PS1) 10Jbond: firewall::extra: add a way to block addresses [puppet] - 10https://gerrit.wikimedia.org/r/918410 [09:18:39] (03PS2) 10Jbond: firewall::extra: add a way to block addresses [puppet] - 10https://gerrit.wikimedia.org/r/918410 [09:19:19] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/917361 (owner: 10Majavah) [09:19:23] (03CR) 10David Caro: [C: 03+2] toolforge: wmcs-package-build: support backports and -tools packages [puppet] - 10https://gerrit.wikimedia.org/r/917361 (owner: 10Majavah) [09:19:48] (03PS2) 10EoghanGaffney: [aphlict] Remove aphlict1001 CNAME [dns] - 10https://gerrit.wikimedia.org/r/917873 (https://phabricator.wikimedia.org/T333452) [09:20:28] (03CR) 10Majavah: [C: 04-1] firewall::extra: add a way to block addresses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [09:20:54] (03CR) 10David Caro: [C: 03+2] toolforge: wmcs-package-build: support .git suffix in URLs [puppet] - 10https://gerrit.wikimedia.org/r/916787 (owner: 10Majavah) [09:22:23] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: enable general perimetral firewall [puppet] - 10https://gerrit.wikimedia.org/r/918411 [09:23:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48104 and previous config saved to /var/cache/conftool/dbconfig/20230510-092259-root.json [09:25:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T335845)', diff saved to https://phabricator.wikimedia.org/P48105 and previous config saved to /var/cache/conftool/dbconfig/20230510-092507-ladsgroup.json [09:25:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1220.eqiad.wmnet with reason: Maintenance [09:25:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 
0:00:00 on db1220.eqiad.wmnet with reason: Maintenance [09:25:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1220 (T335845)', diff saved to https://phabricator.wikimedia.org/P48106 and previous config saved to /var/cache/conftool/dbconfig/20230510-092531-ladsgroup.json [09:27:02] (03CR) 10Effie Mouzeli: [C: 03+1] Enable parser cache warming jobs for parsoid on medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918388 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [09:27:34] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] site: add thanos-fe[12]004 [puppet] - 10https://gerrit.wikimedia.org/r/918387 (https://phabricator.wikimedia.org/T336348) (owner: 10Filippo Giunchedi) [09:28:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918388 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [09:29:00] (03Merged) 10jenkins-bot: Enable parser cache warming jobs for parsoid on medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918388 (https://phabricator.wikimedia.org/T329366) (owner: 10Daniel Kinzler) [09:29:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P48107 and previous config saved to /var/cache/conftool/dbconfig/20230510-092923-ladsgroup.json [09:30:01] !log daniel@deploy1002 Started scap: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]] [09:30:05] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [09:31:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P48108 and previous config saved to /var/cache/conftool/dbconfig/20230510-093128-root.json [09:31:34] !log daniel@deploy1002 daniel: Backport for 
[[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [09:37:13] (03PS3) 10Jbond: firewall::extra: add a way to block addresses [puppet] - 10https://gerrit.wikimedia.org/r/918410 [09:37:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220 (T335845)', diff saved to https://phabricator.wikimedia.org/P48109 and previous config saved to /var/cache/conftool/dbconfig/20230510-093743-ladsgroup.json [09:37:47] (03PS4) 10Jbond: firewall::extra: add a way to block addresses [puppet] - 10https://gerrit.wikimedia.org/r/918410 [09:37:56] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [09:38:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48110 and previous config saved to /var/cache/conftool/dbconfig/20230510-093804-root.json [09:38:12] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]] (duration: 08m 10s) [09:38:15] T329366: Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 [09:39:29] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1004.eqiad.wmnet [09:42:44] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41105/console" [puppet] - 10https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [09:44:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T335845)', diff saved to https://phabricator.wikimedia.org/P48111 and previous config saved to /var/cache/conftool/dbconfig/20230510-094429-ladsgroup.json [09:44:34] 
!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [09:44:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [09:44:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T335845)', diff saved to https://phabricator.wikimedia.org/P48112 and previous config saved to /var/cache/conftool/dbconfig/20230510-094452-ladsgroup.json [09:50:39] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-fe1004.eqiad.wmnet [09:50:41] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2004.codfw.wmnet [09:51:08] (03PS1) 10Jbond: puppet-diffs: move hiera settings to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/918413 [09:51:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T335845)', diff saved to https://phabricator.wikimedia.org/P48113 and previous config saved to /var/cache/conftool/dbconfig/20230510-095130-ladsgroup.json [09:52:32] (03CR) 10Jbond: [C: 03+2] puppet-diffs: move hiera settings to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/918413 (owner: 10Jbond) [09:52:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220', diff saved to https://phabricator.wikimedia.org/P48114 and previous config saved to /var/cache/conftool/dbconfig/20230510-095250-ladsgroup.json [09:53:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48115 and previous config saved to /var/cache/conftool/dbconfig/20230510-095309-root.json [09:54:07] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: stronger openstack_controllers override [puppet] - 10https://gerrit.wikimedia.org/r/918414 (https://phabricator.wikimedia.org/T336236) [09:55:40] (03PS1) 
10Elukey: fastapi-app: change the default port settings from 80 to 8080 [deployment-charts] - 10https://gerrit.wikimedia.org/r/918415 (https://phabricator.wikimedia.org/T330414) [09:57:05] (03PS1) 10Elukey: ml-services: bump docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918416 (https://phabricator.wikimedia.org/T330414) [09:58:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:59:38] PROBLEM - Check systemd state on thanos-fe1004 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-store.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1000) [10:00:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [10:00:20] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1720 days) https://wikitech.wikimedia.org/wiki/Logs [10:00:30] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/918415 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [10:01:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: stronger openstack_controllers override [puppet] - 10https://gerrit.wikimedia.org/r/918414 (https://phabricator.wikimedia.org/T336236) (owner: 10Arturo Borrero Gonzalez) [10:01:45] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host 
thanos-fe2004.codfw.wmnet [10:02:24] (03CR) 10Elukey: [C: 03+2] fastapi-app: change the default port settings from 80 to 8080 [deployment-charts] - 10https://gerrit.wikimedia.org/r/918415 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [10:02:38] PROBLEM - Check systemd state on thanos-fe2004 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-store.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:38] (03CR) 10Elukey: [C: 03+2] ml-services: bump docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918416 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [10:03:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:06:26] (03PS1) 10Ladsgroup: Remove db1113 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/918417 (https://phabricator.wikimedia.org/T336029) [10:06:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P48116 and previous config saved to /var/cache/conftool/dbconfig/20230510-100636-ladsgroup.json [10:07:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220', diff saved to https://phabricator.wikimedia.org/P48117 and previous config saved to /var/cache/conftool/dbconfig/20230510-100756-ladsgroup.json [10:08:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1113.eqiad.wmnet [10:10:15] (03PS1) 10Filippo Giunchedi: thanos: add thanos-fe[12]004 to memcache and conftool [puppet] - 10https://gerrit.wikimedia.org/r/918418 (https://phabricator.wikimedia.org/T336348) [10:11:06] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 
10cloud-services-team, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) [10:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:13:59] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [10:15:40] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: disable HAproxy config for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) [10:16:11] !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1113.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [10:17:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1113.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [10:17:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:17:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1113.eqiad.wmnet [10:19:29] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: install the /etc/keystone/domains directory [puppet] - 10https://gerrit.wikimedia.org/r/918421 [10:19:38] (03CR) 10Majavah: "you could use dnsquery::lookup here I think too" [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [10:19:55] (03CR) 10CI reject: [V: 04-1] openstack: keystone: install the /etc/keystone/domains directory [puppet] - 10https://gerrit.wikimedia.org/r/918421 (owner: 10Arturo Borrero Gonzalez) [10:19:58] 
PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Thanos [10:21:38] !log start of clean up of echo notification in wikidatawiki (T318523) [10:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:41] T318523: Don't send article-linked notifications for bots - https://phabricator.wikimedia.org/T318523 [10:21:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P48118 and previous config saved to /var/cache/conftool/dbconfig/20230510-102142-ladsgroup.json [10:23:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220 (T335845)', diff saved to https://phabricator.wikimedia.org/P48119 and previous config saved to /var/cache/conftool/dbconfig/20230510-102302-ladsgroup.json [10:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [10:24:22] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/918411 (owner: 10Arturo Borrero Gonzalez) [10:25:30] PROBLEM - Thanos swift https on thanos-fe2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Thanos [10:25:46] (03CR) 10Ladsgroup: [C: 03+2] Remove db1113 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/918417 (https://phabricator.wikimedia.org/T336029) (owner: 10Ladsgroup) [10:26:12] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) As the #WMF-Legal project tag was added to this task, some general information to 
avoid wron... [10:26:47] !log Removing db1113 from zarcillo T336029 [10:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:51] T336029: decommission db1113.eqiad.wmnet - https://phabricator.wikimedia.org/T336029 [10:28:42] (03PS2) 10Elukey: service::catalog: set lvs_setup for k8s-ingress-ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756) [10:29:22] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed - https://phabricator.wikimedia.org/T336326 (10jcrespo) @dc-ops What is the feedback for a "CPU0704 CPU 1 machine check error detected." hardware crash. Is there a BIOS update or something we can do about it? [10:29:48] (03PS1) 10Jelto: miscweb annualreport: use wildcard redirect for 2020th report [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) [10:29:58] (03PS1) 10Elukey: profile::services_proxy::envoy: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/918425 [10:30:02] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1113.eqiad.wmnet - https://phabricator.wikimedia.org/T336029 (10Ladsgroup) [10:30:10] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1113.eqiad.wmnet - https://phabricator.wikimedia.org/T336029 (10Ladsgroup) a:05Ladsgroup→03wiki_willy [10:30:15] (03CR) 10Jbond: [C: 03+1] "lgtm optional nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [10:30:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] profile::services_proxy::envoy: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/918425 (owner: 10Elukey) [10:30:51] (03PS3) 10Hnowlan: thumbor: haproxy timeout changes, block /metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/916506 (https://phabricator.wikimedia.org/T334488) [10:34:02] (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: install the /etc/keystone/domains 
directory [puppet] - 10https://gerrit.wikimedia.org/r/918421 [10:34:16] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10jcrespo) [10:34:27] (03CR) 10CI reject: [V: 04-1] openstack: keystone: install the /etc/keystone/domains directory [puppet] - 10https://gerrit.wikimedia.org/r/918421 (owner: 10Arturo Borrero Gonzalez) [10:34:38] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10jcrespo) a:05jcrespo→03None [10:35:05] (03PS3) 10Arturo Borrero Gonzalez: openstack: keystone: install the /etc/keystone/domains directory [puppet] - 10https://gerrit.wikimedia.org/r/918421 [10:35:20] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10jcrespo) [10:36:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: keystone: install the /etc/keystone/domains directory [puppet] - 10https://gerrit.wikimedia.org/r/918421 (owner: 10Arturo Borrero Gonzalez) [10:36:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T335845)', diff saved to https://phabricator.wikimedia.org/P48120 and previous config saved to /var/cache/conftool/dbconfig/20230510-103649-ladsgroup.json [10:36:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [10:37:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [10:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T335845)', diff saved to https://phabricator.wikimedia.org/P48121 and previous config saved to /var/cache/conftool/dbconfig/20230510-103712-ladsgroup.json [10:38:34] !log 
elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [10:42:15] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) @aborrero @dcaro @Andrew I think we are in a position to look at doing this again? I've updated the li... [10:42:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:43:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T335845)', diff saved to https://phabricator.wikimedia.org/P48122 and previous config saved to /var/cache/conftool/dbconfig/20230510-104337-ladsgroup.json [10:45:57] (03CR) 10Hnowlan: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/917840 (https://phabricator.wikimedia.org/T336037) (owner: 10Giuseppe Lavagetto) [10:47:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:48:19] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [10:49:41] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/918354 (owner: 10Muehlenhoff) [10:51:19] (03CR) 10DCausse: [C: 04-1] rdf-streaming-updater: Increase task manager memory alloc (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917935 (https://phabricator.wikimedia.org/T336134) (owner: 10Bking) [10:52:56] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye [10:53:03] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye complete... 
[10:54:15] (03PS2) 10Arturo Borrero Gonzalez: cloudlb: use dnsquery::lookup() [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) [10:54:42] (03CR) 10Arturo Borrero Gonzalez: cloudlb: use dnsquery::lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [10:54:48] (03CR) 10CI reject: [V: 04-1] cloudlb: use dnsquery::lookup() [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [10:55:25] (03CR) 10Jelto: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/917873 (https://phabricator.wikimedia.org/T333452) (owner: 10EoghanGaffney) [10:55:42] (03CR) 10EoghanGaffney: [C: 03+2] [aphlict] Remove aphlict1001 CNAME [dns] - 10https://gerrit.wikimedia.org/r/917873 (https://phabricator.wikimedia.org/T333452) (owner: 10EoghanGaffney) [10:58:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P48123 and previous config saved to /var/cache/conftool/dbconfig/20230510-105843-ladsgroup.json [11:00:20] (03PS1) 10Jelto: gitlab: run backup sync and restore twice daily [puppet] - 10https://gerrit.wikimedia.org/r/918427 (https://phabricator.wikimedia.org/T316935) [11:01:31] (03CR) 10Kamila Součková: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/916506 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [11:02:37] (03CR) 10Jbond: "i cant see why ci is breaking now but not before" [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:03:05] (03PS3) 10Arturo Borrero Gonzalez: cloudlb: use dnsquery::lookup() [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) [11:03:37] (03CR) 10CI reject: [V: 04-1] cloudlb: use dnsquery::lookup() 
[puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:04:32] (03CR) 10Arturo Borrero Gonzalez: cloudlb: use dnsquery::lookup() (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:06:36] (03PS1) 10Jcrespo: dbbackups: Move db1225 backups to db1139 & db1150 [puppet] - 10https://gerrit.wikimedia.org/r/918428 (https://phabricator.wikimedia.org/T336326) [11:10:44] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Move db1225 backups to db1139 & db1150 [puppet] - 10https://gerrit.wikimedia.org/r/918428 (https://phabricator.wikimedia.org/T336326) (owner: 10Jcrespo) [11:11:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2005.codfw.wmnet with OS bookworm [11:11:40] (03PS3) 10Slyngshede: Sphinx: Start work on documentation [software/bitu] - 10https://gerrit.wikimedia.org/r/908769 [11:11:55] (03CR) 10Slyngshede: Sphinx: Start work on documentation (0313 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/908769 (owner: 10Slyngshede) [11:12:09] (03PS1) 10Jcrespo: Revert "db1225: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/918446 [11:13:12] (03CR) 10Jcrespo: [C: 04-2] "No merge until T336326 is resolved or almost fully resolved." 
[puppet] - 10https://gerrit.wikimedia.org/r/918446 (owner: 10Jcrespo) [11:13:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P48124 and previous config saved to /var/cache/conftool/dbconfig/20230510-111349-ladsgroup.json [11:14:20] (03CR) 10Muehlenhoff: [C: 03+2] Remove check for duplicated ops permissions [puppet] - 10https://gerrit.wikimedia.org/r/918354 (owner: 10Muehlenhoff) [11:16:48] (03PS1) 10Muehlenhoff: Remove bastion role from bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/918430 [11:17:15] 10SRE, 10Wikimedia-Mailing-lists: Create English Wikiquote admin mailing list - https://phabricator.wikimedia.org/T336293 (10Lemonaka) Done per request of @Ladsgroup [11:18:29] <_joe_> !log installing vopsbot 0.3.4 on alert1001 T329791 [11:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:33] T329791: Vopsbot doesn't have channel topic rights - https://phabricator.wikimedia.org/T329791 [11:18:54] 10SRE, 10Wikimedia-Mailing-lists: Create English Wikiquote admin mailing list - https://phabricator.wikimedia.org/T336293 (10Ladsgroup) 05Open→03Resolved Done: https://lists.wikimedia.org/postorius/lists/wikiquote-en-admins.lists.wikimedia.org/ Create an account and you can add more people or change any o... [11:20:23] (03CR) 10KartikMistry: [C: 03+1] cxserver: Enable machintranslation proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918407 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [11:23:33] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) I have created a Wikitech account to be used for this purpose. {F36990981,width=70%} Wikit... 
[11:23:48] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) [11:25:51] (03CR) 10Jbond: [V: 03+1] "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [11:26:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage [11:27:34] (03PS9) 10Jbond: gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) [11:28:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T335845)', diff saved to https://phabricator.wikimedia.org/P48125 and previous config saved to /var/cache/conftool/dbconfig/20230510-112855-ladsgroup.json [11:28:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41107/console" [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [11:29:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [11:29:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [11:29:41] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41108/console" [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [11:29:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage [11:31:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 
0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [11:32:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [11:32:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T335845)', diff saved to https://phabricator.wikimedia.org/P48126 and previous config saved to /var/cache/conftool/dbconfig/20230510-113215-ladsgroup.json [11:33:40] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm now, thanks! I'm going to test this change on the WMCS test instance and production first, then we can proceed with enabling OIDC if " [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [11:35:31] (03PS15) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 [11:37:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T335845)', diff saved to https://phabricator.wikimedia.org/P48127 and previous config saved to /var/cache/conftool/dbconfig/20230510-113734-ladsgroup.json [11:37:51] (03PS16) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (https://phabricator.wikimedia.org/T331505) [11:38:24] 10SRE, 10Infrastructure-Foundations: Bitu IDM - Feedback - https://phabricator.wikimedia.org/T335470 (10Aklapper) >>! In T335470#8809866, @SLyngshede-WMF wrote: > @Aklapper are the SRE and Infrastructure-Foundations tags sufficient for now? It's hard to find IDM tasks only when there's no dedicated tag. :) Bu... 
[11:38:30] (03PS4) 10Jbond: cloudlb: use dnsquery::lookup() [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:39:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:39:29] (03CR) 10Jbond: [C: 03+1] "updated" [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:40:40] (03CR) 10KartikMistry: [C: 03+2] Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [11:41:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2005.codfw.wmnet with OS bookworm [11:41:40] (03Merged) 10jenkins-bot: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [11:41:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: use dnsquery::lookup() [puppet] - 10https://gerrit.wikimedia.org/r/918419 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:43:00] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:43:20] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:44:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:46:26] 
!log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:46:56] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:49:24] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:49:58] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:50:31] (03CR) 10Slyngshede: [C: 03+2] signup: allow blocking of username with regex [software/bitu] - 10https://gerrit.wikimedia.org/r/915592 (https://phabricator.wikimedia.org/T320806) (owner: 10Slyngshede) [11:50:33] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] signup: allow blocking of username with regex [software/bitu] - 10https://gerrit.wikimedia.org/r/915592 (https://phabricator.wikimedia.org/T320806) (owner: 10Slyngshede) [11:52:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P48128 and previous config saved to /var/cache/conftool/dbconfig/20230510-115241-ladsgroup.json [11:52:56] (03PS1) 10KartikMistry: MinT: Fix api URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/918433 [11:54:28] (03CR) 10KartikMistry: [C: 03+2] MinT: Fix api URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/918433 (owner: 10KartikMistry) [11:55:12] (03Merged) 10jenkins-bot: MinT: Fix api URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/918433 (owner: 10KartikMistry) [11:56:22] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:56:32] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:57:30] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:57:47] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:57:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: enable general perimetral firewall [puppet] - 10https://gerrit.wikimedia.org/r/918411 (owner: 10Arturo Borrero Gonzalez) 
[11:58:25] (03PS1) 10Slyngshede: C:idm enable signup blocklist [puppet] - 10https://gerrit.wikimedia.org/r/918436 [11:58:31] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:58:47] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:01:31] (03CR) 10Slyngshede: [C: 03+2] C:idm enable signup blocklist [puppet] - 10https://gerrit.wikimedia.org/r/918436 (owner: 10Slyngshede) [12:02:42] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: fix cloud_filter configuration [puppet] - 10https://gerrit.wikimedia.org/r/918437 [12:05:02] (03CR) 10CI reject: [V: 04-1] cloudgw: fix cloud_filter configuration [puppet] - 10https://gerrit.wikimedia.org/r/918437 (owner: 10Arturo Borrero Gonzalez) [12:05:38] 10SRE, 10Bitu, 10Infrastructure-Foundations: Bitu IDM - Feedback - https://phabricator.wikimedia.org/T335470 (10SLyngshede-WMF) > It's hard to find IDM tasks only when there's no dedicated tag. :) But handled in T336155 now! Awesome, thank you. 
[12:06:31] (03CR) 10KartikMistry: [C: 03+2] cxserver: Enable machintranslation proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918407 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [12:06:42] (03PS4) 10KartikMistry: cxserver: Enable machintranslation proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918407 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [12:07:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P48129 and previous config saved to /var/cache/conftool/dbconfig/20230510-120747-ladsgroup.json [12:08:17] (03CR) 10KartikMistry: [C: 03+2] cxserver: Enable machintranslation proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918407 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [12:08:58] (03Merged) 10jenkins-bot: cxserver: Enable machintranslation proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/918407 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [12:10:02] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:10:15] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:11:57] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:12:16] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:13:24] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:13:39] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:17:09] (03PS1) 10Jbond: cloudlb: drop force_ipv6 param to lookup as its currently borked [puppet] - 10https://gerrit.wikimedia.org/r/918439 [12:17:19] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add cloud_private_subnet placeholder data [puppet] - 10https://gerrit.wikimedia.org/r/918440 [12:18:08] (03CR) 10Arturo Borrero 
Gonzalez: [C: 03+1] cloudlb: drop force_ipv6 param to lookup as its currently borked [puppet] - 10https://gerrit.wikimedia.org/r/918439 (owner: 10Jbond) [12:19:34] (03CR) 10Jbond: [C: 03+2] cloudlb: drop force_ipv6 param to lookup as its currently borked [puppet] - 10https://gerrit.wikimedia.org/r/918439 (owner: 10Jbond) [12:19:46] (03CR) 10Jgreen: [C: 03+2] Add monitoring for new fr-tech hosts [puppet] - 10https://gerrit.wikimedia.org/r/916617 (https://phabricator.wikimedia.org/T334505) (owner: 10Dwisehaupt) [12:19:48] (03PS1) 10KartikMistry: cxserver: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/918441 (https://phabricator.wikimedia.org/T331505) [12:20:17] (03CR) 10ArielGlenn: dumps::distribution::ferm: update to resolve hosts in puppetmaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [12:22:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T335845)', diff saved to https://phabricator.wikimedia.org/P48131 and previous config saved to /var/cache/conftool/dbconfig/20230510-122253-ladsgroup.json [12:22:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:23:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:23:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T335845)', diff saved to https://phabricator.wikimedia.org/P48132 and previous config saved to /var/cache/conftool/dbconfig/20230510-122316-ladsgroup.json [12:23:21] (03CR) 10KartikMistry: [C: 03+2] cxserver: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/918441 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [12:24:04] (03Merged) 10jenkins-bot: cxserver: Bump chart version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/918441 (https://phabricator.wikimedia.org/T331505) (owner: 10KartikMistry) [12:26:50] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: fix cloud_filter configuration [puppet] - 10https://gerrit.wikimedia.org/r/918437 [12:27:52] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:27:58] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:28:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T335845)', diff saved to https://phabricator.wikimedia.org/P48133 and previous config saved to /var/cache/conftool/dbconfig/20230510-122828-ladsgroup.json [12:29:25] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:29:33] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:29:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: fix cloud_filter configuration [puppet] - 10https://gerrit.wikimedia.org/r/918437 (owner: 10Arturo Borrero Gonzalez) [12:30:24] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Thanos [12:30:44] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:30:49] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:32:32] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: add cloud_private_subnet placeholder data [puppet] - 10https://gerrit.wikimedia.org/r/918440 [12:34:42] (03PS2) 10Filippo Giunchedi: thanos: add thanos-fe[12]004 to memcache and conftool [puppet] - 10https://gerrit.wikimedia.org/r/918418 (https://phabricator.wikimedia.org/T336348) [12:34:44] (03PS1) 10Filippo Giunchedi: hieradata: force ipv4 for thanos tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/918443 (https://phabricator.wikimedia.org/T336348) [12:35:17] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC NOOP: 
https://puppet-compiler.wmflabs.org/output/918440/41112/" [puppet] - 10https://gerrit.wikimedia.org/r/918440 (owner: 10Arturo Borrero Gonzalez) [12:35:45] (03CR) 10Filippo Giunchedi: "Newly provisioned thanos-fe hosts come with ipv6 by default, whereas older hosts don't. Hence the reason we haven't run into this before" [puppet] - 10https://gerrit.wikimedia.org/r/918443 (https://phabricator.wikimedia.org/T336348) (owner: 10Filippo Giunchedi) [12:36:02] RECOVERY - Check systemd state on thanos-fe1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:42] (03CR) 10Btullis: [C: 03+1] "Looks good. Many thanks." [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [12:43:06] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe1004 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 200 OK https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:43:23] that's me ^ [12:43:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P48134 and previous config saved to /var/cache/conftool/dbconfig/20230510-124334-ladsgroup.json [12:44:04] (03PS1) 10Btullis: Update the container image used to run datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/918466 (https://phabricator.wikimedia.org/T329514) [12:45:56] (03CR) 10Ottomata: "Very nice! Nits in line!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [12:48:28] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: cloudgw: review security policy for edge network - https://phabricator.wikimedia.org/T336368 (10aborrero) [12:48:55] 10SRE, 10Data-Engineering, 10Security: Use user-specific passwords for accessing Analytics MariaDB replica databases - https://phabricator.wikimedia.org/T120532 (10Ottomata) [12:49:59] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: cloudgw: review security policy for edge network - https://phabricator.wikimedia.org/T336368 (10aborrero) [12:51:09] (03CR) 10Klausman: [C: 03+1] service::catalog: set lvs_setup for k8s-ingress-ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [12:51:58] (03CR) 10Klausman: [C: 03+1] ml-cache: upgrade Cassandra to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/917407 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [12:52:57] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp2002.wikimedia.org [12:54:39] (03CR) 10Btullis: [C: 03+1] Create scap deployment source for product analytics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/912834 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [12:56:54] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2002.wikimedia.org [12:58:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P48135 and previous config saved to /var/cache/conftool/dbconfig/20230510-125840-ladsgroup.json [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I ♥ Unicode. All rise for UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1300). [13:00:05] Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:02:12] (03CR) 10Stevemunene: [C: 03+2] Add analytics_product admin group for airflow [puppet] - 10https://gerrit.wikimedia.org/r/914788 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [13:03:04] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host netbox-dev2002.codfw.wmnet [13:06:59] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2002.codfw.wmnet [13:08:32] I'm here :) [13:09:52] (03CR) 10Muehlenhoff: "Actually this patch needs additional work still:" [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [13:10:07] let's see [13:10:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917415 (https://phabricator.wikimedia.org/T336193) (owner: 10Superpes15) [13:11:01] (03CR) 10Majavah: "This works otherwise great, but I couldn't manage to set the _rejects variable in a local labs/private commit." 
[puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [13:11:13] (03Merged) 10jenkins-bot: [arwikisource] Replace the current logo with an identical HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/917415 (https://phabricator.wikimedia.org/T336193) (owner: 10Superpes15) [13:11:40] !log taavi@deploy1002 Started scap: Backport for [[gerrit:917415|[arwikisource] Replace the current logo with an identical HD version (T336193)]] [13:11:44] T336193: Update ar.wikisource.org logo - https://phabricator.wikimedia.org/T336193 [13:12:30] PROBLEM - Bird Internet Routing Daemon on cloudlb2001-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:12:48] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:13:02] PROBLEM - haproxy alive on cloudlb2001-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [13:13:10] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudlb2001-dev is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:13:13] !log taavi@deploy1002 superpes and taavi: Backport for [[gerrit:917415|[arwikisource] Replace the current logo with an identical HD version (T336193)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:13:17] Superpes: please test [13:13:18] Looking [13:13:36] PROBLEM - haproxy process on cloudlb2001-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [13:13:38] It's fine! 
thanks :) [13:13:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T335845)', diff saved to https://phabricator.wikimedia.org/P48136 and previous config saved to /var/cache/conftool/dbconfig/20230510-131347-ladsgroup.json [13:13:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [13:14:01] great! syncing [13:14:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [13:14:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T335845)', diff saved to https://phabricator.wikimedia.org/P48137 and previous config saved to /var/cache/conftool/dbconfig/20230510-131412-ladsgroup.json [13:14:15] (03CR) 10Jbond: firewall::extra: add a way to block addresses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [13:15:40] PROBLEM - Check systemd state on cloudlb2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:55] (03PS5) 10Jbond: firewall::extra: add a way to block addresses [puppet] - 10https://gerrit.wikimedia.org/r/918410 [13:15:55] <_joe_> !log rolling back vopsbot to 0.3.3 [13:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:08] (03CR) 10Jbond: firewall::extra: add a way to block addresses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [13:19:41] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:917415|[arwikisource] Replace the current logo with an identical HD version (T336193)]] (duration: 08m 00s) [13:19:45] T336193: Update ar.wikisource.org logo - https://phabricator.wikimedia.org/T336193 [13:21:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T335845)', diff saved to 
https://phabricator.wikimedia.org/P48138 and previous config saved to /var/cache/conftool/dbconfig/20230510-132126-ladsgroup.json [13:21:41] @Taavi Many thanks :) [13:21:48] yw [13:25:06] (03PS1) 10Vgutierrez: traffic: Filter cp|dns instances on HAProxy alerts [alerts] - 10https://gerrit.wikimedia.org/r/918471 [13:25:08] (03PS5) 10EoghanGaffney: [gitlab/failover] Add rollback method [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 [13:26:04] (03CR) 10EoghanGaffney: [gitlab/failover] Add rollback method (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 (owner: 10EoghanGaffney) [13:27:22] (03PS6) 10EoghanGaffney: [gitlab/failover] Add rollback method [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 [13:30:40] (03CR) 10Btullis: [C: 03+1] "Looks good to me. I don't believe that the iceberg jar is available yet, but it should be soon." [puppet] - 10https://gerrit.wikimedia.org/r/914928 (https://phabricator.wikimedia.org/T335721) (owner: 10Xcollazo) [13:30:42] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: cloudgw: review security policy for edge network - https://phabricator.wikimedia.org/T336368 (10cmooney) @aborrero my apologies I messed up the vlan list for cloudgw2002. Should be ok now. ` cmooney@cloudsw1-b1-codfw> show arp no-resol... 
[13:31:38] (03CR) 10Muehlenhoff: [C: 03+2] Failover the kadminserver to krb2002 [puppet] - 10https://gerrit.wikimedia.org/r/917359 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [13:35:25] (03PS6) 10Jbond: firewall::extra: add a way to block addresses [puppet] - 10https://gerrit.wikimedia.org/r/918410 [13:35:56] PROBLEM - Kerberos Kpropd daemon on krb2002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/kpropd https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [13:36:08] PROBLEM - Kerberos KAdmin daemon on krb1001 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/kadmind https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [13:36:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P48139 and previous config saved to /var/cache/conftool/dbconfig/20230510-133632-ladsgroup.json [13:36:41] (03PS2) 10Filippo Giunchedi: hieradata: force ipv4 for thanos tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/918443 (https://phabricator.wikimedia.org/T336348) [13:36:44] (03PS3) 10Filippo Giunchedi: thanos: add thanos-fe[12]004 to memcache and conftool [puppet] - 10https://gerrit.wikimedia.org/r/918418 (https://phabricator.wikimedia.org/T336348) [13:36:55] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: cloudgw: review security policy for edge network - https://phabricator.wikimedia.org/T336368 (10cmooney) @aborrero re-reading the description it sounds like there may be some other issues? Let me know if there is anything specific, the... 
[13:38:58] RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:40:14] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41113/console" [puppet] - 10https://gerrit.wikimedia.org/r/918443 (https://phabricator.wikimedia.org/T336348) (owner: 10Filippo Giunchedi) [13:40:54] (03PS1) 10Andrew Bogott: haproxy: increase http check interval [puppet] - 10https://gerrit.wikimedia.org/r/918475 (https://phabricator.wikimedia.org/T336379) [13:42:48] (03PS2) 10Andrew Bogott: haproxy: increase http check interval [puppet] - 10https://gerrit.wikimedia.org/r/918475 (https://phabricator.wikimedia.org/T336379) [13:43:20] <_joe_> !issync [13:44:30] (03PS1) 10David Caro: toolforge_cli: add api gateway url and builds endpoint [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) [13:44:37] (03PS1) 10Volans: reports.network: optimize queries to avoid timeout [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918479 [13:45:01] (03CR) 10CI reject: [V: 04-1] toolforge_cli: add api gateway url and builds endpoint [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) (owner: 10David Caro) [13:45:09] (03CR) 10David Caro: toolforge_cli: add api gateway url and builds endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) (owner: 10David Caro) [13:45:11] (03CR) 10CI reject: [V: 04-1] reports.network: optimize queries to avoid timeout [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918479 (owner: 10Volans) [13:45:26] (03PS1) 10Volans: reports.network: fix missing 'f' for f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918480 [13:45:43] (03PS2) 10David Caro: 
toolforge_cli: add api gateway url and builds endpoint [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) [13:45:45] (03CR) 10David Caro: toolforge_cli: add api gateway url and builds endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) (owner: 10David Caro) [13:45:59] (03CR) 10CI reject: [V: 04-1] reports.network: fix missing 'f' for f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918480 (owner: 10Volans) [13:46:09] (03CR) 10CI reject: [V: 04-1] toolforge_cli: add api gateway url and builds endpoint [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) (owner: 10David Caro) [13:46:15] <_joe_> !issync [13:46:16] Syncing #wikimedia-operations (requested by joe_oblivian) [13:46:17] Set /cs flags #wikimedia-operations sirenbot +Aitv [13:46:19] (03PS7) 10Jbond: firewall::extra: add a way to block addresses [puppet] - 10https://gerrit.wikimedia.org/r/918410 [13:47:45] (03CR) 10Eevans: [C: 03+1] hieradata: force ipv4 for thanos tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/918443 (https://phabricator.wikimedia.org/T336348) (owner: 10Filippo Giunchedi) [13:48:06] (03PS3) 10David Caro: toolforge_cli: add api gateway url and builds endpoint [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) [13:48:52] (03PS1) 10Hashar: gerrit: manage dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/918481 [13:48:58] (03PS1) 10Hashar: scap: use gerrit dsh group from deployment server [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918482 [13:49:06] (03PS1) 10Elukey: fastapi-app: use port 8080 for probes as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/918483 (https://phabricator.wikimedia.org/T330414) [13:50:04] (03PS2) 10Volans: reports.network: optimize queries to avoid timeout [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/918479 [13:50:06] (03PS2) 10Volans: reports.network: fix missing 'f' for f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918480 [13:50:37] (03CR) 10Elukey: [C: 03+2] fastapi-app: use port 8080 for probes as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/918483 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [13:51:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P48140 and previous config saved to /var/cache/conftool/dbconfig/20230510-135138-ladsgroup.json [13:55:16] (03PS1) 10Elukey: fastapi-app: change app's port as well to 8080 [deployment-charts] - 10https://gerrit.wikimedia.org/r/918484 (https://phabricator.wikimedia.org/T330414) [13:55:51] (03PS3) 10Volans: reports.network: optimize queries to avoid timeout [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918479 [13:55:54] (03PS3) 10Volans: reports.network: fix missing 'f' for f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918480 [13:57:24] (03CR) 10Elukey: [C: 03+2] fastapi-app: change app's port as well to 8080 [deployment-charts] - 10https://gerrit.wikimedia.org/r/918484 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [13:59:30] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Vopsbot doesn't have channel topic rights - https://phabricator.wikimedia.org/T329791 (10Joe) 05Open→03Resolved a:03Joe This is solved thanks to @Legoktm's patch. [14:01:43] (03CR) 10Cathal Mooney: [C: 03+1] "Nice! I wasn't aware of those prefetch operations very useful. Thanks :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918479 (owner: 10Volans) [14:02:46] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[14:03:48] (03CR) 10Hnowlan: [C: 03+2] thumbor: haproxy timeout changes, block /metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/916506 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:04:31] (03PS4) 10Cathal Mooney: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) [14:04:44] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: omniauth_sync_profile_attributes shuold be a list [puppet] - 10https://gerrit.wikimedia.org/r/916516 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [14:04:48] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond) [14:05:06] (03CR) 10Volans: [C: 03+2] reports.network: optimize queries to avoid timeout [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918479 (owner: 10Volans) [14:05:25] (03CR) 10Volans: [C: 03+2] "trivial, self-merging" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918480 (owner: 10Volans) [14:05:41] (03Merged) 10jenkins-bot: reports.network: optimize queries to avoid timeout [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918479 (owner: 10Volans) [14:05:57] (03CR) 10CI reject: [V: 04-1] Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney) [14:06:02] (03Merged) 10jenkins-bot: reports.network: fix missing 'f' for f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918480 (owner: 10Volans) [14:06:10] (03Merged) 10jenkins-bot: thumbor: haproxy timeout changes, block /metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/916506 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:06:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T335845)', diff saved 
to https://phabricator.wikimedia.org/P48142 and previous config saved to /var/cache/conftool/dbconfig/20230510-140644-ladsgroup.json [14:06:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [14:07:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [14:07:03] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox-canary [14:07:07] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox-canary [14:07:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T335845)', diff saved to https://phabricator.wikimedia.org/P48143 and previous config saved to /var/cache/conftool/dbconfig/20230510-140708-ladsgroup.json [14:07:46] (03CR) 10Cathal Mooney: "Thanks for the review @fgiunchedi. Both points make sense. I've tried to update to take those on board, but for some reason the unit tes" [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney) [14:08:13] (03PS8) 10Jbond: wmcs::firewall: add a way to block addresses in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/918410 [14:08:17] (03CR) 10Ottomata: Add flink-app default log config and use it in page_content_change (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [14:08:39] (03CR) 10Andrew Bogott: [C: 03+2] haproxy: increase http check interval [puppet] - 10https://gerrit.wikimedia.org/r/918475 (https://phabricator.wikimedia.org/T336379) (owner: 10Andrew Bogott) [14:08:43] (03PS5) 10Muehlenhoff: Add a generic Cassandra reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 [14:08:48] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox 
[14:08:49] (03CR) 10Ottomata: Add flink-app default log config and use it in page_content_change (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [14:08:55] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox [14:12:38] (03CR) 10Majavah: "few nits, feel free to fix or ignore" [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) (owner: 10David Caro) [14:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:13:18] (03PS1) 10AOkoth: site: revert vrts2001 role post re-image [puppet] - 10https://gerrit.wikimedia.org/r/918486 (https://phabricator.wikimedia.org/T323515) [14:14:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T335845)', diff saved to https://phabricator.wikimedia.org/P48144 and previous config saved to /var/cache/conftool/dbconfig/20230510-141427-ladsgroup.json [14:14:58] (03PS1) 10Andrew Bogott: Keystone: double the number of worker procs. 
[puppet] - 10https://gerrit.wikimedia.org/r/918488 (https://phabricator.wikimedia.org/T336379) [14:15:06] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [14:15:15] (03CR) 10David Caro: toolforge_cli: add api gateway url and builds endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) (owner: 10David Caro) [14:15:38] (ProbeDown) firing: (2) Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:01] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: double the number of worker procs. [puppet] - 10https://gerrit.wikimedia.org/r/918488 (https://phabricator.wikimedia.org/T336379) (owner: 10Andrew Bogott) [14:16:09] (03CR) 10Cathal Mooney: [C: 03+1] reports.network: optimize queries to avoid timeout (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918479 (owner: 10Volans) [14:16:17] (03PS2) 10Bking: rdf-streaming-updater: Increase task manager memory alloc [deployment-charts] - 10https://gerrit.wikimedia.org/r/917935 (https://phabricator.wikimedia.org/T336134) [14:16:26] (03CR) 10Majavah: toolforge_cli: add api gateway url and builds endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) (owner: 10David Caro) [14:18:33] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10bchoo) WMF Legal reviewed the contract on file for Bishop Fox and their employees should be covered u... 
[14:19:09] (03CR) 10Majavah: "one q, otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [14:20:38] (ProbeDown) resolved: (2) Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:23:23] (03CR) 10Muehlenhoff: "Ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [14:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [14:23:34] (03PS1) 10Giuseppe Lavagetto: cxserver: update to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/918489 [14:24:08] 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10elukey) @Jclark-ctr Hi! 
Lemme know if you have some times during the next days (even next week, not urgent) to move one GPU over to ml-serve :) [14:25:07] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [14:25:14] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:25:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cxserver: update to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/918489 (owner: 10Giuseppe Lavagetto) [14:26:24] jouncebot: nowandnext [14:26:24] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [14:26:24] In 0 hour(s) and 3 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1430) [14:26:30] !log gerrit1003 switchover happening [14:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:44] (03Merged) 10jenkins-bot: cxserver: update to mesh.configuration 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/918489 (owner: 10Giuseppe Lavagetto) [14:27:28] ...maybe :) [14:28:23] (03PS1) 10Hnowlan: thumbor: fix indentation [deployment-charts] - 10https://gerrit.wikimedia.org/r/918494 [14:28:39] (03PS1) 10Volans: reports.network: better variable naming [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918495 [14:29:20] * sukhe will wait for the gerrit switchove [14:29:20] r [14:29:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P48145 and previous config saved to /var/cache/conftool/dbconfig/20230510-142934-ladsgroup.json [14:29:47] (03CR) 10Cathal Mooney: [C: 03+1] "Reads better I think. Sorry for the hassle and thanks!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918495 (owner: 10Volans) [14:29:57] (03PS1) 10Giuseppe Lavagetto: cxserver: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/918497 [14:30:04] sukhe: #bothumor I ♥ Unicode. 
All rise for LVS maintenance deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1430). [14:30:12] (03CR) 10Jbond: [V: 03+1 C: 03+2] dumps::distribution::ferm: update to resolve hosts in puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [14:30:14] thanks sukhe and sorry, I remembered to send an email, but neglected my own deployment calendar :( [14:30:47] thcipriani: np! I have a two hour slot so it should be fine [14:30:49] go ahead please [14:30:57] (03PS2) 10Volans: reports.network: better variable naming [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918495 [14:31:00] cool, we're getting started [14:31:02] (03PS4) 10David Caro: toolforge_cli: add api gateway url and builds endpoint [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) [14:31:05] (03CR) 10David Caro: toolforge_cli: add api gateway url and builds endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) (owner: 10David Caro) [14:33:44] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: force ipv4 for thanos tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/918443 (https://phabricator.wikimedia.org/T336348) (owner: 10Filippo Giunchedi) [14:33:50] (03CR) 10Jbond: [V: 03+1 C: 03+2] dumps::distribution::ferm: update to resolve hosts in puppetmaster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [14:34:03] jbond: merged your patch too [14:34:17] godog: yes please [14:34:18] 10SRE, 10Wikimedia-Mailing-lists: Create English Wikiquote admin mailing list - https://phabricator.wikimedia.org/T336293 (10Lemonaka) Sure, done. 
[14:34:19] thanks [14:34:21] even :) [14:36:49] (03PS1) 10Urbanecm: [Growth] Add mediawiki.mentor_dashboard.interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918500 (https://phabricator.wikimedia.org/T325117) [14:37:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on gerrit1001.wikimedia.org with reason: migration [14:37:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on gerrit1001.wikimedia.org with reason: migration [14:37:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1003.wikimedia.org with reason: migration [14:37:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1003.wikimedia.org with reason: migration [14:37:53] RECOVERY - Check systemd state on thanos-fe2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:15] RECOVERY - Thanos swift https on thanos-fe2004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Thanos [14:40:13] !log stopping gerrit on gerrit1003 [14:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:38] !log stopping gerrit on gerrit1001 [14:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:16] !log installing libxml2 security updates on buster [14:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:33] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:51] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:01] 
PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:33] ^ expected with gerrit stopped [14:44:40] !log restarting FPM/Apache on mw canaries to pick up libxml2 updates [14:44:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P48146 and previous config saved to /var/cache/conftool/dbconfig/20230510-144440-ladsgroup.json [14:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:43] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:27] ACKNOWLEDGEMENT - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service daniel_zahn migration https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:27] ACKNOWLEDGEMENT - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service daniel_zahn migration https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:27] ACKNOWLEDGEMENT - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service daniel_zahn migration https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:27] ACKNOWLEDGEMENT - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service daniel_zahn migration https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:45] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:54] (JobUnavailable) firing: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:19] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:01] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:08] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops, 10serviceops-collab: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666 (10LSobanski) [14:53:17] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:32] (JobUnavailable) resolved: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:32] (03CR) 10Thcipriani: [C: 03+1] gerrit: switch service IP, turn new into current and current into old [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [14:53:39] (03CR) 10Dzahn: [C: 03+2] gerrit: switch service IP, turn new into current and current into old [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [14:53:43] (03PS7) 10Dzahn: gerrit: switch service IP, turn new into current and current into old [dns] - 
10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T326368) [14:53:53] (03CR) 10Dzahn: [V: 03+2] gerrit: switch service IP, turn new into current and current into old [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [14:54:01] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:07] (03PS1) 10Alexandros Kosiaris: service::proxy: Add uses_ingress parameter for machinetranslation [puppet] - 10https://gerrit.wikimedia.org/r/918506 [14:54:23] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:31] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:03] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] service::proxy: Add uses_ingress parameter for machinetranslation [puppet] - 10https://gerrit.wikimedia.org/r/918506 (owner: 10Alexandros Kosiaris) [14:57:44] (03CR) 10Eevans: [C: 03+1] "Insofar as I understand all of this, LGTM 😊" [puppet] - 10https://gerrit.wikimedia.org/r/918418 (https://phabricator.wikimedia.org/T336348) (owner: 10Filippo Giunchedi) [14:58:28] (03PS1) 10Ssingh: varnish: bump size of varnish shared memory log to 160M (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/918507 (https://phabricator.wikimedia.org/T253093) [14:58:40] !log install vopsbot 0.3.4 on alert2001 T329791 [14:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:45] T329791: Vopsbot doesn't have channel topic rights - 
https://phabricator.wikimedia.org/T329791 [14:59:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T335845)', diff saved to https://phabricator.wikimedia.org/P48147 and previous config saved to /var/cache/conftool/dbconfig/20230510-145946-ladsgroup.json [14:59:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [15:00:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [15:00:09] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41114/console" [puppet] - 10https://gerrit.wikimedia.org/r/918507 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [15:00:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T335845)', diff saved to https://phabricator.wikimedia.org/P48148 and previous config saved to /var/cache/conftool/dbconfig/20230510-150009-ladsgroup.json [15:00:17] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [15:00:38] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [15:02:42] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) [15:06:41] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Eevans) Per a discussion with @gmodena on IRC, I'll create an account named !... 
[15:07:04] (03PS1) 10Alexandros Kosiaris: cxserver: mesh configuration updated [deployment-charts] - 10https://gerrit.wikimedia.org/r/918509 (https://phabricator.wikimedia.org/T331505) [15:08:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] cxserver: mesh configuration updated [deployment-charts] - 10https://gerrit.wikimedia.org/r/918509 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [15:08:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T335845)', diff saved to https://phabricator.wikimedia.org/P48149 and previous config saved to /var/cache/conftool/dbconfig/20230510-150838-ladsgroup.json [15:09:05] (03Merged) 10jenkins-bot: cxserver: mesh configuration updated [deployment-charts] - 10https://gerrit.wikimedia.org/r/918509 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [15:10:04] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1113.eqiad.wmnet - https://phabricator.wikimedia.org/T336029 (10wiki_willy) a:05wiki_willy→03Jclark-ctr [15:10:26] (03PS1) 10Dzahn: Revert "gerrit: switch service IP, turn new into current and current into old" [dns] - 10https://gerrit.wikimedia.org/r/918527 [15:11:00] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [15:11:28] (03CR) 10Dzahn: [C: 03+2] Revert "gerrit: switch service IP, turn new into current and current into old" [dns] - 10https://gerrit.wikimedia.org/r/918527 (owner: 10Dzahn) [15:12:12] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [15:12:28] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [15:14:11] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [15:14:19] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [15:16:07] !log akosiaris@deploy1002 helmfile [codfw] START 
helmfile.d/services/cxserver: apply [15:16:38] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [15:17:37] !log running authdns-update for CR 918527 [15:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:48] (03CR) 10Thcipriani: [C: 03+1] gerrit: manage dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/918481 (owner: 10Hashar) [15:21:21] (03CR) 10Dzahn: [C: 03+2] gerrit: manage dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/918481 (owner: 10Hashar) [15:23:34] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [15:23:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P48150 and previous config saved to /var/cache/conftool/dbconfig/20230510-152345-ladsgroup.json [15:24:15] 10SRE, 10ops-knams, 10DC-Ops: Q4:knams: PDU installation - https://phabricator.wikimedia.org/T334280 (10RobH) 05Open→03Stalled Stalled until on-site work begins in Q1. 
[15:24:42] (03PS1) 10Andrew Bogott: nova-api: increase the number of nova-api workers 3x [puppet] - 10https://gerrit.wikimedia.org/r/918515 (https://phabricator.wikimedia.org/T336379) [15:25:26] (03CR) 10Andrew Bogott: [C: 03+2] nova-api: increase the number of nova-api workers 3x [puppet] - 10https://gerrit.wikimedia.org/r/918515 (https://phabricator.wikimedia.org/T336379) (owner: 10Andrew Bogott) [15:25:39] PROBLEM - MariaDB Replica Lag: s3 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 918.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:28:13] (03CR) 10Dzahn: [C: 03+1] site: revert vrts2001 role post re-image [puppet] - 10https://gerrit.wikimedia.org/r/918486 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [15:28:22] (03CR) 10Alexandros Kosiaris: [C: 03+1] rdf-streaming-updater: Increase task manager memory alloc [deployment-charts] - 10https://gerrit.wikimedia.org/r/917935 (https://phabricator.wikimedia.org/T336134) (owner: 10Bking) [15:30:28] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) [15:32:32] for now we are not working on gerrit anymore. it should be normal [15:32:38] thanks mutante! [15:33:01] sukhe: LVS work can start now:) ty for help [15:33:12] jouncebot: nowandnext [15:33:12] For the next 0 hour(s) and 56 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1430) [15:33:12] In 1 hour(s) and 26 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1700) [15:33:13] thanks! 
[15:33:46] !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 [15:33:51] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [15:34:44] (03PS3) 10Ssingh: depool codfw (emergency patch, do not merge) [dns] - 10https://gerrit.wikimedia.org/r/914343 (https://phabricator.wikimedia.org/T335777) [15:34:49] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: haproxy: drop support for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/918517 (https://phabricator.wikimedia.org/T324992) [15:35:15] (03CR) 10Jbond: wmcs::firewall: add a way to block addresses in wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918410 (owner: 10Jbond) [15:35:18] (03PS1) 10Ayounsi: Netbox 3.5: multiple cable terminations and endpoints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918518 (https://phabricator.wikimedia.org/T336275) [15:35:29] (03PS1) 10Btullis: Add an ldap_only user for bishopfox [puppet] - 10https://gerrit.wikimedia.org/r/918519 (https://phabricator.wikimedia.org/T336357) [15:35:52] (03CR) 10Jbond: [V: 03+1 C: 03+2] "FYi this has been merged hopefully we dont see any issues" [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [15:35:54] (03CR) 10BCornwall: [C: 03+1] varnish: bump size of varnish shared memory log to 160M (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/918507 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [15:36:22] (03CR) 10Ssingh: [C: 03+2] lvs2012: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/917922 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [15:36:38] (03CR) 10CI reject: [V: 04-1] Netbox 3.5: multiple cable terminations and endpoints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918518 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:37:23] !log 
sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [15:37:40] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [15:38:17] (03PS2) 10Ayounsi: Netbox 3.5: multiple cable terminations and endpoints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918518 (https://phabricator.wikimedia.org/T336275) [15:38:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P48151 and previous config saved to /var/cache/conftool/dbconfig/20230510-153851-ladsgroup.json [15:41:31] (03CR) 10Btullis: [C: 03+2] Update the container image used to run datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/918466 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:42:29] (03Merged) 10jenkins-bot: Update the container image used to run datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/918466 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [15:42:34] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2012.codfw.wmnet with OS bullseye [15:42:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye executed w... 
[15:42:56] (03PS1) 10Andrew Bogott: Openstack eqiad1 galera: make cloudcontrol1007 the database primary [puppet] - 10https://gerrit.wikimedia.org/r/918522 (https://phabricator.wikimedia.org/T336379) [15:42:58] (03CR) 10Arturo Borrero Gonzalez: cloudlb: haproxy: drop support for IPv6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918517 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:43:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [15:43:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [15:44:07] (03PS2) 10Andrew Bogott: Openstack eqiad1 galera: make cloudcontrol1006 the database primary [puppet] - 10https://gerrit.wikimedia.org/r/918522 (https://phabricator.wikimedia.org/T336379) [15:45:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Openstack eqiad1 galera: make cloudcontrol1006 the database primary [puppet] - 10https://gerrit.wikimedia.org/r/918522 (https://phabricator.wikimedia.org/T336379) (owner: 10Andrew Bogott) [15:45:19] (03CR) 10Jbond: [C: 03+1] "lgtm possibly run pcc?" 
[puppet] - 10https://gerrit.wikimedia.org/r/918517 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:45:23] (03CR) 10Andrew Bogott: [C: 03+2] Openstack eqiad1 galera: make cloudcontrol1006 the database primary [puppet] - 10https://gerrit.wikimedia.org/r/918522 (https://phabricator.wikimedia.org/T336379) (owner: 10Andrew Bogott) [15:47:46] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2012.codfw.wmnet with OS bullseye [15:47:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye executed w... [15:48:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [15:48:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: haproxy: drop support for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/918517 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:49:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [15:53:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T335845)', diff saved to https://phabricator.wikimedia.org/P48152 and previous config saved to /var/cache/conftool/dbconfig/20230510-155357-ladsgroup.json [15:54:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [15:54:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: 
Maintenance [15:54:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:54:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:54:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T335845)', diff saved to https://phabricator.wikimedia.org/P48153 and previous config saved to /var/cache/conftool/dbconfig/20230510-155429-ladsgroup.json [15:54:55] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: haproxy: http-service.cfg.erb: fix template [puppet] - 10https://gerrit.wikimedia.org/r/918523 (https://phabricator.wikimedia.org/T324992) [15:55:21] (03CR) 10CI reject: [V: 04-1] cloudlb: haproxy: http-service.cfg.erb: fix template [puppet] - 10https://gerrit.wikimedia.org/r/918523 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:56:22] (03PS6) 10Giuseppe Lavagetto: scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 [15:57:14] (03PS2) 10Arturo Borrero Gonzalez: cloudlb: haproxy: http-service.cfg.erb: fix template [puppet] - 10https://gerrit.wikimedia.org/r/918523 (https://phabricator.wikimedia.org/T324992) [15:57:24] (03PS7) 10Giuseppe Lavagetto: scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 [15:57:39] (03CR) 10CI reject: [V: 04-1] cloudlb: haproxy: http-service.cfg.erb: fix template [puppet] - 10https://gerrit.wikimedia.org/r/918523 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:58:10] (03CR) 10Ssingh: [V: 03+1 C: 03+2] varnish: bump size of varnish shared memory log to 160M (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/918507 (https://phabricator.wikimedia.org/T253093) (owner: 10Ssingh) [15:58:17] (03PS5) 10David Caro: 
toolforge_cli: add api gateway url and builds endpoint [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) [15:58:23] (03PS3) 10Jbond: cloudlb: haproxy: http-service.cfg.erb: fix template [puppet] - 10https://gerrit.wikimedia.org/r/918523 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:58:45] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:58:59] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/918523 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:59:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: haproxy: http-service.cfg.erb: fix template [puppet] - 10https://gerrit.wikimedia.org/r/918523 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [15:59:28] (03CR) 10Dzahn: "This would also need adjustment of the httpbb tests on miscweb hosts I think." [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto) [15:59:38] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:59:51] (03CR) 10Hashar: "Thanks! The new group is populated on the deploy server:" [puppet] - 10https://gerrit.wikimedia.org/r/918481 (owner: 10Hashar) [16:00:27] (03CR) 10Hashar: [C: 03+2] "Dependent Puppet patch got deployed:" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918482 (owner: 10Hashar) [16:01:01] RECOVERY - Check systemd state on cloudlb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:06] (03Merged) 10jenkins-bot: scap: use gerrit dsh group from deployment server [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/918482 (owner: 10Hashar) [16:01:26] thcipriani: mutante: should I upgrade Gerrit on gerrit1003 right now? 
[16:01:37] RECOVERY - haproxy process on cloudlb2001-dev is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [16:01:38] note there is a uid/gid mismatch for the user gerrit2 between the hosts [16:01:47] I haven't reached that one yet [16:01:47] RECOVERY - Bird Internet Routing Daemon on cloudlb2001-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [16:02:11] (03CR) 10David Caro: [C: 03+2] toolforge_cli: add api gateway url and builds endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) (owner: 10David Caro) [16:02:20] !log sudo cumin -b1 -s1200 'A:cp and A:drmrs' 'varnish-frontend-restart': T253093 [16:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:24] T253093: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 [16:03:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T335845)', diff saved to https://phabricator.wikimedia.org/P48154 and previous config saved to /var/cache/conftool/dbconfig/20230510-160258-ladsgroup.json [16:03:06] (03PS6) 10David Caro: toolforge_cli: add api gateway url and builds endpoint [puppet] - 10https://gerrit.wikimedia.org/r/918478 (https://phabricator.wikimedia.org/T336225) [16:03:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [16:06:37] jouncebot: nowandnext [16:06:37] For the next 0 hour(s) and 23 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1430) [16:06:37] In 0 hour(s) and 53 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1700) [16:06:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [16:07:11] (03CR) 10BBlack: [C: 03+1] hiera: add dns2004 to ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/917881 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [16:09:46] 10SRE, 10ops-codfw, 10cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (10aborrero) 05Open→03Resolved As of today, cloudcontrol2001-dev.codfw.wmnet is a backend to cloudlb200X-dev.codfw.wmnet and they are communicating over... [16:10:10] hashar@gerrit-new.wikimedia.org: Permission denied (publickey). [16:10:10] grmblblbl [16:11:23] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:13:27] ah it has no repositories [16:13:52] so I don't think it is upgradable [16:14:36] hashar: thcipriani: it's because we deleted stuff.. we might have to rsync after all and then deploy? [16:15:26] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2012.codfw.wmnet with OS bullseye [16:15:36] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye executed w... [16:16:18] hashar: or are you just running into the same thing I ran into? 
(key denied because local ssh.config applies to gerrit but not gerrit-new) [16:16:30] so it might just use the prod key vs the gerrit key [16:16:41] that confused me at first [16:18:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P48155 and previous config saved to /var/cache/conftool/dbconfig/20230510-161806-ladsgroup.json [16:19:31] jouncebot: next [16:19:31] In 0 hour(s) and 40 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1700) [16:20:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [16:20:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [16:20:13] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: move openstack.codfw1dev.wikimediacloud.org to new VIP [dns] - 10https://gerrit.wikimedia.org/r/918525 (https://phabricator.wikimedia.org/T332153) [16:20:16] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10hashar) Upstream released a security update of Gerrit yesterday (3.5.6) I thus upgraded gerrit1001 and gerrit2002 to the new version this morning shortly a... 
[16:20:27] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudlb2001-dev is OK: OK: UP (pid=1769849) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [16:22:13] RECOVERY - haproxy alive on cloudlb2001-dev is OK: OK check_alive uptime 1286s https://wikitech.wikimedia.org/wiki/HAProxy [16:22:38] (03PS1) 10Stevemunene: Place airflow1006 in airflow role [puppet] - 10https://gerrit.wikimedia.org/r/918566 (https://phabricator.wikimedia.org/T333000) [16:23:39] (03PS1) 10Dzahn: Revert "Revert "gerrit: switch service IP, turn new into current and current into old"" [dns] - 10https://gerrit.wikimedia.org/r/918529 [16:25:19] (03CR) 10Dzahn: "we are doing this again tomorrow" [dns] - 10https://gerrit.wikimedia.org/r/918529 (owner: 10Dzahn) [16:25:33] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2012.codfw.wmnet with OS bullseye [16:25:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye executed w... [16:26:11] mutante: without git repositories, the gerrit1003 does not know anything about users since there is no /srv/gerrit/git/All-Users.git repository. It thus does not know about my ssh key ( I commented on https://phabricator.wikimedia.org/T326368#8842150 ) [16:26:24] cause the keys are not managed via ldap but in the Gerrit local user [16:26:56] so yeah I think we need to rsync in the repos. 
I don't think the upgrade will work without at least All-Users and All-Projects (or maybe it will create empty dummy ones which might be okish) [16:27:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [16:27:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [16:29:43] (03CR) 10DCausse: [C: 03+1] Update extra plugin to 7.10.2-wmf8 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/912995 (https://phabricator.wikimedia.org/T332355) (owner: 10Ebernhardson) [16:31:39] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2012.codfw.wmnet with OS bullseye [16:31:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye executed w... [16:32:03] hashar: I will fix /srv/gerrit/git on gerrit1003 [16:32:24] mutante: <3 [16:32:31] hashar: it was working already.. 
things just broke during our attempt [16:32:34] on it [16:32:38] yeah no worries [16:33:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P48156 and previous config saved to /var/cache/conftool/dbconfig/20230510-163312-ladsgroup.json [16:34:13] the actual reason it broke is that we synced "gerrit" where it should be "gerrit/gerrit" and then follow-up to that [16:36:08] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [16:36:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [16:45:23] hashar: currently copying "cobalt/git/". heh, cobalt :) [16:45:29] yeah [16:45:38] I should spend some time analyziing those legacy repositories [16:45:54] I'd like to keep them around cause they could hold objects we lost at some point in time [16:45:59] I think it's just a backup from the cobalt-gerrit1001 migration that we made "just in case" [16:46:04] yeah [16:46:07] ack [16:46:16] when ever doing a full reindexing there are a few changes missing [16:46:20] good to know they should stay though.. 
ok [16:46:28] some metadata have the wrong instance-id which might be the issue as well [16:46:34] and we have some old patchsets missing here and there [16:46:42] gotcha [16:46:49] it is a bit tedious to analyze thoroughly though :/ [16:46:57] yea..I can see that [16:47:20] for the switch over, you'd to rsync the lfs data as well [16:47:27] they are not replicated (I found that out this week :/ ) [16:47:59] (03PS6) 10Muehlenhoff: Add a generic Cassandra reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 [16:48:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T335845)', diff saved to https://phabricator.wikimedia.org/P48157 and previous config saved to /var/cache/conftool/dbconfig/20230510-164818-ladsgroup.json [16:48:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [16:48:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [16:48:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T335845)', diff saved to https://phabricator.wikimedia.org/P48158 and previous config saved to /var/cache/conftool/dbconfig/20230510-164842-ladsgroup.json [16:49:19] hashar: ooh, yea, that's a good point actually since I had already applied on the new host that lfs data is in the new location outside of root [16:49:23] (03CR) 10JHathaway: [C: 03+1] "looks good overall, I agree that a merge then further iteration, if necessary, sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/895356 
(https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [16:49:27] let me double check that too [16:49:45] yeah that is why I wanted to move the lfs data first and have the dir consistent on all hosts :] [16:50:03] so there would need a rsync from some path on gerrit1001 to some other path to gerrit1003 (whatever is the new one) [16:50:16] fair enough. yea, we are on the same page though [16:50:21] I will take care today [16:50:38] I have to find a solution to have the LFS data replicated, but maybe they can be moved to a S3 bucket on Swift (if that is even possible on our infra) [16:50:51] long tail of stuff, it is never ending :-] [16:50:55] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2012.codfw.wmnet with OS bullseye [16:51:03] agreed. there is always more:) [16:51:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye executed w... [16:51:09] (03PS1) 10Ssingh: Revert "lvs2012: commission new LVS host (codfw hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/918530 [16:51:34] and sorry for missing the wikitech-l announce, had I read that I would have upgraded Gerrit after the switch rather than this morning [16:51:36] hashar: re: swift, check out https://phabricator.wikimedia.org/T336234 [16:51:46] hashar: no problem [16:51:48] jouncebot: next [16:51:48] In 0 hour(s) and 8 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1700) [16:52:48] I am copying gerrit data but with bwlimit 10m.. without any limit it would take down gerrit [16:52:52] (03PS7) 10Muehlenhoff: Add a generic Cassandra reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 [16:52:58] so it's taking time.. 
I will check in again [16:53:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:54:06] we don't have QoS policies do we? [16:54:27] or maybe it is too cpu/diskIO heavy [16:54:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [16:54:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS bullseye [16:55:38] hashar: last time i did that it was just taking all bandwidth on the server so that there was none left for http server and you got timeouts in browser [16:55:54] not on network gear though afaict [16:56:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T335845)', diff saved to https://phabricator.wikimedia.org/P48159 and previous config saved to /var/cache/conftool/dbconfig/20230510-165601-ladsgroup.json [16:56:34] (03PS1) 10Kimberly Sarabia: Launch content separation Zebra AB Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918568 (https://phabricator.wikimedia.org/T335972) [16:56:55] fun [16:57:10] it would be good to know the actual limit though [16:57:21] because this is too slow:) [16:57:47] I wanted you to be able to deploy ... 
[16:58:03] maybe I should do /srv/gerrit/git first and let that cobalt stuff run over night [16:59:18] (03CR) 10Eevans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/917337 (owner: 10Muehlenhoff) [16:59:33] bathroom break [16:59:36] (03PS1) 10Majavah: trove: bind on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/918570 [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1700) [17:00:58] please note a deploy lock is in effect [17:01:11] if someone is here to deploy, I can lift it but I will need to revert a patch too, so please let me know [17:01:18] (it took a while as the host is not cooperating for the reimage :) [17:01:32] we should really just have: jouncebot: stop [17:02:14] mutante: yeah the long-term plan in case of the LVSes is to fix T334703 [17:02:14] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [17:02:27] but given these hosts need to be provisioned and are sitting, hence the scap lock [17:04:25] yea!:) [17:07:02] PROBLEM - DPKG on stat1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:11:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P48160 and previous config saved to /var/cache/conftool/dbconfig/20230510-171107-ladsgroup.json [17:11:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [17:12:40] (03CR) 10Btullis: "We were discussing the need to upload the spark3 assembly, particularly after we add iceberg support. 
I mentioned that this patch was read" [puppet] - 10https://gerrit.wikimedia.org/r/901670 (https://phabricator.wikimedia.org/T295072) (owner: 10Btullis) [17:15:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [17:16:56] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:16:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:18:04] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:20:47] arturo: ^^^that's because of openstack.codfw1dev [17:21:37] if you want a manual DNS you have to empty the DNS Name field in https://netbox.wikimedia.org/ipam/ip-addresses/12908/ [17:22:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49994 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:22:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.312 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:23:43] (03PS4) 10Krinkle: eventlogging: remove CentralNoticeTiming [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [17:23:50] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2012.codfw.wmnet with OS bullseye [17:24:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2012.codfw.wmnet with OS 
bullseye executed w... [17:25:12] 10SRE, 10LDAP-Access-Requests: Request to access motomo - https://phabricator.wikimedia.org/T336422 (10JJMC89) See https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Matomo#Access [17:25:19] hashar: confirmed we have gigabit connection and now syncing 10 times faster than before.. in progress though. I am not sure if you still wait for deploy tonight or it will be the morning [17:25:37] I will do it tonight :) [17:25:50] ok, hold on .. it's using 100mbit now [17:25:58] and started with ./git/ [17:26:00] just poke me when it is done and I will eventually do it at some point later [17:26:03] (03PS1) 10Cwhite: team-sre: add openapi/swagger alerts [alerts] - 10https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) [17:26:04] ok [17:26:07] it can be done anytime [17:26:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P48161 and previous config saved to /var/cache/conftool/dbconfig/20230510-172613-ladsgroup.json [17:26:24] at least the data rsync is time saved for tomorrow [17:26:44] but we would need to ensure the switch over does the rsync with data deletion [17:26:49] yea, thing is.. if it has to redo it all.. it took originally 4 fays [17:26:52] days [17:26:57] but with much slower speed [17:27:12] ouch [17:27:21] I ran it once..then again..then again.. closer to the date.. then things were deleted [17:31:44] hashar: /srv/gerrit/git/ is complete. now doing the rest of /srv/gerrit/ [17:33:05] fixed permissions -> gerrit2:gerrit2 [17:35:16] I am doing the upgrade :] [17:36:23] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10wiki_willy) a:03Jclark-ctr [17:36:34] there is a global lock ... 
[17:37:09] RECOVERY - DPKG on stat1008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:37:36] I guess we would need scap to be taught to support fine-grained locks :D [17:37:50] I will do the upgrade later tonight [17:41:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T335845)', diff saved to https://phabricator.wikimedia.org/P48162 and previous config saved to /var/cache/conftool/dbconfig/20230510-174119-ladsgroup.json [17:41:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [17:41:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [17:41:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T335845)', diff saved to https://phabricator.wikimedia.org/P48163 and previous config saved to /var/cache/conftool/dbconfig/20230510-174143-ladsgroup.json [17:42:31] (03CR) 10Ssingh: [C: 03+2] Revert "lvs2012: commission new LVS host (codfw hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/918530 (owner: 10Ssingh) [17:43:45] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10KFrancis) Hello all, NDAs for Robert Timm and Loren Johnson have been sent for signatures. I'll confirm when they are complete. I'm waiti... 
[17:49:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T335845)', diff saved to https://phabricator.wikimedia.org/P48164 and previous config saved to /var/cache/conftool/dbconfig/20230510-174859-ladsgroup.json [17:49:35] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) [17:50:09] (03PS2) 10RLazarus: remote: Clarify wait_reboot_since output [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 [17:50:19] PROBLEM - DPKG on stat1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:50:35] (03CR) 10RLazarus: remote: Clarify wait_reboot_since output (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 (owner: 10RLazarus) [17:50:37] 10SRE, 10SRE-Access-Requests, 10Infrastructure Security, 10Infrastructure-Foundations, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dwisehaupt) Thanks! I have verified that I can ssh to the puppetmaste... [17:51:52] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) I made several edits to the migration plan doc that is transcluded here from https://phabricator.wikimedia.org/P47782. See my comments there for de... [17:54:34] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) >>! In T326368#8842150, @hashar wrote: > ..but can't ssh into it. Turns out it does not have any of the git repositories under /srv/gerrit/git and... 
[17:58:56] jouncebot: next [17:58:56] In 0 hour(s) and 1 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1800) [17:58:56] In 0 hour(s) and 1 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1800) [17:59:36] PROBLEM - Host db2139 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:05] hashar and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1800). [18:00:05] hashar and brennen: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1800). Please do the needful. [18:00:18] brennen: train deployed earlier today [18:01:34] 10SRE, 10Infrastructure Security: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (10BCornwall) There was also mention of replacing pwstore entirely: Any developments on that front/Is there a ticket tracking that? I'd be more in favor of using an existing, well-tested solut... [18:03:28] sukhe: I ran the mediawiki train earlier today so there is nothing to run this hour :) [18:03:59] hashar: ah! 
thanks :) [18:04:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P48165 and previous config saved to /var/cache/conftool/dbconfig/20230510-180406-ladsgroup.json [18:04:15] jouncebot: nowandnext [18:04:15] For the next 0 hour(s) and 55 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1800) [18:04:16] For the next 1 hour(s) and 55 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T1800) [18:04:16] In 1 hour(s) and 55 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T2000) [18:04:37] so I guess you are all set until the UTC backport window in a couple hours [18:04:39] happy upgrades! [18:04:58] yep thanks [18:05:11] going to lock scap again then [18:08:43] (03PS1) 10Jbond: dnsquery: bump to v5.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/918580 [18:10:22] (03CR) 10Ssingh: [C: 03+2] hiera: add dns2004 to ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/917881 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [18:12:30] * sukhe on standby if resolv.conf lets us down and brings down all recursors [18:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:13:27] ok [18:15:13] (03PS1) 10Eevans: hierdata: add swift (thanos) mw-event-enrichment account [puppet] - 10https://gerrit.wikimedia.org/r/918582 (https://phabricator.wikimedia.org/T330693) [18:17:13] 10SRE, 10LDAP-Access-Requests: Request to access motomo - https://phabricator.wikimedia.org/T336422 (10Aklapper) 05Open→03Invalid Hi @Fuying, please follow the instructions 
which lead to a Phabricator form with a list of information to provide. As that form is pre-filled by clicking a link, I am going to r... [18:17:39] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) @hashar @thcipriani status update: I have freshly: - rsynced /srv/gerrit - rsynced /var/lib/gerrit2 - for the lfs path change: copied /srv/gerrit/... [18:19:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P48166 and previous config saved to /var/cache/conftool/dbconfig/20230510-181912-ladsgroup.json [18:19:49] (03PS1) 10Eevans: hierdata: add mw_event_enrichment swift account (thanos) [labs/private] - 10https://gerrit.wikimedia.org/r/918583 (https://phabricator.wikimedia.org/T330693) [18:20:32] (03PS1) 10Jdrewniak: Enable Vector "Zebra" AB test on beta cluster. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918584 (https://phabricator.wikimedia.org/T335972) [18:20:38] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:20:59] (03PS2) 10Ssingh: hiera: decommission dns2001 [puppet] - 10https://gerrit.wikimedia.org/r/917365 (https://phabricator.wikimedia.org/T335777) [18:21:13] (03PS2) 10Eevans: hierdata: add swift (thanos) mw-event-enrichment account [puppet] - 10https://gerrit.wikimedia.org/r/918582 (https://phabricator.wikimedia.org/T330693) [18:21:26] (03PS2) 10Ssingh: sites.yaml: remove dns2001 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/917364 (https://phabricator.wikimedia.org/T335777) [18:21:43] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/918582 (https://phabricator.wikimedia.org/T330693) (owner: 10Eevans) [18:21:48] hashar: see ticket and -releng, you can try. 
I am done for now and be back later [18:21:52] (03CR) 10Xcollazo: Upload the spark3-assemly file to HDFS on the test cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/901670 (https://phabricator.wikimedia.org/T295072) (owner: 10Btullis) [18:21:59] everything synced and copied lfs in place too [18:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [18:26:11] (03CR) 10Eevans: [C: 03+2] hierdata: add mw_event_enrichment swift account (thanos) [labs/private] - 10https://gerrit.wikimedia.org/r/918583 (https://phabricator.wikimedia.org/T330693) (owner: 10Eevans) [18:26:14] (03CR) 10Eevans: [V: 03+2 C: 03+2] hierdata: add mw_event_enrichment swift account (thanos) [labs/private] - 10https://gerrit.wikimedia.org/r/918583 (https://phabricator.wikimedia.org/T330693) (owner: 10Eevans) [18:26:54] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/918582 (https://phabricator.wikimedia.org/T330693) (owner: 10Eevans) [18:27:09] (03CR) 10BCornwall: [C: 03+1] hiera: decommission dns2001 [puppet] - 10https://gerrit.wikimedia.org/r/917365 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh) [18:29:29] (03CR) 10Xcollazo: "Let's wait for Aqu's input on this one." 
[puppet] - 10https://gerrit.wikimedia.org/r/901670 (https://phabricator.wikimedia.org/T295072) (owner: 10Btullis) [18:31:26] (03PS1) 10Jdlrobson: Remove unnecessary jQuery closure [extensions/PageTriage] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/918531 (https://phabricator.wikimedia.org/T324913) [18:32:08] (03PS1) 10Andrew Bogott: mwopenstackclients: make use of all_tenants when listing vms in all projects [puppet] - 10https://gerrit.wikimedia.org/r/918585 [18:34:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T335845)', diff saved to https://phabricator.wikimedia.org/P48167 and previous config saved to /var/cache/conftool/dbconfig/20230510-183418-ladsgroup.json [18:34:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [18:34:26] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: make use of all_tenants when listing vms in all projects [puppet] - 10https://gerrit.wikimedia.org/r/918585 (owner: 10Andrew Bogott) [18:34:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [18:34:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1218 (T335845)', diff saved to https://phabricator.wikimedia.org/P48168 and previous config saved to /var/cache/conftool/dbconfig/20230510-183441-ladsgroup.json [18:38:28] (03PS1) 10Andrew Bogott: Openstack Neutron: double number of api workers [puppet] - 10https://gerrit.wikimedia.org/r/918586 (https://phabricator.wikimedia.org/T336379) [18:38:42] (03PS2) 10Eevans: ml-cache: upgrade Cassandra to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/917407 (https://phabricator.wikimedia.org/T335383) [18:40:02] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Neutron: double number of api workers [puppet] - 10https://gerrit.wikimedia.org/r/918586 (https://phabricator.wikimedia.org/T336379) (owner: 10Andrew 
Bogott) [18:40:07] (03CR) 10Eevans: [C: 03+2] ml-cache: upgrade Cassandra to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/917407 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [18:42:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T335845)', diff saved to https://phabricator.wikimedia.org/P48169 and previous config saved to /var/cache/conftool/dbconfig/20230510-184202-ladsgroup.json [18:42:05] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) Just realized one more thing we have to remember / add to the plan. After we switch we must active replication from gerrit1003 and deactivate it fr... [18:43:29] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Rolling restart to apply Cassandra 3.11.14 upgrade - eevans@cumin1001 [18:45:40] !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 (duration: 191m 53s) [18:45:44] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [18:48:17] (03CR) 10JHathaway: [C: 03+1] dnsquery: bump to v5.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/918580 (owner: 10Jbond) [18:49:51] (03PS2) 10Dzahn: Revert "Revert "gerrit: switch service IP, turn new into current and current into old"" [dns] - 10https://gerrit.wikimedia.org/r/918529 (https://phabricator.wikimedia.org/T326368) [18:50:20] (03PS1) 10Dzahn: gerrit: enable replication from gerrit1003, disable from gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/918589 (https://phabricator.wikimedia.org/T326368) [18:51:21] ~.~. [18:51:53] jbond: terminating an ssh session? 
[18:52:25] urandom: indeed :) [18:52:36] haha [18:53:54] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10Dzahn) steps we have to add: - move/copy lfs data which is in new location on new host (for a good reason) (https://gerrit.wikimedia... [18:54:50] !log milimetric@deploy1002 Started deploy [analytics/refinery@4ccc172]: Regular analytics weekly train [analytics/refinery@4ccc172] [18:57:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P48170 and previous config saved to /var/cache/conftool/dbconfig/20230510-185710-ladsgroup.json [18:58:51] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10Volans) I tried to ping from `lvs2012` few hosts in row C and all fails, so I think is the connection with the row C that is misconfigur... [18:59:29] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10Jclark-ctr) Dell Service Request 167860985 was successfully submitted. [19:00:06] 10SRE, 10Infrastructure Security: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (10Dzahn) >>! In T298194#8842549, @BCornwall wrote: > maintain a custom solution. Without an opinion on the rest of the question, but it's not a custom in-house solution. The upstream is actu... 
[19:00:57] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Rolling restart to apply Cassandra 3.11.14 upgrade - eevans@cumin1001 [19:05:07] 10SRE, 10SRE-Access-Requests, 10Infrastructure Security, 10Infrastructure-Foundations, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dzahn) 05In progress→03Resolved Ok, thank you! Cool. At this poin... [19:05:55] 10SRE, 10Infrastructure Security: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (10BCornwall) I stand corrected! Thank you :) That said, it seems quite inactive and wasn't exactly full of contributors. [19:08:14] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Rolling restart to apply Cassandra 3.11.14 upgrade - eevans@cumin1001 [19:10:16] (03CR) 10Volans: [C: 03+1] "Nice! One last optional nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 (owner: 10RLazarus) [19:11:04] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:12:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P48171 and previous config saved to /var/cache/conftool/dbconfig/20230510-191216-ladsgroup.json [19:15:30] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10ssingh) >>! In T336428#8842727, @Volans wrote: > I tried to ping from `lvs2012` few hosts in row C and all fails, so I think is the c... 
[19:16:29] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) Thanks KFrancis! Hi WMDE requestors, as was mentioned in T335941#8832551 we needed separate tickets here anyways. > they are comp... [19:21:03] 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10BCornwall) 05Open→03In progress [19:25:52] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Rolling restart to apply Cassandra 3.11.14 upgrade - eevans@cumin1001 [19:27:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T335845)', diff saved to https://phabricator.wikimedia.org/P48172 and previous config saved to /var/cache/conftool/dbconfig/20230510-192722-ladsgroup.json [19:27:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [19:27:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [19:27:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1219 (T335845)', diff saved to https://phabricator.wikimedia.org/P48173 and previous config saved to /var/cache/conftool/dbconfig/20230510-192746-ladsgroup.json [19:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T335845)', diff saved to https://phabricator.wikimedia.org/P48174 and previous config saved to /var/cache/conftool/dbconfig/20230510-193455-ladsgroup.json [19:35:18] !log milimetric@deploy1002 Finished deploy [analytics/refinery@4ccc172]: Regular analytics weekly train [analytics/refinery@4ccc172] (duration: 40m 28s) [19:36:07] (03CR) 10Aqu: "I would prefer to generate the assembly as an extra artifact from 
GitlabCI. It allows us to test the jar before it reaches production. And" [puppet] - 10https://gerrit.wikimedia.org/r/901670 (https://phabricator.wikimedia.org/T295072) (owner: 10Btullis) [19:41:11] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) [19:41:55] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (10Dzahn) 05Open→03In progress p:05Triage→03Medium out for signature - T335941#8842473 [19:42:59] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (10Dzahn) a:03adee_wmde [19:44:49] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T336436 (10Dzahn) Oh, sorry, duplicate of ticket T335858 [19:45:27] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T336436 (10Dzahn) 05Open→03Invalid [19:46:10] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) [19:46:42] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Robert Timm (WMDE) - https://phabricator.wikimedia.org/T336435 (10Dzahn) a:03roti_WMDE [19:46:55] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Robert Timm (WMDE) - https://phabricator.wikimedia.org/T336435 (10Dzahn) p:05Triage→03Medium [19:47:00] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Robert Timm (WMDE) - https://phabricator.wikimedia.org/T336435 (10Dzahn) 05Open→03In progress out for signature - T335941#8842473 [19:47:14] !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=codfw [19:47:42] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for 
lojo - https://phabricator.wikimedia.org/T335858 (10Dzahn) out for signature per T335941#8842473 [19:48:04] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10Dzahn) a:03lojo_wmde [19:48:31] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) a:05darthmon_wmde→03Dzahn [19:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P48175 and previous config saved to /var/cache/conftool/dbconfig/20230510-195001-ladsgroup.json [19:51:46] (03CR) 10Dzahn: [C: 03+1] "looks good to me, as clinic duty, waiting for infra-security to confirm" [puppet] - 10https://gerrit.wikimedia.org/r/918519 (https://phabricator.wikimedia.org/T336357) (owner: 10Btullis) [19:54:04] (03PS3) 10RLazarus: remote: Clarify wait_reboot_since output [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 [19:54:10] (03CR) 10RLazarus: remote: Clarify wait_reboot_since output (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 (owner: 10RLazarus) [19:56:00] (03PS1) 10Dzahn: microsites: change rewrite rule for https://transparency.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301) [19:56:45] 10SRE, 10WMF-Legal, 10serviceops-collab, 10wikimediafoundation.org, 10Patch-For-Review: Update redirect for transparency.wikimedia.org - https://phabricator.wikimedia.org/T336301 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [19:57:05] Deloyer: im here for deployment but i will be 5 mins late because I need to pick up my lunch :) [19:57:26] 10SRE, 10WMF-Legal, 10serviceops-collab, 10wikimediafoundation.org, 10Patch-For-Review: Update redirect for transparency.wikimedia.org - https://phabricator.wikimedia.org/T336301 (10Dzahn) Hi, does this diff look right? 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/918594/1/modules/profile/templat... [19:59:52] !log milimetric@deploy1002 Started deploy [analytics/refinery@4ccc172] (thin): Regular analytics weekly train THIN [analytics/refinery@4ccc172] [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230510T2000). [20:00:04] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] !log milimetric@deploy1002 Finished deploy [analytics/refinery@4ccc172] (thin): Regular analytics weekly train THIN [analytics/refinery@4ccc172] (duration: 00m 26s) [20:00:31] !log milimetric@deploy1002 Started deploy [analytics/refinery@4ccc172] (thin): Regular analytics weekly train THIN [analytics/refinery@4ccc172] [20:00:37] !log milimetric@deploy1002 Finished deploy [analytics/refinery@4ccc172] (thin): Regular analytics weekly train THIN [analytics/refinery@4ccc172] (duration: 00m 05s) [20:00:40] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 3 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10Dzahn) 05Open→03In progress p:05Triage→03High [20:01:43] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS bullseye [20:02:22] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 3 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10Dzahn) Not sure if a manager has to say approved on ticket for this or not. +1 to the patch but waiti... 
[20:03:04] 10SRE, 10Scap, 10serviceops, 10Release-Engineering-Team (Seen): Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207 (10Dzahn) [20:04:09] 10SRE, 10Infrastructure-Foundations, 10LDAP: Create auto-populated LDAP group of those who have production shell access - https://phabricator.wikimedia.org/T271587 (10Dzahn) [20:04:42] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting: Improve alerting for hosts with Puppet disabled for longer periods - https://phabricator.wikimedia.org/T277083 (10Dzahn) [20:05:04] Jdlrobson: you around? i can deploy if so (sorry to be late again) [20:05:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P48176 and previous config saved to /var/cache/conftool/dbconfig/20230510-200508-ladsgroup.json [20:05:10] 10SRE, 10Infrastructure-Foundations, 10Security, 10User-MoritzMuehlenhoff: Investigate iptables replacements - https://phabricator.wikimedia.org/T279683 (10Dzahn) [20:05:22] cjming: around! [20:05:32] (03PS2) 10Jdlrobson: Enable Vector "Zebra" AB test on beta cluster. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918584 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak) [20:05:34] 10SRE, 10Infrastructure Security, 10User-MoritzMuehlenhoff: Sensible updates of java.security properties - https://phabricator.wikimedia.org/T282545 (10Dzahn) [20:05:46] In addition to the backport, could you also help me merge this beta cluster only change: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/918584 ? [20:05:55] sure - np [20:06:07] (03CR) 10Volans: [C: 03+1] "LGTM, final nit on the tests if you've time or I can pick them up tomorrow." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 (owner: 10RLazarus) [20:06:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [extensions/PageTriage] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/918531 (https://phabricator.wikimedia.org/T324913) (owner: 10Jdlrobson) [20:09:02] 10SRE, 10observability, 10serviceops: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10Dzahn) [20:10:22] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team: Track source of packages in reprepro - https://phabricator.wikimedia.org/T105385 (10Dzahn) [20:12:08] (03Merged) 10jenkins-bot: Remove unnecessary jQuery closure [extensions/PageTriage] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/918531 (https://phabricator.wikimedia.org/T324913) (owner: 10Jdlrobson) [20:12:14] 10SRE, 10SRE-tools, 10Spicerack: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180 (10Dzahn) [20:12:38] !log cjming@deploy1002 Started scap: Backport for [[gerrit:918531|Remove unnecessary jQuery closure (T324913)]] [20:12:42] T324913: Curation toolbar fails to load occasionally for pages in the PageTriage queue - https://phabricator.wikimedia.org/T324913 [20:12:53] 10SRE: Feedback Appreciated: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Dzahn) 05Open→03Resolved a:03Dzahn [20:13:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180 (10taavi) Could this be closed in favour of {T268344}? [20:14:12] !log cjming@deploy1002 cjming and jdlrobson: Backport for [[gerrit:918531|Remove unnecessary jQuery closure (T324913)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:14:17] Jdlrobson: wanna test? 
[20:14:18] 10SRE, 10Infrastructure Security: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671 (10Dzahn) [20:14:35] on it [20:15:23] @cjming looks good to sync [20:15:28] great - syncing [20:16:16] (when you are done I will do a scap deploy for one of the Gerrit host which should not affect anything) [20:17:10] hashar: sounds good - will ping when done here shortly [20:17:43] 10SRE, 10Wikimedia-Apache-configuration, 10serviceops-radar: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176 (10Dzahn) [20:19:21] cjming: can we +2 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/918568 now or does that need to wait for the above? [20:20:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T335845)', diff saved to https://phabricator.wikimedia.org/P48177 and previous config saved to /var/cache/conftool/dbconfig/20230510-202014-ladsgroup.json [20:20:34] Jdlrobson: almost done syncing -- i've been advised in the past to scap things sequentially but maybe bec it's labs it doesn't matter? [20:21:40] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:918531|Remove unnecessary jQuery closure (T324913)]] (duration: 09m 02s) [20:21:44] T324913: Curation toolbar fails to load occasionally for pages in the PageTriage queue - https://phabricator.wikimedia.org/T324913 [20:22:56] Jdlrobson: should i do your labs + https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/918568 ? [20:23:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918584 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak) [20:24:27] (03Merged) 10jenkins-bot: Enable Vector "Zebra" AB test on beta cluster. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/918584 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak) [20:25:07] Jdlrobson: not sure what the latency is but labs change should be visible soonish [20:25:21] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) not resolved yet - current status: https://os-reports.wikimedia.org/stretch.html [20:25:36] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) p:05Low→03Medium [20:25:53] thanks cjming ! [20:25:53] 10SRE, 10Infrastructure Security, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [20:26:07] Jdlrobson: np! is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/918568 something you still want to deploy today? [20:26:10] yeh but a +2 on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/918568 would be great [20:26:20] oh wait [20:26:22] not that one [20:26:24] no just beta [20:26:27] (03CR) 10Jdlrobson: [C: 04-1] Launch content separation Zebra AB Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918568 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia) [20:26:50] (03CR) 10Jdlrobson: [C: 04-1] "(to be deployed week of 15th)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918568 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia) [20:26:53] Jdlrobson: ok, beta one is done [20:27:34] gonna close the window if there's nothing else? 
[20:27:40] 10SRE, 10Release-Engineering-Team, 10serviceops-collab: URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (10Dzahn) [20:28:53] 10SRE, 10Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10Dzahn) [20:29:36] 10SRE, 10Security-Team, 10WMF-General-or-Unknown, 10NewFunctionality-Worktype: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10Dzahn) @Aklapper Wanna bring it up with the new security lead? [20:32:38] taking radio silence as consent (which i would never do in any other circumstance because time) [20:32:46] !log end of UTC late backport window [20:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:51] hashar: all yours [20:33:09] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Release-Engineering-Team (Seen): Upgrade ci ssh key to ecdsa - https://phabricator.wikimedia.org/T177826 (10Dzahn) [20:33:12] amazing. Thank you :) [20:33:37] !log hashar@deploy1002 Started deploy [gerrit/gerrit@e815301]: Gerrit to 3.5.6 on gerrit1003 | T336339 [20:33:41] T336339: Upgrade Gerrit from 3.5.5 to 3.5.6 - https://phabricator.wikimedia.org/T336339 [20:33:43] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@e815301]: Gerrit to 3.5.6 on gerrit1003 | T336339 (duration: 00m 06s) [20:35:06] (03PS1) 10Bking: [WIP]wdqs: Activate wdqs2021 [puppet] - 10https://gerrit.wikimedia.org/r/918597 (https://phabricator.wikimedia.org/T321605) [20:36:26] mutante: thcipriani: gerrit1003 now has Gerrit 3.5.6 \o/ [20:37:36] hashar: yay!:) thanks [20:38:08] confirmed 3.5.6. 
why no repos listed remains to be seen / separate [20:38:22] (03PS2) 10TChin: Add flink-app default log config and use it in page_content_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) [20:38:29] added a step to the plan to copy lfs data, fwiw [20:38:54] the repos are there https://gerrit-new.wikimedia.org/r/admin/repos [20:39:13] oh, good! ok, it seems to be caching [20:39:40] Gerrit stores all the information in git repositories which are terribly slow to browse though [20:40:02] so the infos are processed and indexed with Lucene [20:40:13] and on top of that there are memory/disk caches [20:40:18] can you see an individual change link? [20:40:25] ok, ack, yea [20:41:18] they are /var/lib/gerrit2/review_site/index [20:41:19] yes, confirmed. I can see individual changes if I edit URL to insert the "new" [20:41:48] well, great! lgtm [20:42:33] I will take a break and not touch it for now. rsync diff tomorrow will be fast enough now that it's 100m [20:42:36] and if there is nothing on https://gerrit-new.wikimedia.org/r/q/is:open .. well something is severly broken [20:43:41] the thing is.. this worked before and see no reason for a change [20:44:39] running "gerrit index changes" is part of the plan though [20:45:12] that takes like 2 hours iirc [20:45:58] heh [20:46:11] it does for all changes, but it's an online command, so it'll gradually (re)populate things [20:46:24] cant hurt to run it now, can it? [20:46:28] and instance will be functional [20:46:44] no, probably wouldn't hurt anything, might get clobbered? 
[20:47:00] also, we forgot about replication today [20:47:05] oh good [20:47:08] (03PS3) 10TChin: Add flink-app default log config and use it in page_content_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) [20:47:11] we needed a step where we tell it that now 1003 is the source [20:47:16] of repl to 2002 and guthub [20:47:18] github [20:47:48] yeah, something in the template for the replication config: i.e., don't populate this file unless you're the primary [20:47:52] (03CR) 10CI reject: [V: 04-1] Add flink-app default log config and use it in page_content_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [20:48:00] (03PS4) 10TChin: Add flink-app default log config and use it in page_content_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) [20:48:21] thcipriani: https://gerrit.wikimedia.org/r/c/operations/puppet/+/918589 [20:48:38] based on host name right now.. well... [20:48:51] ah found it. /var/lib/gerrit2/review_site/index/changes_0071 is empty [20:49:02] that was the recent bug when old and new gerrit were both fighting over replication [20:49:12] (03PS5) 10TChin: Add flink-app default log config and use it in page_content_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) [20:49:13] oh? [20:49:22] how so.. hmm [20:49:49] bad luck because sync happened without stopping service on the source ? 
[20:50:09] (03PS6) 10TChin: Add flink-app default log config and use it in page_content_change [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) [20:50:42] I don't know [20:50:55] I do see for example ./open/write.lock inside that dir [20:51:07] (03CR) 10TChin: Add flink-app default log config and use it in page_content_change (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [20:51:11] so it's not really empty dir [20:51:29] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10cmooney) Definitely odd. I can ping fine with v6 (default) from cumin2002 as of now: ` cmooney@cumin2002:~$ ping lvs2012 PING lvs201... [20:52:03] it is empty in the sense there is no data :] [20:52:16] anyway, one would have to rsync /var/lib/gerrit2/review_site/index/changes_0071 [20:52:27] well, I should repeat the sync of /var/lib/gerrit2 then.. 
but i dont get it [20:52:34] and most certainly do that after Gerrit got stopped on gerrit1001 [20:52:45] yea, so that is part of the normal plan [20:52:53] just not happening during "pre-sync" [20:52:55] so far [20:53:07] same for caches,they should be synced after Gerrit got stopped [20:53:10] which was just to make it faster during migration [20:53:20] yea, that's going to happen [20:53:37] so the question is just if we want to do that before as well [20:53:45] because downtime [20:53:47] and I don't know why the changes_071 Lucene index data did not get rsynced :/ [20:54:24] (and I should probably remove the old obsolete indices cause I don't think we will downgrade) [20:54:26] maybe it was still reindexing while the sync happened [20:54:40] dont have another explanation [20:54:47] did the whole tree [20:55:24] !log milimetric@deploy1002 Started deploy [airflow-dags/analytics@02d6ac9]: (no justification provided) [20:55:35] !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@02d6ac9]: (no justification provided) (duration: 00m 11s) [20:55:44] yeah, it's interesting the review_site/index/gerrit_index.config points to changes_0071 as ready....but it's not :) [20:55:57] guess this is just a fragment of the sync ordering [20:56:24] a fresh rsync of the /var/lib part would be fast now [20:58:23] yeah, let's see if that fixes it [20:58:28] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2021.codfw.wmnet with OS bullseye [20:58:57] I am off to sleep. Happy afternoon! [20:59:08] hashar: good night [20:59:16] thcipriani: but with taking down gerrit1001? [20:59:56] I can just do it first without stopping the service though [21:03:16] yeah, don't want to take down gerrit1001. Those files seem to get locked for write every few minutes (so whenever a new change comes in), but if you catch it between writes should Just Work™ [21:03:27] syncing.. 
hold on [21:03:48] also I have to add this: rsync -avp /srv/gerrit/plugins/lfs/ /srv/gerrit/data/lfs/ on 1003 [21:04:17] I see the "changes_0071" files being copied already [21:04:33] 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10BCornwall) 05In progress→03Resolved Resolving since it appears to... [21:05:41] and of course this angers the jvm on gerrit1003 :) [21:06:07] just for a moment.. [21:07:05] thcipriani: done. synced, synced lfs.. chown ... [21:07:12] service restart? [21:07:43] sure, right, the chown is what angers the jvm [21:07:58] restarting on 1003 [21:08:35] hrm, still no changes [21:08:37] ..and it's like before [21:08:49] sad.. because it was _already working_ [21:08:57] what changed [21:09:05] just the gerrit version? [21:09:34] not even that since it was like this before and after deploy [21:10:10] is it because /srv/deployment/gerrit is messed up on 1003 [21:10:16] and plugins links into that [21:10:58] heh, looks like there's another write.lock in the indexes directory [21:11:08] thcipriani: /srv/deployment on 1001 is 1.1G but on 1003 it is merely 641M [21:11:12] (and that's it [21:11:19] oh [21:12:17] I could try starting the indexer, just to verify that's it [21:12:34] yes let's do that [21:12:45] per the "takes 2 hours" [21:13:56] I am also going to kill the rsync daemon process on 1003 and let puppet restart it... hmmmmm [21:14:12] you know..because I edited the config once [21:15:08] thcipriani: oh, now look at this: [21:15:16] puppet changes permissions to some files [21:15:21] under /var/lib/gerrit2 [21:15:32] in /plugins/ [21:16:21] and another thing! puppet removed the github remote from /var/lib/gerrit2/review_site/etc/replication.config ! [21:17:28] thcipriani: I see my changes now! 
[21:17:29] starting the indexer started populating changes [21:17:46] I see things here now: https://gerrit-new.wikimedia.org/r/q/status:open+-is:wip [21:18:24] thcipriani: can you check if it is trying to replicate [21:18:33] (and there are like 2,600 indexing tasks in the queue, we'll see how long it takes to work through them) [21:18:57] so what happens is.. when I rsync.. the config is also synced that includes the replication to 2002 [21:19:04] then whenever puppet runs next.. it removes it [21:19:22] it's possible it already started doing it again [21:21:40] it looks like it's got nothing in the queue for replication afaict [21:21:53] ok, great! [21:22:49] happy that https://gerrit-new.wikimedia.org/r/q/status:open+-is:wip works for me.. just https://gerrit-new.wikimedia.org/r/dashboard/self doesn't seem to yet.. [21:26:03] works for me (not all the same tasks, but it's not done indexing yet) [21:26:04] thcipriani: it just started working :) [21:26:08] ack [21:26:16] dashboard self did [21:27:18] ok, not so worried anymore now.. [21:29:31] or back to the base amount of worried, anyway [21:31:30] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS buster [21:31:52] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2021.codfw.wmnet with OS buster [21:32:13] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS buster [21:34:29] (also, indexing is now done) [21:48:52] cool. it seems fine to me. 
going afk for now [21:49:35] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [21:52:57] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [22:06:54] (03PS4) 10RLazarus: remote: Clarify wait_reboot_since output [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 [22:07:21] (03CR) 10RLazarus: remote: Clarify wait_reboot_since output (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/918000 (owner: 10RLazarus) [22:08:47] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2021.codfw.wmnet with OS buster [22:13:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:16:10] 10SRE-OnFire, 10Observability-Metrics, 10Sustainability (Incident Followup), 10User-fgiunchedi: ThanosCompactHalted error on overlapping blocks - https://phabricator.wikimedia.org/T335406 (10andrea.denisse) [22:21:42] 10SRE-OnFire, 10Observability-Alerting, 10Sustainability (Incident Followup): Alert when no data is received from Prometheus in a certain amount of time - https://phabricator.wikimedia.org/T336448 (10andrea.denisse) [22:23:33] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [22:24:18] 10SRE-OnFire, 10Observability-Alerting, 10Sustainability (Incident Followup): Alert when no data is received from Prometheus in a certain amount of time - https://phabricator.wikimedia.org/T336448 (10andrea.denisse) a:03andrea.denisse [22:29:11] 10SRE-OnFire, 10Observability-Alerting, 10Sustainability (Incident 
Followup): Alert when no data is received from Prometheus in a certain amount of time - https://phabricator.wikimedia.org/T336448 (10andrea.denisse) [22:39:41] (03PS1) 10Dwisehaupt: Add dns for new frack codfw bastion [dns] - 10https://gerrit.wikimedia.org/r/918608 (https://phabricator.wikimedia.org/T334505) [22:54:27] 10SRE, 10Security-Team, 10WMF-General-or-Unknown, 10NewFunctionality-Worktype: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10Aklapper) @dzahn: I myself don't plan to, as [I'm not a fan of GPG](https://latacora.singles/201... [22:57:58] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10Jhancock.wm) Dell reached out after the two tickets self-dispatch tickets were escalated. I've submitted a TSR report for this one. [23:14:53] (03PS1) 10BryanDavis: {bullseye,buster}-sssd: add openssh-client [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/918610 (https://phabricator.wikimedia.org/T258841) [23:26:21] 10SRE, 10ops-codfw, 10DBA, 10database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (10Jhancock.wm) the error reoccured today. but a Dell TSR report was submitted and a new part is most likely on the way. [23:52:07] (03PS1) 10Aaron Schulz: Remove innodb_lock_wait_timeout from the DatabaseMysqli SET statement in open() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918612