[00:13:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:38:25] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949214 [00:38:31] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949214 (owner: TrainBranchBot) [00:40:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949215 [00:40:35] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949215 (owner: TrainBranchBot) [00:42:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:32] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:34] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949214 (owner: TrainBranchBot) [00:55:02] (ConfdResourceFailed) firing: (96) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:55:58] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] 
(wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/949215 (owner: TrainBranchBot) [01:27:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:44] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:51:32] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:53:44] !log fab@deploy1002 Started deploy [airflow-dags/research@ff0a21b]: (no justification provided) [01:54:05] !log fab@deploy1002 Finished deploy [airflow-dags/research@ff0a21b]: (no justification provided) (duration: 00m 20s) [01:55:39] !log fab@deploy1002 Started deploy [airflow-dags/research@ff0a21b]: (no justification provided) [01:55:58] !log fab@deploy1002 Finished deploy [airflow-dags/research@ff0a21b]: (no justification provided) (duration: 00m 19s) [01:58:45] !log fab@deploy1002 Started deploy [airflow-dags/research@ff0a21b]: (no justification provided) [01:59:07] !log fab@deploy1002 Finished deploy [airflow-dags/research@ff0a21b]: (no justification provided) (duration: 00m 22s) [02:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:07] 
(ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:11:39] (JobUnavailable) firing: (3) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:08] PROBLEM - MegaRAID on an-worker1112 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:31:39] (JobUnavailable) firing: (3) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:40] RECOVERY - MegaRAID on an-worker1112 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:46:35] (PS1) Andrew Bogott: wmsc-backup: correct ids passed for differential image backup [puppet] - https://gerrit.wikimedia.org/r/949610 [02:48:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:49:05] (CR) CI reject: [V: -1] wmsc-backup: correct ids passed for differential image backup [puppet] - https://gerrit.wikimedia.org/r/949610 (owner: Andrew Bogott) [02:50:01] (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:52:40] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:08:33] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T344394 (phaultfinder) [03:28:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:32:42] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:35:46] PROBLEM - MegaRAID on an-worker1112 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:44:30] (PS1) Stang: zhwiki: Create abusefilter-helper group [mediawiki-config] - https://gerrit.wikimedia.org/r/949612 (https://phabricator.wikimedia.org/T344398) [03:44:32] (PS1) Stang: zhwiki: Remove abusefilter-(log|view)-private from rollbacker [mediawiki-config] - https://gerrit.wikimedia.org/r/949613 (https://phabricator.wikimedia.org/T344398) [03:56:46] RECOVERY - MegaRAID on an-worker1112 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:13:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:27:33] (PS1) KartikMistry: Update cxserver to 2023-08-14-091804-production [deployment-charts] - https://gerrit.wikimedia.org/r/949619 (https://phabricator.wikimedia.org/T336683) [04:29:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:33:38] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:55:02] (ConfdResourceFailed) firing: (96) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [05:30:26] SRE, PyBal, Scap, Traffic, and 3 others: High rate of errors and increased latency on uncached MediaWiki requests due to infrastructure outage - https://phabricator.wikimedia.org/T337497 (Joe) Open→Resolved a:Joe The problem that caused this outage has been fixed. 
[05:33:22] SRE-OnFire, Incident Tooling, Patch-For-Review, User-Joe: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 (CodeReviewBot) oblivian merged https://gitlab.wikimedia.org/repos/sre/vopsbot/-/merge_requests/11 Allow users to be part of multiple teams [05:41:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:48:07] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/948693 (https://phabricator.wikimedia.org/T343957) (owner: Cwhite) [05:52:44] SRE, Infrastructure-Foundations: Retire role::spare::system - https://phabricator.wikimedia.org/T324475 (MoritzMuehlenhoff) [05:52:46] SRE, Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (MoritzMuehlenhoff) Open→Declined The server will be decommissioned with the rest of old-esams via T343957 [05:54:54] (PS1) Ayounsi: esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219) [05:55:25] (CR) CI reject: [V: -1] esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi) [05:56:03] (PS2) Ayounsi: esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T0600) [06:00:05] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T0600). Please do the needful. [06:01:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:04:08] (CR) Stevemunene: datahub: Enable OIDC to idp_test (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene) [06:07:08] (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:12:30] (PS1) Dreamy Jazz: Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949573 (https://phabricator.wikimedia.org/T344403) [06:13:15] (PS1) Dreamy Jazz: Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949574 (https://phabricator.wikimedia.org/T344403) [06:17:01] (PS1) Ayounsi: old esams cleanup [homer/public] - https://gerrit.wikimedia.org/r/949623 (https://phabricator.wikimedia.org/T329219) [06:25:05] (CR) Cathal Mooney: [C: +1] "LGTM!" 
[homer/public] - https://gerrit.wikimedia.org/r/949623 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi) [06:25:38] (CR) Ayounsi: [C: +2] old esams cleanup [homer/public] - https://gerrit.wikimedia.org/r/949623 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi) [06:26:10] (Merged) jenkins-bot: old esams cleanup [homer/public] - https://gerrit.wikimedia.org/r/949623 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi) [06:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:31:39] (JobUnavailable) firing: (2) Reduced availability for job trafficserver-text in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:38:57] (CR) Muehlenhoff: [C: +1] esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi) [06:39:51] (CR) Muehlenhoff: [C: +2] Make install3003 the new install server for esams [puppet] - https://gerrit.wikimedia.org/r/949552 (https://phabricator.wikimedia.org/T344355) (owner: Muehlenhoff) [06:42:06] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [06:48:03] (PS1) Muehlenhoff: Add new dummy keytab for install3003 [labs/private] - https://gerrit.wikimedia.org/r/949624 [06:49:26] (PS2) Muehlenhoff: Add new dummy keytab for install3003 and remove install3002 [labs/private] - https://gerrit.wikimedia.org/r/949624 [06:50:01] (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:59:04] <_joe_> !log updated vopsbot on the icinga hosts T344316 [06:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:08] T344316: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 [07:00:06] Amir1, apergos, and jnuche: (Dis)respected human, time to deploy UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T0700). Please do the needful. [07:00:06] aanzx: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:58] morning! there are no trainees signed up for today's morning backport window. I do see two patches, one a cherry pick from a patch not yet +2'ed, so I have concerns about that ( unsure of irc handle? ) and the other labelled as a wip and needing code review, so I have concerns about that too ( aanzx ) [07:01:39] apergos: marked it as active [07:02:18] (CR) JMeybohm: [C: +1] miscweb: add wikiworkshop and reasearch-landing-page to eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/948998 (https://phabricator.wikimedia.org/T334511) (owner: Jelto) [07:02:51] SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T344405 (ppenloglou) [07:03:15] aanzx: thanks. 
we don't code review patches here initially, we just +2 them for deployment after a review as a general rule [07:03:22] so that still needs to be done for your config change [07:03:31] SRE, SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (ppenloglou) [07:04:16] if anyone knows the other patch owner (User:Dreamy Jazz) on irc, please speak up now, I don't know their nick here [07:04:30] Dreamy_Jazz: ^^ [07:04:49] \o [07:04:50] I just updated deployment patch (It had no name) [07:04:57] Apologies. Didn't hear a ping. [07:05:32] (PS1) Muehlenhoff: Point the esams webproxy to install3003 [dns] - https://gerrit.wikimedia.org/r/949628 (https://phabricator.wikimedia.org/T344355) [07:05:40] Dreamy_Jazz: your patch set for deployment is cherry picked from a patch without code review in master, so you need to get that sorted, or someone does, before deployment [07:06:10] SRE, SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (ppenloglou) Hey @Marostegui ! Just an FYI I've made a [[ https://phabricator.wikimedia.org/T344405 | new request ]] here, relevant to what we did in this request. 
[07:06:23] Was hoping to find zabe and/or taavi today at wikimania to get it merged, but didn't see them [07:06:37] The issue is currently breaking Special:Investigate [07:06:43] (CR) Anzx: [C: +1] suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/949568 (https://phabricator.wikimedia.org/T344314) (owner: Anzx) [07:06:54] (CR) Ayounsi: [C: +2] esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi) [07:07:27] (Merged) jenkins-bot: esams: update dhcp_server [homer/public] - https://gerrit.wikimedia.org/r/949621 (https://phabricator.wikimedia.org/T329219) (owner: Ayounsi) [07:07:35] (CR) Muehlenhoff: [C: +2] Point the esams webproxy to install3003 [dns] - https://gerrit.wikimedia.org/r/949628 (https://phabricator.wikimedia.org/T344355) (owner: Muehlenhoff) [07:08:11] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:08:54] Dreamy_Jazz: it will have to wait, there is a later window if you can't find them yet [07:09:11] I've just seen taavi. Will ask them now. [07:09:39] ok! 
[07:10:10] SRE, SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T335353 (Marostegui) Thanks - it will be processed by our [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty | clinic duty ]] assigned person this week :) [07:10:40] (CR) JMeybohm: aux: add tlsHostnames for jaeger collector and query (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: Filippo Giunchedi) [07:11:17] taavi has +2'd the patch [07:11:27] apergos: (for the above) [07:11:34] ok [07:12:14] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on install3002.wikimedia.org with reason: decom in progress [07:12:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on install3002.wikimedia.org with reason: decom in progress [07:12:42] SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=756bda9d-0fe5-407f-8e34-35d788d9ab8c) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and... [07:13:04] (CR) Minato826: [C: +1] suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/949568 (https://phabricator.wikimedia.org/T344314) (owner: Anzx) [07:15:48] apergos: are you deploying the backports or should I? 
[07:16:05] I was going to but feel free to do the one for Dreamy_Jazz if you like [07:16:10] taavi: [07:16:19] will do, given they're sitting right next to me [07:16:28] oh :-D sure then [07:16:53] (CR) Majavah: [C: +2] Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949574 (https://phabricator.wikimedia.org/T344403) (owner: Dreamy Jazz) [07:16:55] (CR) Majavah: [C: +2] Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949573 (https://phabricator.wikimedia.org/T344403) (owner: Dreamy Jazz) [07:17:06] just lemme know when you are done, I am looking at aanzx's patch now [07:17:51] or I can do that while the checkuser patches are still in CI [07:18:05] aanzx: still around? [07:19:29] (PS1) Majavah: Set WRITE_BOTH for OAuth multiple devices to checkuserwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/949629 (https://phabricator.wikimedia.org/T242031) [07:20:38] taavi: [07:20:47] let me dm you [07:22:41] so just to clarify for people following along [07:22:44] (CR) Jelto: [C: +2] miscweb: add wikiworkshop and reasearch-landing-page to eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/948998 (https://phabricator.wikimedia.org/T334511) (owner: Jelto) [07:23:05] the job of a backport window runner is to get a patch out to production after it has gone through code review [07:23:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install3003.wikimedia.org [07:23:21] while this might seem like a formality, it's there for a reason [07:23:33] (Merged) jenkins-bot: miscweb: add wikiworkshop and reasearch-landing-page to eqiad and codfw [deployment-charts] - https://gerrit.wikimedia.org/r/948998 (https://phabricator.wikimedia.org/T334511) (owner: Jelto) [07:24:00] hi, is there still place for a config 
change? [07:24:10] and we do want the review to be meaningful, so self-review or review from someone who hasn't ever reviewed things, just to get the +1 on there, doesn't pass the bar. this is not red tape for its own sake, we're just trying to ... [07:24:27] be good about having code vetted before it goes live. thanks! [07:24:34] koi: yes, let's see it! [07:26:18] apergos, added to the calendar [07:26:48] apergos: it's always good to be confident about what you're deploying [07:27:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install3003.wikimedia.org [07:27:51] koi: anybody that can +1 that? [07:28:07] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [07:29:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:30:10] apergos: I can check and +1 in a few minutes [07:30:20] I'm just reading the task. It's a permissions change. [07:30:30] I mean it's straightforward enough [07:30:51] koi: consensus? [07:31:01] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [07:31:03] apergos: I don't see any community discussion linked though. 
[07:31:07] there's a link in the description [07:31:47] (CR) RhinosF1: [C: +1] "code wise fine, please add consensus link to task" [mediawiki-config] - https://gerrit.wikimedia.org/r/949612 (https://phabricator.wikimedia.org/T344398) (owner: Stang) [07:31:58] (Merged) jenkins-bot: Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/949574 (https://phabricator.wikimedia.org/T344403) (owner: Dreamy Jazz) [07:32:01] (Merged) jenkins-bot: Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate [extensions/CheckUser] (wmf/1.41.0-wmf.22) - https://gerrit.wikimedia.org/r/949573 (https://phabricator.wikimedia.org/T344403) (owner: Dreamy Jazz) [07:32:02] !log gehel@cumin1001 conftool action : set/pooled=no; selector: name=cloudelastic1006.wikimedia.org [07:32:03] Ah yes at bottom [07:32:17] thanks a lot RhinosF1 [07:32:20] !log restarting elasticsearch on cloudelastic1006 (high GC) [07:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:38] koi: that looks to exist [07:32:42] So I'm happy [07:32:44] apergos: [07:33:00] !log taavi@deploy1002 Started scap: Backport for [[gerrit:949573|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]], [[gerrit:949574|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]] [07:33:03] T344403: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to the local wiki, but it belongs to 'wikidatawiki' - https://phabricator.wikimedia.org/T344403 [07:34:43] !log taavi@deploy1002 dreamyjazz and taavi: Backport for [[gerrit:949573|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]], [[gerrit:949574|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, 
mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible vi [07:34:43] a k8s-experimental XWD option) [07:35:28] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install3002.wikimedia.org [07:35:36] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [07:36:03] !log taavi@deploy1002 dreamyjazz and taavi: Continuing with sync [07:36:03] Test complete. [07:36:39] (JobUnavailable) resolved: (2) Reduced availability for job trafficserver-text in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:37:08] (ProbeDown) resolved: (14) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:37:19] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [07:37:22] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fae8ef40280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wiki [07:37:22] imedia.org/wiki/Search%23Administration [07:38:15] ^ restart in progress (taking longer than expected), should resolve in a few minutes [07:38:52] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 844, active_shards: 1396, relocating_shards: 0, initializing_shards: 30, unassigned_shards: 185, 
delayed_unassigned_shards: 0, number_of_pending_tasks: 6, number_of_in_ [07:38:53] etch: 0, task_max_waiting_in_queue_millis: 8341, active_shards_percent_as_number: 86.6542520173805 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:39:15] taavi: are you done scapping things around? can I proceed with aanzx's patch? [07:39:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:39:24] RECOVERY - Check systemd state on ml-serve2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:26] !log gehel@cumin1001 conftool action : set/pooled=yes; selector: name=cloudelastic1006.wikimedia.org [07:39:32] scap's still working on it, just a moment [07:39:39] ah, my bad, no worries [07:40:15] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:40:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:43:31] er not aanzx's patch, my bad, that would be koi's patch, coming up next (and last) [07:44:26] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:949573|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]], [[gerrit:949574|Use UserIdentity::LOCAL in PreliminaryCheckService when appropriate (T344403)]] (duration: 11m 25s) [07:44:30] T344403: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to the local wiki, but it belongs 
to 'wikidatawiki' - https://phabricator.wikimedia.org/T344403 [07:45:29] apergos: now done [07:45:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:45:35] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [07:45:36] sweet! [07:45:46] koi, proceeding with your patch now [07:45:56] got it [07:46:06] (CR) ArielGlenn: [C: +2] zhwiki: Create abusefilter-helper group [mediawiki-config] - https://gerrit.wikimedia.org/r/949612 (https://phabricator.wikimedia.org/T344398) (owner: Stang) [07:46:45] (Merged) jenkins-bot: zhwiki: Create abusefilter-helper group [mediawiki-config] - https://gerrit.wikimedia.org/r/949612 (https://phabricator.wikimedia.org/T344398) (owner: Stang) [07:47:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:47] !log ariel@deploy1002 Started scap: Backport for [[gerrit:949612|zhwiki: Create abusefilter-helper group (T344398)]] [07:47:50] T344398: Create abusefilter helper group on zhwiki - https://phabricator.wikimedia.org/T344398 [07:48:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install3002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [07:48:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:48:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts install3002.wikimedia.org [07:48:23] SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams 
VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install3002.wikimedia.org` - install3002.wikimedia.org (**FAIL**) -... [07:49:37] !log ariel@deploy1002 stang and ariel: Backport for [[gerrit:949612|zhwiki: Create abusefilter-helper group (T344398)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:49:52] koi: please test your change on mwdebug1002 [07:49:57] looking [07:49:58] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [07:51:22] apergos, i tested in https://zh.wikipedia.org/wiki/Special:Listgrouprights and it looks good [07:51:38] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:51:40] proceeding [07:51:44] !log ariel@deploy1002 stang and ariel: Continuing with sync [07:53:43] (03PS5) 10Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) [07:57:02] watching php-fpm restarts is the new zuul-watching [07:57:31] (03PS6) 10Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) [07:57:40] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [07:58:10] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve2005 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:58:11] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:59:06] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:949612|zhwiki: Create abusefilter-helper group (T344398)]] (duration: 11m 18s) [07:59:10] T344398: Create abusefilter helper group on zhwiki - https://phabricator.wikimedia.org/T344398 [07:59:22] koi: your change is live in production, please test it there [07:59:35] (03PS7) 10Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) [08:00:05] apergos, it works well, thanks! [08:00:19] great! and with that, today's backport window comes to a close [08:01:39] !log UTC morning backport and config window done [08:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:46] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:03:13] (03PS8) 10Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) [08:03:44] (03PS9) 10Gehel: [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) [08:04:09] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [08:04:41] * kart_ updating cxserver [08:05:25] (03CR) 
10Muehlenhoff: [C: 03+2] "Acked by Valentín on IRC; I'll go ahead and deploy." [puppet] - 10https://gerrit.wikimedia.org/r/949558 (https://phabricator.wikimedia.org/T344355) (owner: 10Ssingh) [08:05:52] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-08-14-091804-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/949619 (https://phabricator.wikimedia.org/T336683) (owner: 10KartikMistry) [08:06:41] (03Merged) 10jenkins-bot: Update cxserver to 2023-08-14-091804-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/949619 (https://phabricator.wikimedia.org/T336683) (owner: 10KartikMistry) [08:07:15] 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10Clement_Goubert) 05In progress→03Resolved [08:07:24] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [08:07:46] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [08:08:36] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir3002.esams.wmnet [08:11:22] (03CR) 10Clément Goubert: [C: 03+2] admin: add fab to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/948693 (https://phabricator.wikimedia.org/T343957) (owner: 10Cwhite) [08:13:17] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:13:30] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10Clement_Goubert) 05In progress→03Resolved a:03Clement_Goubert Patch merged, the access should be deployed by puppet in the next half-hour. Boldly resolving, feel... 
[08:14:03] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [08:14:36] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [08:14:47] (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:15:23] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:16:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:16:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:16:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ncredir3002.esams.wmnet [08:16:33] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir3002.esams.wmnet` - ncredir3002.esams.wmnet (**FAIL**) - Dow... [08:16:52] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [08:16:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Clement_Goubert) @Mabualruz The out of band verification of your SSH public key is still required as well. 
[08:17:05] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ncredir3001.esams.wmnet [08:17:24] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [08:18:30] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [08:19:41] (03CR) 10Jbond: [C: 03+2] release: add additional instructions [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/949527 (owner: 10Jbond) [08:21:26] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10Clement_Goubert) @lojo_wmde I can see L3 has been signed, however we still need your public SSH key, both here on the ticket and on your wikitech user page for out-of-band verific... [08:21:37] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:23:35] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus3002.esams.wmnet [08:23:44] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [08:24:16] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3001.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:24:47] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [08:25:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir3001.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:25:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:25:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir3001.esams.wmnet [08:25:45] 10SRE, 10ops-knams, 10DC-Ops, 
10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ncredir3001.esams.wmnet` - ncredir3001.esams.wmnet (**PASS**) - Dow... [08:28:00] !log filippo@cumin1001 START - Cookbook sre.dns.netbox [08:29:07] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti3001.esams.wmnet [08:29:18] !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:29:19] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts prometheus3002.esams.wmnet [08:29:30] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: `prometheus3002.esams.wmnet` - prometheus3002.esams.wmnet (**FAIL*... [08:31:22] (03PS1) 10Muehlenhoff: Remove Puppet references for ganeti3001-3003 [puppet] - 10https://gerrit.wikimedia.org/r/949836 (https://phabricator.wikimedia.org/T344363) [08:31:49] !log Updated cxserver to 2023-08-14-091804-production (T336683, T343211) [08:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:55] T336683: Enable MinT support for languages with no Wikipedia yet - https://phabricator.wikimedia.org/T336683 [08:31:55] T343211: Enable Content and Section translation on 12 Wikipedias - https://phabricator.wikimedia.org/T343211 [08:31:59] (Sorry, forgot to log ^^ earlier) [08:33:00] 10SRE-OnFire, 10Incident Tooling, 10User-Joe: vopsbot incorrectly handles users with multiple teams - https://phabricator.wikimedia.org/T344316 (10Joe) 05Open→03Resolved [08:33:06] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10Clement_Goubert) [08:33:15] 10SRE, 10ops-knams, 
10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10fgiunchedi) [08:34:29] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:38:41] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3001.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:40:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3001.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:40:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:40:59] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti3001.esams.wmnet [08:41:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:41:09] !log filippo@cumin1001 START - Cookbook sre.ganeti.makevm for new host prometheus3003.esams.wmnet [08:41:10] !log filippo@cumin1001 START - Cookbook sre.dns.netbox [08:41:55] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti3002.esams.wmnet [08:42:37] (03CR) 10Muehlenhoff: [C: 03+2] Remove Puppet references for ganeti3001-3003 [puppet] - 10https://gerrit.wikimedia.org/r/949836 (https://phabricator.wikimedia.org/T344363) (owner: 10Muehlenhoff) [08:43:12] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3003.esams.wmnet - filippo@cumin1001" [08:43:37] (03PS1) 10Filippo Giunchedi: Out with prometheus3002, in with prometheus3003 [puppet] - 10https://gerrit.wikimedia.org/r/949837 (https://phabricator.wikimedia.org/T344355) [08:43:56] !log filippo@cumin1001 
END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus3003.esams.wmnet - filippo@cumin1001" [08:43:57] !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:43:57] !log filippo@cumin1001 START - Cookbook sre.dns.wipe-cache prometheus3003.esams.wmnet on all recursors [08:44:00] !log filippo@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus3003.esams.wmnet on all recursors [08:44:20] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3003.esams.wmnet - filippo@cumin1001" [08:44:25] (03CR) 10Clément Goubert: [C: 03+1] httpd-fcgi: fix double logging issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947829 (https://phabricator.wikimedia.org/T340935) (owner: 10Giuseppe Lavagetto) [08:44:45] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:22] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus3003.esams.wmnet - filippo@cumin1001" [08:45:36] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host prometheus3003.esams.wmnet with OS bullseye [08:45:41] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [08:45:51] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host 
prometheus3003.esams.wmnet with OS bullseye [08:47:24] (03PS1) 10Filippo Giunchedi: wmnet: use prometheus3003 in esams [dns] - 10https://gerrit.wikimedia.org/r/949838 (https://phabricator.wikimedia.org/T344355) [08:47:40] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:49:38] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:51:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3002.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:51:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:51:01] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti3002.esams.wmnet [08:51:51] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti3003.esams.wmnet [08:53:40] (03CR) 10Clément Goubert: "Questions inline" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) (owner: 10Giuseppe Lavagetto) [08:54:09] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:55:44] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10Clement_Goubert) SSH key confirmed through https://wikitech.wikimedia.org/wiki/User:Panagiotis_Penloglou No group membership necessary as per T335353 To be completely... 
[08:55:46] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:55:56] 10SRE, 10SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10Clement_Goubert) [08:55:59] (03PS1) 10Clément Goubert: admin: New ssh key for ppenloglou [puppet] - 10https://gerrit.wikimedia.org/r/949839 (https://phabricator.wikimedia.org/T344405) [08:58:25] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:54] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10Clement_Goubert) 05Open→03In progress [09:00:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/949837 (https://phabricator.wikimedia.org/T344355) (owner: 10Filippo Giunchedi) [09:00:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, sql_labm SSH key entry, Kerberos Principal, Team Shell (posix) membership for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10Clement_Goubert) 05Open→03In progress [09:01:29] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (10Clement_Goubert) 05Open→03In progress [09:01:40] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Clement_Goubert) 05Open→03In progress [09:01:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/949838 (https://phabricator.wikimedia.org/T344355) (owner: 10Filippo Giunchedi) [09:02:22] (03CR) 10Filippo Giunchedi: [C: 03+2] Out with prometheus3002, in with prometheus3003 [puppet] - 10https://gerrit.wikimedia.org/r/949837 
(https://phabricator.wikimedia.org/T344355) (owner: 10Filippo Giunchedi) [09:02:52] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:03:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti3003.esams.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:03:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:03:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti3003.esams.wmnet [09:04:57] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: use prometheus3003 in esams [dns] - 10https://gerrit.wikimedia.org/r/949838 (https://phabricator.wikimedia.org/T344355) (owner: 10Filippo Giunchedi) [09:09:56] (03PS1) 10Muehlenhoff: netbox: Disable ganeti sync for old esams cluster [puppet] - 10https://gerrit.wikimedia.org/r/949841 [09:10:22] (03PS8) 10Jbond: role::puppetserver: Add config master [puppet] - 10https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) [09:10:50] (03Abandoned) 10Jbond: role::puppetserver: Add config master [puppet] - 10https://gerrit.wikimedia.org/r/937518 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [09:11:07] (03CR) 10Jbond: [C: 03+2] puppetmaster: stop creating the volatile/misc folder [puppet] - 10https://gerrit.wikimedia.org/r/949554 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [09:11:28] (03CR) 10Filippo Giunchedi: [C: 03+1] netbox: Disable ganeti sync for old esams cluster [puppet] - 10https://gerrit.wikimedia.org/r/949841 (owner: 10Muehlenhoff) [09:12:31] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:58] (03CR) 10Muehlenhoff: [C: 03+2] netbox: Disable ganeti sync for old esams cluster [puppet] - 10https://gerrit.wikimedia.org/r/949841 (owner: 10Muehlenhoff) [09:13:58] (03PS1) 10Jelto: trafficserver: switch wikiworkshop.org and research.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/949842 (https://phabricator.wikimedia.org/T334511) [09:19:11] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus3003.esams.wmnet with reason: host reimage [09:19:53] PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:20:39] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:22:32] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus3003.esams.wmnet with reason: host reimage [09:22:41] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for ganeti01.svc.esams.wmnet - cmooney@cumin1001" [09:22:42] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 16 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent [09:23:31] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for ganeti01.svc.esams.wmnet - cmooney@cumin1001" [09:23:31] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:23:50] (03PS1) 10Jbond: P:config-master: proixy_sha1 variable needs to be added to vhost_settings [puppet] - 10https://gerrit.wikimedia.org/r/949844 (https://phabricator.wikimedia.org/T341717) [09:24:11] (03PS1) 10Jelto: 
miscweb: add www.wikiworkshop.org to extraFQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/949845 (https://phabricator.wikimedia.org/T334511) [09:24:28] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:24:42] (03PS2) 10Jbond: P:config-master: proixy_sha1 variable needs to be added to vhost_settings [puppet] - 10https://gerrit.wikimedia.org/r/949844 (https://phabricator.wikimedia.org/T341717) [09:26:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42906/console" [puppet] - 10https://gerrit.wikimedia.org/r/949844 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [09:26:58] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for ganeti01.svc.esams.wmnet - cmooney@cumin1001" [09:27:43] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for ganeti01.svc.esams.wmnet - cmooney@cumin1001" [09:27:43] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:28:50] 10sre-alert-triage, 10Data-Platform-SRE: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (10gmodena) That’s a false positive, we don’t have active traffic in codfw. There was WIP to fix it before I went on PTO but I guess i... 
[09:28:57] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1106.eqiad.wmnet with OS bullseye [09:29:02] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1107.eqiad.wmnet with OS bullseye [09:30:45] !log btullis@deploy1002 Started deploy [airflow-dags/analytics@ff0a21b]: (no justification provided) [09:31:08] !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@ff0a21b]: (no justification provided) (duration: 00m 22s) [09:32:07] (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:32:59] looking [09:34:49] _joe_: shall we wait a bit on this alert? it is on esams [09:35:15] !log temporarily pooling kartotherian on codfw [09:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:04] yes I thought site=esams had been silenced, it wasn't [09:36:09] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:config-master: proixy_sha1 variable needs to be added to vhost_settings [puppet] - 10https://gerrit.wikimedia.org/r/949844 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [09:36:16] effie _joe_ that's the prometheus host coming online, I'll silence [09:36:32] cool godog tx [09:36:36] (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:36:53] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus3003.esams.wmnet with OS bullseye [09:36:53] !log filippo@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host prometheus3003.esams.wmnet [09:36:55] <_joe_> effie: I still didn't get the pages... 
[09:37:05] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host prometheus3003.esams.wmnet with OS bullseye completed: - promethe... [09:37:34] <_joe_> sorry if I didn't react earlier [09:37:42] _joe_: it has not been 5' already so [09:37:54] <_joe_> effie: well you were paged right? [09:38:02] <_joe_> as in the page was delivered to you [09:39:16] (03PS1) 10Hnowlan: rest-gateway: add trafficserver-side mangling to rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/949846 (https://phabricator.wikimedia.org/T344358) [09:39:45] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [09:40:13] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Move config-master to dedicated VMs - https://phabricator.wikimedia.org/T341717 (10jbond) [09:40:26] <_joe_> oh wait, this did not send a page to victorops [09:40:34] (03PS1) 10Jbond: config-master: add proxy modules to httpd [puppet] - 10https://gerrit.wikimedia.org/r/949847 (https://phabricator.wikimedia.org/T341717) [09:40:38] <_joe_> !incidents [09:40:39] 3951 (ACKED) [12x] ProbeDown sre (probes/service esams) [09:40:51] <_joe_> yep that is indeed old [09:42:13] _joe_: no, I saw irc [09:42:32] <_joe_> oh, ok. 
[09:42:32] (03CR) 10Jbond: [C: 03+2] config-master: add proxy modules to httpd [puppet] - 10https://gerrit.wikimedia.org/r/949847 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [09:43:19] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1106.eqiad.wmnet with reason: host reimage [09:44:04] (03CR) 10David Caro: [V: 03+1 C: 03+2] p:tlsproxy::envoy: pass through the ensure option [puppet] - 10https://gerrit.wikimedia.org/r/949530 (https://phabricator.wikimedia.org/T344242) (owner: 10David Caro) [09:46:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1106.eqiad.wmnet with reason: host reimage [09:46:52] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42910/console" [puppet] - 10https://gerrit.wikimedia.org/r/949847 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [09:48:29] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42911/console" [puppet] - 10https://gerrit.wikimedia.org/r/949847 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [09:49:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10ppenloglou) Thank you @Clement_Goubert ! 
I'll let Danny know but he's OOO this week, so we can pick this up next week ;) [09:51:42] (03PS1) 10Urbanecm: [beta] Growth: Enable user research opt-in checkbox on few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949849 (https://phabricator.wikimedia.org/T342353) [09:55:22] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: update application image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949852 (https://phabricator.wikimedia.org/T344324) [09:55:37] (03PS1) 10Ayounsi: Homer: remove all mentions of old esams [homer/public] - 10https://gerrit.wikimedia.org/r/949853 (https://phabricator.wikimedia.org/T329219) [09:57:41] (03PS1) 10David Caro: toolsdb: add skipped table to the config [puppet] - 10https://gerrit.wikimedia.org/r/949854 (https://phabricator.wikimedia.org/T344411) [09:58:33] (03PS1) 10Jbond: configmaster: pass puppet_ca_server via vhost settings [puppet] - 10https://gerrit.wikimedia.org/r/949855 (https://phabricator.wikimedia.org/T341717) [09:58:46] (03CR) 10CI reject: [V: 04-1] configmaster: pass puppet_ca_server via vhost settings [puppet] - 10https://gerrit.wikimedia.org/r/949855 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [10:00:05] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1000). 
[10:00:06] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1000) [10:00:27] effie _joe_ I'll resolve incident 3951 since it is acked only and will re-page if left alone [10:00:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42913/console" [puppet] - 10https://gerrit.wikimedia.org/r/949855 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [10:00:38] godog: <3 [10:00:45] <_joe_> godog: yeah I assumed it would be solved today [10:01:11] yeah defo not [10:01:44] (03PS1) 10Urbanecm: revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949576 (https://phabricator.wikimedia.org/T316079) [10:02:12] (03PS1) 10Urbanecm: revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949577 (https://phabricator.wikimedia.org/T316079) [10:02:53] (03CR) 10Clément Goubert: [C: 03+1] Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [10:04:08] (03CR) 10Effie Mouzeli: [C: 03+1] "The cassandra script looks ok, but take this with a grain of salt" [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [10:05:43] (03PS5) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) [10:07:07] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: update application image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949852 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [10:07:15] 
(03CR) 10EoghanGaffney: [C: 03+1] trafficserver: switch wikiworkshop.org and research.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/949842 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [10:07:18] (03CR) 10Effie Mouzeli: [C: 03+2] Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [10:07:42] (03PS2) 10Jbond: configmaster: pass puppet_ca_server via vhost settings [puppet] - 10https://gerrit.wikimedia.org/r/949855 (https://phabricator.wikimedia.org/T341717) [10:07:50] (03Merged) 10jenkins-bot: tegola-vector-tiles: update application image [deployment-charts] - 10https://gerrit.wikimedia.org/r/949852 (https://phabricator.wikimedia.org/T344324) (owner: 10Effie Mouzeli) [10:08:02] (03Merged) 10jenkins-bot: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli) [10:08:27] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [10:09:11] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1107.eqiad.wmnet with OS bullseye [10:09:59] !log installing ghostscript security updates [10:10:01] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [10:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1106.eqiad.wmnet with OS bullseye [10:10:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1107.eqiad.wmnet with OS bullseye [10:12:24] (03PS1) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) [10:13:07] !log jiji@deploy1002 helmfile [staging] START 
helmfile.d/services/citoid: apply [10:13:38] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:13:43] (03PS1) 10Giuseppe Lavagetto: python3 compatibility [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949932 [10:13:45] (03PS1) 10Giuseppe Lavagetto: Convert to python3, bullseye [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949933 [10:13:54] (03CR) 10CI reject: [V: 04-1] knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [10:14:08] (03PS2) 10Giuseppe Lavagetto: Convert to python3, bullseye [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949933 [10:14:20] (03Abandoned) 10Giuseppe Lavagetto: python3 compatibility [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949932 (owner: 10Giuseppe Lavagetto) [10:15:32] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [10:16:05] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [10:16:55] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10Fabfur) [10:17:23] (03PS1) 10Ayounsi: Remove all mentions of old-esams, replace with new esams [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) [10:19:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949933 (owner: 10Giuseppe Lavagetto) [10:20:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Convert to python3, bullseye [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949933 (owner: 10Giuseppe Lavagetto) [10:21:41] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [10:22:20] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply 
[10:22:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3007.esams.wmnet to cluster esams01 and group BY27 [10:23:04] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3007.esams.wmnet to cluster esams01 and group BY27 [10:23:11] !log depool kartotherian (maps) codfw - T344324 [10:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:14] T344324: Maps Unavailability (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 [10:23:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:23:23] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [10:23:27] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1107.eqiad.wmnet with reason: host reimage [10:23:34] (03CR) 10Ayounsi: "Running PCC on all the fleet: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42915/" [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [10:26:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1107.eqiad.wmnet with reason: host reimage [10:28:16] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:28:31] (03CR) 10Filippo Giunchedi: "Rules LGTM, I'm adding Keith since AFAICS the sli/slo rules live in modules/profile/files/thanos/recording_rules.yaml (i.e. 
they are globa" [puppet] - 10https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [10:28:39] (03PS1) 10Muehlenhoff: Reinstate ganeti Netbox sync for esams01 [puppet] - 10https://gerrit.wikimedia.org/r/949935 [10:29:10] <_joe_> uhm high errors on parsoid [10:30:49] so far I only see timeouts [10:31:08] <_joe_> yep [10:31:39] <_joe_> and OOMs [10:32:03] (03CR) 10Filippo Giunchedi: aux: add tlsHostnames for jaeger collector and query (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [10:32:35] (03PS2) 10Filippo Giunchedi: aux: add tlsHostnames for jaeger collector and query [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) [10:33:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:34:45] (03CR) 10Cathal Mooney: [C: 03+1] "lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/949853 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [10:35:13] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [10:35:13] (03CR) 10Filippo Giunchedi: [C: 03+1] Reinstate ganeti Netbox sync for esams01 [puppet] - 10https://gerrit.wikimedia.org/r/949935 (owner: 10Muehlenhoff) [10:36:05] (03CR) 10Ayounsi: [C: 03+2] Homer: remove all mentions of old esams [homer/public] - 10https://gerrit.wikimedia.org/r/949853 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [10:36:43] (03Merged) 10jenkins-bot: Homer: remove all mentions of old esams [homer/public] - 10https://gerrit.wikimedia.org/r/949853 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [10:37:37] (03CR) 10Muehlenhoff: [C: 03+2] Reinstate ganeti Netbox sync for esams01 [puppet] - 10https://gerrit.wikimedia.org/r/949935 (owner: 10Muehlenhoff) [10:44:28] (03CR) 10Hnowlan: [C: 03+2] deployment_server: add new service geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [10:45:10] (03PS1) 10Ssingh: 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219) [10:46:05] (03CR) 10CI reject: [V: 04-1] 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [10:47:45] (03PS2) 10Ssingh: 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219) [10:49:18] authdns-update is currently broken, fixing it ^ [10:49:25] just as an FYI [10:49:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1107.eqiad.wmnet with OS bullseye [10:50:01] (NodeTextfileStale) firing: (5) Stale textfile for maps2005:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:53:29] (03CR) 10Ayounsi: 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [10:53:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti3007.esams.wmnet to cluster esams01 and group BY27 [10:53:42] (03CR) 10Ayounsi: [C: 03+1] 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [10:53:51] (03CR) 10Ssingh: [C: 03+2] 10.in-addr.arpa: remove include for 0.20.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949938 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [10:54:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti3007.esams.wmnet to cluster esams01 and group BY27 [10:54:18] !log run authdns-update for CR 949938 [10:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:18] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: removing DNS names for ae1-103.cr2-esams and vrrp-gw-103 - sukhe@cumin2002" [10:56:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: removing DNS names for ae1-103.cr2-esams and vrrp-gw-103 - sukhe@cumin2002" [10:56:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:02:33] (03PS1) 10Muehlenhoff: Add ncredir300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949941 (https://phabricator.wikimedia.org/T344355) [11:03:09] (03CR) 10Ssingh: [C: 03+1] Add 
ncredir300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949941 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [11:04:41] (03PS1) 10Giuseppe Lavagetto: Adapt the control file to bullseye [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949942 [11:04:51] (03CR) 10Clément Goubert: [C: 03+2] admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [11:04:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Adapt the control file to bullseye [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/949942 (owner: 10Giuseppe Lavagetto) [11:04:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3003.esams.wmnet [11:05:00] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:05:11] (03PS2) 10Giuseppe Lavagetto: httpd-fcgi: fix double logging issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947829 (https://phabricator.wikimedia.org/T340935) [11:05:13] (03PS3) 10Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) [11:05:40] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd-fcgi: fix double logging issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947829 (https://phabricator.wikimedia.org/T340935) (owner: 10Giuseppe Lavagetto) [11:05:56] (03CR) 10Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) (owner: 10Giuseppe Lavagetto) [11:06:56] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3003.esams.wmnet - jmm@cumin2002" 
[11:07:09] (03PS2) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) [11:07:12] (03Merged) 10jenkins-bot: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [11:07:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3003.esams.wmnet - jmm@cumin2002" [11:07:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:07:45] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3003.esams.wmnet on all recursors [11:07:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3003.esams.wmnet on all recursors [11:07:51] (03PS4) 10Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) [11:08:04] (03CR) 10CI reject: [V: 04-1] knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [11:08:10] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3003.esams.wmnet - jmm@cumin2002" [11:08:53] (03CR) 10Giuseppe Lavagetto: httpd-fcgi: de-quote unicode characters in logs (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) (owner: 10Giuseppe Lavagetto) [11:08:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
ncredir3003.esams.wmnet - jmm@cumin2002" [11:09:12] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd-fcgi: de-quote unicode characters in logs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/948139 (https://phabricator.wikimedia.org/T340935) (owner: 10Giuseppe Lavagetto) [11:09:26] (03PS3) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) [11:10:24] (03CR) 10CI reject: [V: 04-1] knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [11:10:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3003.esams.wmnet with OS bullseye [11:10:59] (03PS4) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) [11:11:04] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:11:32] <_joe_> jouncebot: nowandnext [11:11:32] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [11:11:32] In 0 hour(s) and 48 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1200) [11:11:43] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:12:05] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:12:42] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[11:12:51] (03PS1) 10MVernon: hiera: add swift user search_update_pipeline [puppet] - 10https://gerrit.wikimedia.org/r/949943 (https://phabricator.wikimedia.org/T342620) [11:12:55] (03PS1) 10MVernon: hiera: add fake credential for swift user search_update_pipeline [labs/private] - 10https://gerrit.wikimedia.org/r/949944 (https://phabricator.wikimedia.org/T342620) [11:13:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:06] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:15:07] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/949587 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [11:16:20] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:16:28] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:16:38] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:57] (03PS1) 10Hnowlan: aqs: enable geo_analytics user [puppet] - 10https://gerrit.wikimedia.org/r/949947 (https://phabricator.wikimedia.org/T336400) [11:17:28] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10SDeckelmann-WMF) I approve. [11:19:53] (03PS5) 10Ssingh: knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) [11:21:06] (03CR) 10Ssingh: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [11:21:55] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[11:22:09] (03CR) 10Btullis: "Removing my +1 because I've just thought of something that's going to make it fail." [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [11:22:28] (03PS4) 10Clément Goubert: Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [11:23:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:59] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10Clement_Goubert) [11:25:20] (03CR) 10Clément Goubert: [C: 03+2] Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [11:27:06] (03CR) 10Ssingh: Remove all mentions of old-esams, replace with new esams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [11:27:12] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:14] (03CR) 10Jbond: "follow up comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/941758 (owner: 10Volans) [11:27:43] (03Merged) 10jenkins-bot: Remove limits in ResourceQuota and container limitanges for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/948128 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [11:29:12] (03CR) 10Ssingh: Use only active authdns hosts for DNS changes (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/941758 (owner: 10Volans) [11:29:37] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:29:44] (03PS2) 10Ayounsi: Remove all mentions of old-esams, replace with new esams [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) [11:29:53] (03CR) 10Ayounsi: Remove all mentions of old-esams, replace with new esams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [11:31:28] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:31:39] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:32:18] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:32:30] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:33:58] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:34:11] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:35:00] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:36:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:10] (03CR) 10Jelto: "I'll postpone deployment of the config change to the next maintenance which requires restart/reboot." 
[puppet] - 10https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: 10Jelto) [11:39:25] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:53] (03PS1) 10Clément Goubert: mediawiki: Set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/949957 (https://phabricator.wikimedia.org/T342748) [11:47:04] (03PS1) 10Jbond: check_puppetrun: update to use failed_resources Puppet::Transaction::Report [puppet] - 10https://gerrit.wikimedia.org/r/949959 (https://phabricator.wikimedia.org/T337951) [11:49:15] (03CR) 10David Caro: [C: 03+1] "LGTM! thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/949959 (https://phabricator.wikimedia.org/T337951) (owner: 10Jbond) [11:49:40] (03CR) 10Jbond: [C: 03+2] configmaster: pass puppet_ca_server via vhost settings [puppet] - 10https://gerrit.wikimedia.org/r/949855 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [11:55:20] (03CR) 10Filippo Giunchedi: [C: 03+1] check_puppetrun: update to use failed_resources Puppet::Transaction::Report [puppet] - 10https://gerrit.wikimedia.org/r/949959 (https://phabricator.wikimedia.org/T337951) (owner: 10Jbond) [11:59:30] (03CR) 10JMeybohm: [C: 03+1] mediawiki: Set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/949957 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1200) [12:03:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncredir3003.esams.wmnet with OS bullseye [12:03:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir3003.esams.wmnet [12:04:39] !log restart jwt-authorizer service 
(docker-registry-ha-jwt.service) on registry nodes - T337474 [12:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:42] T337474: Replace deprecated `CI_JOB_JWT` CI variable in Kokkuri - https://phabricator.wikimedia.org/T337474 [12:07:14] (03PS1) 10Muehlenhoff: ncredir: Use globbing to select the partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/949962 [12:09:55] (03PS3) 10Acamicamacaraca: Enable VisualEditor in Draft and Project namespace on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432) [12:14:47] (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:16:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:18] (03PS4) 10Acamicamacaraca: Enable VisualEditor in Project and Draft namespaces on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432) [12:20:17] (03CR) 10Ayounsi: [C: 03+1] knams migration: remove references to old esams (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [12:20:48] (03CR) 10JMeybohm: [C: 03+1] miscweb: add www.wikiworkshop.org to extraFQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/949845 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [12:21:12] (03CR) 10Jelto: [C: 03+2] miscweb: add www.wikiworkshop.org to extraFQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/949845 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [12:22:05] (03Merged) 10jenkins-bot: miscweb: add www.wikiworkshop.org to extraFQDNs 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/949845 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [12:22:59] PROBLEM - config-master.wikimedia.org requires authentication on config-master1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:24:27] RECOVERY - config-master.wikimedia.org requires authentication on config-master1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:25:10] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [12:25:28] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:26:05] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [12:26:21] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [12:26:29] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:57] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [12:27:07] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [12:27:46] (03PS1) 10Urbanecm: cross-wiki userrights: Add SpecialUserRights::getDisplayUsername [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949582 (https://phabricator.wikimedia.org/T344391) [12:28:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:32:23] (03CR) 10Ayounsi: "Full PCC output: https://puppet-compiler.wmflabs.org/output/949934/42915/" [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [12:32:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:15] (03PS1) 10Jbond: config-master: enable ssl for proxies [puppet] - 10https://gerrit.wikimedia.org/r/949969 (https://phabricator.wikimedia.org/T341717) [12:34:18] (03CR) 10Jbond: [C: 03+2] check_puppetrun: update to use failed_resources Puppet::Transaction::Report [puppet] - 10https://gerrit.wikimedia.org/r/949959 (https://phabricator.wikimedia.org/T337951) (owner: 10Jbond) [12:36:46] (03CR) 10Jbond: [C: 03+2] config-master: enable ssl for proxies [puppet] - 10https://gerrit.wikimedia.org/r/949969 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [12:40:06] (03CR) 10EoghanGaffney: [C: 03+2] releases jenkins: allow Scap to disable services on secondary hosts [puppet] - 10https://gerrit.wikimedia.org/r/947814 (https://phabricator.wikimedia.org/T343447) (owner: 10Jaime Nuche) [12:41:06] jbond: Happy for me to merge your change? 
[12:45:28] (03CR) 10Ssingh: [C: 03+1] ncredir: Use globbing to select the partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/949962 (owner: 10Muehlenhoff) [12:46:36] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [12:49:58] (03CR) 10Stevemunene: datahub: Enable OIDC to idp_test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [12:55:57] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:58:03] (03CR) 10Herron: "Effie and I are planning to deploy this in codfw shortly and monitor tegola closely before making a go/no-go decision for eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [12:59:25] (03PS1) 10Ssingh: 10.in-addr.arpa: remove include for netbox/0.21.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949972 (https://phabricator.wikimedia.org/T329219) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1300). nyaa~ [13:00:05] Urbanecm, aanzx, and Aca: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:15] * Aca waves [13:00:16] i can deploy today [13:00:21] o/ [13:00:22] (03CR) 10CI reject: [V: 04-1] 10.in-addr.arpa: remove include for netbox/0.21.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949972 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:00:32] (03CR) 10Urbanecm: [C: 03+2] revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949577 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm) [13:00:38] (03CR) 10Urbanecm: [C: 03+2] revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949576 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm) [13:00:42] (03CR) 10Urbanecm: [C: 03+2] cross-wiki userrights: Add SpecialUserRights::getDisplayUsername [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949582 (https://phabricator.wikimedia.org/T344391) (owner: 10Urbanecm) [13:01:00] (03CR) 10Muehlenhoff: [C: 03+2] ncredir: Use globbing to select the partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/949962 (owner: 10Muehlenhoff) [13:01:02] (03CR) 10Urbanecm: [C: 03+2] Enable VisualEditor in Project and Draft namespaces on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432) (owner: 10Acamicamacaraca) [13:01:10] (03PS2) 10Muehlenhoff: ncredir: Use globbing to select the partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/949962 [13:01:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432) (owner: 10Acamicamacaraca) [13:01:50] (03CR) 10Muehlenhoff: [C: 03+2] Add ncredir300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949941 (https://phabricator.wikimedia.org/T344355) 
(owner: 10Muehlenhoff) [13:02:03] (03Merged) 10jenkins-bot: Enable VisualEditor in Project and Draft namespaces on shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432) (owner: 10Acamicamacaraca) [13:04:33] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949581|Enable VisualEditor in Project and Draft namespaces on shwiki (T344432)]] [13:04:40] T344432: Enable VisualEditor in Project and Draft namespaces on shwiki - https://phabricator.wikimedia.org/T344432 [13:04:56] checking now [13:05:29] !log urbanecm@deploy1002 urbanecm and aleksandar: Backport for [[gerrit:949581|Enable VisualEditor in Project and Draft namespaces on shwiki (T344432)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:05:47] Aca: it's only available at debug servers now, but please go ahead :) [13:06:07] yeah, thats what I was referring to [13:08:18] k8s build phase of scap resulted in an error. it says non-k8s deployment will proceed, but any idea how to fix that? https://www.irccloud.com/pastebin/cGx7BdzE/ [13:08:18] After refreshing, VisualEditor tab is now shown in the toolbar. [13:08:26] lgtm [13:08:35] Aca: great. waiting now on the k8s failure error. [13:09:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3003.esams.wmnet [13:09:10] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:10:07] seems it happened while building `webserver-image-base` (while running `docker build --pull --build-arg "http_proxy=http://webproxy.eqiad.wmnet:8080" --build-arg "https_proxy=http://webproxy.eqiad.wmnet:8080" -f Dockerfile.webserver-base-image -t webserver-image-base . 
`) [13:10:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:10:27] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3003.esams.wmnet on all recursors [13:10:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3003.esams.wmnet on all recursors [13:10:52] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3003.esams.wmnet - jmm@cumin2002" [13:11:13] (03CR) 10Gehel: "I spot checked the changes between the new systemd unit and the processes running on wdqs1003 (both main and categories) and on wcqs1001. " [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [13:11:40] (03PS10) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) [13:12:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3003.esams.wmnet - jmm@cumin2002" [13:12:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3003.esams.wmnet with OS bullseye [13:15:32] !log urbanecm@deploy1002 urbanecm and aleksandar: Continuing with sync [13:15:32] (03PS1) 10Ssingh: Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) [13:15:50] proceeding, the k8s bits seem to be auto-disabled by scap. filing task. 
[13:16:24] (03CR) 10Filippo Giunchedi: "Removing myself as this is on hold for now" [software/librenms] - 10https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) (owner: 10Andrea Denisse) [13:16:26] (03CR) 10CI reject: [V: 04-1] Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:17:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:19:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:19:50] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949581|Enable VisualEditor in Project and Draft namespaces on shwiki (T344432)]] (duration: 15m 16s) [13:19:54] T344432: Enable VisualEditor in Project and Draft namespaces on shwiki - https://phabricator.wikimedia.org/T344432 [13:20:39] filed the k8s bug as T344438. 
[13:20:40] T344438: scap backport fails to build a image for k8s deployment - https://phabricator.wikimedia.org/T344438 [13:20:40] (03Merged) 10jenkins-bot: revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/949577 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm) [13:21:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:21:10] (03Merged) 10jenkins-bot: revalidateLinkRecommendations: Make it possible to revalidate based on score [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949576 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm) [13:21:13] (03Merged) 10jenkins-bot: cross-wiki userrights: Add SpecialUserRights::getDisplayUsername [core] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/949582 (https://phabricator.wikimedia.org/T344391) (owner: 10Urbanecm) [13:22:01] (03CR) 10Effie Mouzeli: [C: 03+1] "Good to go for testing on codfw" [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [13:22:21] continuing with backports, as the core one fixes an UBN, which is better to have fixed at least for the non-k8s world. 
[13:23:02] (03CR) 10Herron: [C: 03+2] thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/948125 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [13:23:05] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949582|cross-wiki userrights: Add SpecialUserRights::getDisplayUsername (T344391 T255309)]], [[gerrit:949577|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]], [[gerrit:949576|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]] [13:23:11] T316079: Bump threshold for confidence score on link recommendation service suggestions - https://phabricator.wikimedia.org/T316079 [13:23:12] T255309: Remove UserRightsProxy and replace its usages with UserGroupManager - https://phabricator.wikimedia.org/T255309 [13:23:12] T344391: Interwiki user rights changes not being logged correctly - https://phabricator.wikimedia.org/T344391 [13:23:47] (03PS2) 10Ayounsi: Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:23:55] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:949582|cross-wiki userrights: Add SpecialUserRights::getDisplayUsername (T344391 T255309)]], [[gerrit:949577|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]], [[gerrit:949576|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.c [13:23:55] odfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:24:31] !log urbanecm@deploy1002 urbanecm: Continuing with sync [13:24:42] (03CR) 10CI reject: [V: 04-1] Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 
10Ssingh) [13:26:16] moritzm: Did anything change regarding gpg signing on the apt repo ? [13:26:27] (03PS3) 10Ayounsi: Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:26:36] moritzm: context https://phabricator.wikimedia.org/T344438 [13:27:24] claime: fwiw, a very similar error happened as T338952 recently. dunno how much related those two actually are, but just in case... :) [13:27:25] T338952: mwaddlink fails to build because of a missing public key - https://phabricator.wikimedia.org/T338952 [13:27:27] ah, just saw your comment urbanecm, I'll check something [13:27:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage [13:27:39] sounds good :) [13:28:19] (03CR) 10Ssingh: [C: 03+1] Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:28:52] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949582|cross-wiki userrights: Add SpecialUserRights::getDisplayUsername (T344391 T255309)]], [[gerrit:949577|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]], [[gerrit:949576|revalidateLinkRecommendations: Make it possible to revalidate based on score (T316079)]] (duration: 05m 46s) [13:28:58] T316079: Bump threshold for confidence score on link recommendation service suggestions - https://phabricator.wikimedia.org/T316079 [13:28:58] T255309: Remove UserRightsProxy and replace its usages with UserGroupManager - https://phabricator.wikimedia.org/T255309 [13:28:58] T344391: Interwiki user rights changes not being logged correctly - https://phabricator.wikimedia.org/T344391 [13:29:05] * urbanecm finished in-progress deployments now. [13:29:23] waiting for now, as i don't want to divert k8s and non-k8s worlds even more. 
[13:29:31] (03CR) 10Ayounsi: [C: 03+2] Remove PTRs for 91.198.174.0/24 and 2620:0:862::/48 [dns] - 10https://gerrit.wikimedia.org/r/949975 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:30:21] <_joe_> urbanecm: can you wait a sec please? [13:30:41] _joe_: yeah, i'm waiting. [13:31:10] <_joe_> sigh another incident on esams [13:31:19] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [13:31:22] <_joe_> XioNoX / topranks cr1-esams paging again [13:31:55] !incidents [13:31:55] 3951 (RESOLVED) [12x] ProbeDown sre (probes/service esams) [13:32:02] _joe_: hmm that’s a new box, maybe added to monitor off prematurely [13:32:10] they're both downtimed in icinga [13:32:15] <_joe_> says "snooze expired" [13:32:21] <_joe_> in victorops [13:32:23] _joe_: what's the page I don't see it here [13:32:24] ahhh [13:32:25] ok [13:32:25] <_joe_> and now resolved [13:32:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage [13:32:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:33:01] _joe_: Resolved it manually. [13:34:47] I assumed it would go away after it was down-timed but I guess not. 
[13:35:55] (03CR) 10Jbond: [C: 03+2] puppetserver: Add support for defining additional mount points [puppet] - 10https://gerrit.wikimedia.org/r/948567 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:35:59] (03CR) 10Jbond: [C: 03+2] P:puppetserver: add support for extra_mounts [puppet] - 10https://gerrit.wikimedia.org/r/948607 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:36:01] !log pooling kartotherian (maps) on codfw - T344324 [13:36:03] (03CR) 10Jbond: [C: 03+2] puppetserver: add volatile file mount [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:05] T344324: Maps Unavailability (14 Aug 2023) - https://phabricator.wikimedia.org/T344324 [13:36:22] (03CR) 10Esanders: "There is an ongoing discussion on the task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949581 (https://phabricator.wikimedia.org/T344432) (owner: 10Acamicamacaraca) [13:36:37] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [13:37:14] (03PS5) 10Jbond: puppetserver: add volatile file mount [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) [13:38:28] (03CR) 10Jbond: [C: 03+2] puppetserver: add volatile file mount (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948608 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:40:26] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:40:34] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [13:40:42] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:41:02] (03PS6) 10Ssingh: knams migration: remove references to old 
esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) [13:42:09] (03CR) 10Ssingh: "rebased to exclude changes already in master, purging of /24 and /48 PTRs" [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:42:42] (03CR) 10Ayounsi: [C: 03+1] knams migration: remove references to old esams [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [13:42:50] the maps error is being handled [13:44:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:36] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:44:46] (03PS2) 10Andrew Bogott: wmsc-backup: correct ids passed for differential image backup [puppet] - 10https://gerrit.wikimedia.org/r/949610 [13:44:46] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [13:45:08] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:45:17] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [13:45:38] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:46:02] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:46:12] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.212 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:46:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3003.esams.wmnet with 
OS bullseye [13:46:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir3003.esams.wmnet [13:46:18] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:47:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir3004.esams.wmnet [13:47:04] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:47:25] (03CR) 10Ssingh: [C: 03+1] trafficserver: update config-master to use discovery record [puppet] - 10https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [13:47:42] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:43] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [13:48:58] (03CR) 10JMeybohm: aux: add tlsHostnames for jaeger collector and query (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949504 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [13:49:10] (03CR) 10Bking: [C: 03+2] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/949503 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [13:50:37] (03CR) 10Andrew Bogott: [C: 03+2] wmsc-backup: correct ids passed for differential image backup [puppet] - 10https://gerrit.wikimedia.org/r/949610 (owner: 10Andrew Bogott) [13:50:53] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3004.esams.wmnet - jmm@cumin2002" [13:53:18] (03PS1) 10Muehlenhoff: Add durum300[34] to site.pp [puppet] - 
10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355) [13:54:17] (03CR) 10JMeybohm: "I did implement what I proposed in https://phabricator.wikimedia.org/T277876#9095795 - we can adapt the calculations if we feel reservatio" [puppet] - 10https://gerrit.wikimedia.org/r/949843 (https://phabricator.wikimedia.org/T277876) (owner: 10JMeybohm) [13:54:49] (03PS3) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [13:54:51] (03PS1) 10Jbond: puppetserver: switch to useing ca_server instead of enable_ca [puppet] - 10https://gerrit.wikimedia.org/r/949978 (https://phabricator.wikimedia.org/T341056) [13:55:15] (03CR) 10Vgutierrez: [C: 03+1] "TLS material looking good on eqiad & codfw deployments:" [puppet] - 10https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [13:55:32] (03CR) 10Ssingh: Add durum300[34] to site.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [13:55:36] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:55:44] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:55:49] (03CR) 10Bking: [C: 03+1] hiera: add swift user search_update_pipeline [puppet] - 10https://gerrit.wikimedia.org/r/949943 (https://phabricator.wikimedia.org/T342620) (owner: 10MVernon) [13:55:50] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:56:22] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:56:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42919/console" [puppet] - 10https://gerrit.wikimedia.org/r/949978 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:56:52] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:56:56] (03CR) 10Vgutierrez: [C: 03+2] tests: fix CertificateState tests on python 3.10+ [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949505 (https://phabricator.wikimedia.org/T344330) (owner: 10Vgutierrez) [13:57:38] (03PS1) 10Giuseppe Lavagetto: Re-update artificially images to overcome a docker-pkg bug [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/949979 (https://phabricator.wikimedia.org/T344438) [13:57:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum3003.esams.wmnet [13:57:44] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:57:51] (03CR) 10CI reject: [V: 04-1] puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:57:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42920/console" [puppet] - 10https://gerrit.wikimedia.org/r/949978 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:58:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir3004.esams.wmnet - jmm@cumin2002" [13:58:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:58:02] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir3004.esams.wmnet on all recursors [13:58:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir3004.esams.wmnet on all recursors [13:58:15] (03CR) 10Clément Goubert: [C: 03+1] Re-update 
artificially images to overcome a docker-pkg bug [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/949979 (https://phabricator.wikimedia.org/T344438) (owner: 10Giuseppe Lavagetto) [13:58:36] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3004.esams.wmnet - jmm@cumin2002" [13:59:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42921/console" [puppet] - 10https://gerrit.wikimedia.org/r/949978 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [13:59:13] (03PS2) 10Giuseppe Lavagetto: Re-update artificially images to overcome a docker-pkg bug [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/949979 (https://phabricator.wikimedia.org/T344438) [13:59:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir3004.esams.wmnet - jmm@cumin2002" [13:59:25] !log bking@cumin1001 'disabling puppet on wcqs/wdqs to test 949503' [13:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:28] (03PS2) 10Muehlenhoff: Add durum300[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355) [13:59:43] (03CR) 10Muehlenhoff: Add durum300[34] to site.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [13:59:53] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Re-update artificially images to overcome a docker-pkg bug [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/949979 (https://phabricator.wikimedia.org/T344438) (owner: 10Giuseppe Lavagetto) [14:00:06] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: switch to useing ca_server instead of enable_ca 
[puppet] - 10https://gerrit.wikimedia.org/r/949978 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [14:00:30] any maps alerts will clear soon, sorry for the noise [14:00:55] (03CR) 10Ssingh: [C: 03+1] Add durum300[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [14:01:20] (03CR) 10Muehlenhoff: [C: 03+2] Add durum300[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949977 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [14:01:34] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:01:49] (03PS2) 10Vgutierrez: Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [14:02:09] (03CR) 10CI reject: [V: 04-1] Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [14:03:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3004.esams.wmnet with OS bullseye [14:03:44] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3003.esams.wmnet - jmm@cumin2002" [14:04:08] <_joe_> urbanecm: I'll re-deploy to k8s [14:04:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3003.esams.wmnet - jmm@cumin2002" [14:04:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:04:28] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum3003.esams.wmnet on all recursors [14:04:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
durum3003.esams.wmnet on all recursors [14:04:39] ack, ty. i have some patches to finish, too. [14:04:43] ping me once ready for me :) [14:04:54] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3003.esams.wmnet - jmm@cumin2002" [14:05:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3003.esams.wmnet - jmm@cumin2002" [14:05:53] !log oblivian@deploy1002 Started scap: (no justification provided) [14:06:13] <_joe_> urbanecm: it will take some time, I'm rebuilding the images from scratch [14:06:24] ok, noted. [14:06:34] * urbanecm is having fun with our CI in the meantime. [14:06:40] !log oblivian@deploy1002 sync-world aborted: (no justification provided) (duration: 00m 46s) [14:08:28] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.744 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:09:06] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 9.879 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:09:08] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:09:24] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.664 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:10:40] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:10:50] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:11:40] (JobUnavailable) firing: Reduced availability for job sidekiq 
in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:45] !log oblivian@deploy1002 Started scap: (no justification provided) [14:12:12] (03PS1) 10Jbond: puppetserver: dont auto restart puppet server [puppet] - 10https://gerrit.wikimedia.org/r/949980 (https://phabricator.wikimedia.org/T330490) [14:12:49] (03CR) 10Jbond: [C: 03+2] puppetserver: dont auto restart puppet server [puppet] - 10https://gerrit.wikimedia.org/r/949980 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:14:29] 10SRE, 10Acme-chief, 10Traffic: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (10Vgutierrez) [14:16:30] 10SRE, 10SRE-swift-storage, 10Traffic, 10MediaWiki-Platform-Team (Radar), and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Krinkle) [14:16:39] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage [14:17:57] (03PS1) 10Ayounsi: cr1-esams: add transit an LACP min links [homer/public] - 10https://gerrit.wikimedia.org/r/949981 [14:18:14] (03CR) 10Vgutierrez: "looking good, please see inline comments." 
[software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [14:19:16] (03PS3) 10Ayounsi: Remove all mentions of old-esams, replace with new esams [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) [14:19:47] (03CR) 10Ayounsi: [C: 03+2] cr1-esams: add transit an LACP min links [homer/public] - 10https://gerrit.wikimedia.org/r/949981 (owner: 10Ayounsi) [14:19:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host durum3003.esams.wmnet with OS bullseye [14:20:18] (03Merged) 10jenkins-bot: cr1-esams: add transit an LACP min links [homer/public] - 10https://gerrit.wikimedia.org/r/949981 (owner: 10Ayounsi) [14:20:42] (SystemdUnitFailed) firing: (2) prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage [14:21:05] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:21:25] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:21:43] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2007.codfw.wmnet with reason: canary for T342361 [14:21:43] PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-blazegraph-exporter-wdqs-blazegraph.service,wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:46] T342361: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 [14:21:56] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2007.codfw.wmnet with reason: canary for T342361 [14:24:30] (03PS4) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [14:27:00] (03CR) 10CI reject: [V: 04-1] puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [14:29:20] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:29:22] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:29:56] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:33:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3004.esams.wmnet with OS bullseye [14:34:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir3004.esams.wmnet [14:34:28] (03PS1) 10Clément Goubert: Revert "Remove limits in ResourceQuota and container limitanges for mediawiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/949583 [14:36:39] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:36:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:34] !log jmm@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on durum3003.esams.wmnet with reason: host reimage [14:37:48] !log oblivian@deploy1002 Finished scap: (no justification provided) (duration: 26m 03s) [14:39:03] (03CR) 10Ssingh: [C: 03+1] "Looks good! Thanks. I think whatever PCC failures we have are unrelated but definitely could use another pair of eyes on this." [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [14:41:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3003.esams.wmnet with reason: host reimage [14:41:09] (03CR) 10Clément Goubert: [C: 03+2] Revert "Remove limits in ResourceQuota and container limitanges for mediawiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/949583 (owner: 10Clément Goubert) [14:43:31] (03Merged) 10jenkins-bot: Revert "Remove limits in ResourceQuota and container limitanges for mediawiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/949583 (owner: 10Clément Goubert) [14:44:08] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:44:36] !log Rolling back 949583 for T344438 [14:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:40] T344438: scap backport fails to build a image for k8s deployment - https://phabricator.wikimedia.org/T344438 [14:45:47] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:45:57] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:46:22] 10SRE, 10Lingua Libre, 10Traffic: Network issue between LinguaLibre and Wikimedia Commons - https://phabricator.wikimedia.org/T344421 (10Yug) Given the bug is persisting and preventing loggin, we may want to use the sitenotice to gently announce a pause in contributions / log in. [14:46:27] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[14:46:47] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:47:23] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2002.codfw.wmnet with OS bookworm [14:47:35] (03PS1) 10Btullis: Failover hive to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/949984 (https://phabricator.wikimedia.org/T303168) [14:47:37] (03PS1) 10Btullis: Fail back hive to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/949985 (https://phabricator.wikimedia.org/T303168) [14:47:43] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [14:48:01] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:48:19] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:48:49] (03CR) 10Btullis: [C: 03+2] Failover hive to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/949984 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [14:48:50] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:49:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host durum3004.esams.wmnet [14:49:03] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:49:09] !log Re-deploying mw-on-k8s T344438 [14:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:20] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:49:24] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:49:25] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:49:27] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:49:28] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:49:40] 10SRE, 10Lingua Libre, 10Traffic: Network issue between LinguaLibre and Wikimedia Commons - https://phabricator.wikimedia.org/T344421 (10Joe) It would be useful if you could follow https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue to give us a bit more details to go by. [14:50:56] (03PS1) 10Muehlenhoff: Add doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949987 (https://phabricator.wikimedia.org/T344355) [14:51:20] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:51:21] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:51:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3004.esams.wmnet - jmm@cumin2002" [14:51:34] 10SRE, 10Lingua Libre, 10Traffic: Network issue between LinguaLibre and Wikimedia Commons - https://phabricator.wikimedia.org/T344421 (10ayounsi) Hi, see https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue for more troubleshooting commands, but to start with could you provide the output of: `... 
[14:51:44] (03CR) 10Ssingh: [C: 03+1] Add doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949987 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [14:52:26] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:52:27] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:54:00] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:54:01] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:54:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3003.esams.wmnet with OS bullseye [14:54:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum3003.esams.wmnet [14:55:21] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:55:22] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:56:50] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:56:51] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [14:57:24] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:57:25] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [14:57:28] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [14:57:30] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [14:57:32] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [14:58:01] 10SRE, 10MediaWiki-Core-Revision-backend, 10Performance-Team (Radar): Compress data at external storage - https://phabricator.wikimedia.org/T106386 (10Krinkle) [14:58:07] urbanecm: ok, fixed and redeployed [14:58:17] claime: thanks for the fix! 
[14:58:17] bare metal and k8s should be sync'd now [14:58:26] urbanecm: mostly j.oe tbh [14:58:32] thanks to both :) [14:59:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum3004.esams.wmnet - jmm@cumin2002" [14:59:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:59:28] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache durum3004.esams.wmnet on all recursors [14:59:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum3004.esams.wmnet on all recursors [14:59:41] jouncebot: nowandnext [14:59:41] No deployments scheduled for the next 1 hour(s) and 0 minute(s) [14:59:41] In 1 hour(s) and 0 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1600) [14:59:48] aanzx: are you still here for your patch? [14:59:53] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3004.esams.wmnet - jmm@cumin2002" [15:00:44] urbanecm: yes [15:00:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum3004.esams.wmnet - jmm@cumin2002" [15:01:04] (03PS1) 10Urbanecm: Growth: Temporarily disable link-recommendation FE on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949988 (https://phabricator.wikimedia.org/T316079) [15:01:10] okay, let's go ahead. 
[15:01:14] (03PS3) 10Urbanecm: suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949568 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx) [15:01:16] (03CR) 10Muehlenhoff: [C: 03+2] Add doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949987 (https://phabricator.wikimedia.org/T344355) (owner: 10Muehlenhoff) [15:01:18] (03CR) 10Urbanecm: [C: 03+2] suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949568 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx) [15:01:28] (03PS2) 10Urbanecm: Growth: Temporarily disable link-recommendation FE on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949988 (https://phabricator.wikimedia.org/T316079) [15:01:31] (03CR) 10Urbanecm: [C: 03+2] Growth: Temporarily disable link-recommendation FE on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949988 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm) [15:02:01] (03Merged) 10jenkins-bot: suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949568 (https://phabricator.wikimedia.org/T344314) (owner: 10Anzx) [15:02:15] (03Merged) 10jenkins-bot: Growth: Temporarily disable link-recommendation FE on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949988 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm) [15:02:47] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949568|suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace (T344314)]], [[gerrit:949988|Growth: Temporarily disable link-recommendation FE on arwiki (T316079)]] [15:02:52] T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314 [15:02:52] T316079: Bump threshold for confidence score on link 
recommendation service suggestions - https://phabricator.wikimedia.org/T316079 [15:03:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh3003.wikimedia.org [15:03:09] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:03:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host durum3004.esams.wmnet with OS bullseye [15:04:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:40] !log urbanecm@deploy1002 urbanecm and anzx: Backport for [[gerrit:949568|suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace (T344314)]], [[gerrit:949988|Growth: Temporarily disable link-recommendation FE on arwiki (T316079)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (ac [15:04:40] cessible via k8s-experimental XWD option) [15:04:48] aanzx: please test. 
[15:05:01] urbanecm:ok [15:05:41] (03CR) 10Bking: [C: 03+1] hiera: add fake credential for swift user search_update_pipeline [labs/private] - 10https://gerrit.wikimedia.org/r/949944 (https://phabricator.wikimedia.org/T342620) (owner: 10MVernon) [15:07:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3003.wikimedia.org - jmm@cumin2002" [15:07:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh3003.wikimedia.org - jmm@cumin2002" [15:07:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:48] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh3003.wikimedia.org on all recursors [15:07:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh3003.wikimedia.org on all recursors [15:08:19] 10SRE-swift-storage, 10observability, 10EngProd-Virtual-Hackathon: Add FileBackend statsd metrics and a dashboard - https://phabricator.wikimedia.org/T217754 (10Krinkle) [15:08:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3003.wikimedia.org - jmm@cumin2002" [15:08:31] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:57] aanzx: how is it looking please? 
[15:09:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh3003.wikimedia.org - jmm@cumin2002" [15:09:32] urbanecm: looks good [15:09:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh3003.wikimedia.org with OS bullseye [15:10:01] ack, syncing. [15:10:03] !log urbanecm@deploy1002 urbanecm and anzx: Continuing with sync [15:11:13] urbanecm: i don't know why this patch is giving CI error https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/949570 can you take a look [15:11:52] (03CR) 10Btullis: [C: 03+2] Fail back hive to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/949985 (https://phabricator.wikimedia.org/T303168) (owner: 10Btullis) [15:11:56] aanzx: i rebased that patch, let's see if it happens again. [15:13:08] Ok [15:16:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:17:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3004.esams.wmnet with reason: host reimage [15:17:43] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949568|suwikisource remove NamespaceAliases and ExtraNamespaces for Page and Index namespace (T344314)]], [[gerrit:949988|Growth: Temporarily disable link-recommendation FE on arwiki (T316079)]] (duration: 14m 56s) [15:17:48] T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314 [15:17:48] T316079: Bump threshold for confidence score on link recommendation service suggestions - 
https://phabricator.wikimedia.org/T316079 [15:17:51] aanzx: should be live. [15:18:04] urbanecm: thanks [15:18:08] np [15:20:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3004.esams.wmnet with reason: host reimage [15:21:19] urbanecm: are you finished with the backports? [15:21:33] jnuche: for now, yes :). thanks [15:21:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:22:21] urbanecm: thx, I'm going to update scap [15:22:56] !log jnuche@deploy1002 Installing scap version "4.58.0" for 597 hosts [15:24:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10Clement_Goubert) [15:25:01] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10Clement_Goubert) Revalidation by manager waived since it's a ctr to req conversion. SSH key double-checked on authenticated out of band. Patch i... 
[15:25:06] (03CR) 10Ssingh: [C: 03+1] admin: New ssh key for ppenloglou [puppet] - 10https://gerrit.wikimedia.org/r/949839 (https://phabricator.wikimedia.org/T344405) (owner: 10Clément Goubert) [15:26:01] !log jnuche@deploy1002 Installing scap version "4.58.0" for 596 hosts [15:26:11] (03CR) 10Clément Goubert: [C: 03+2] admin: New ssh key for ppenloglou [puppet] - 10https://gerrit.wikimedia.org/r/949839 (https://phabricator.wikimedia.org/T344405) (owner: 10Clément Goubert) [15:26:28] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10mpopov) [15:26:57] !log jnuche@deploy1002 Installation of scap version "4.58.0" completed for 596 hosts [15:27:26] urbanecm: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/949570 after rebase it worked , thanks [15:27:40] no worries. [15:28:10] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (10Clement_Goubert) 05In progress→03Resolved a:03Clement_Goubert Patch merged, access should be updated after half an hour once puppet has run... [15:29:13] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10mpopov) @BTullis: once Omari is added to the analytics private data users group do you prefer he ping you here or on {T328457}... [15:31:15] urbanecm: is T344446 related to T344391 or are those separate bugs? 
[15:31:16] T344446: Notification received from metawiki instead of target site when group modified on metawiki - https://phabricator.wikimedia.org/T344446 [15:31:16] T344391: Interwiki user rights changes not being logged correctly - https://phabricator.wikimedia.org/T344391 [15:31:56] taavi: my assumption would be that they're one and the same bug, but let me try how notifications work now. [15:32:37] taavi: nope, it's a separate bug. [15:32:59] * urbanecm is noting that on task. [15:34:25] * urbanecm wishes he has a magic wand to command T342763 completed [15:34:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3004.esams.wmnet with OS bullseye [15:34:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum3004.esams.wmnet [15:34:41] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@2d3a0b7] (releasing): (no justification provided) [15:35:24] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@2d3a0b7] (releasing): (no justification provided) (duration: 00m 43s) [15:35:35] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:01] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [15:37:38] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (10MoritzMuehlenhoff) [15:38:17] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10Clement_Goubert) [15:41:23] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:42:22] (03PS1) 10Urbanecm: Revert "Growth: Temporarily disable link-recommendation FE on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949585 (https://phabricator.wikimedia.org/T316079) [15:42:38] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10Clement_Goubert) a:03OSefu-WMF @mpopov I'll be handling the access request as well as kerberos principal, but @BTullis will... [15:46:23] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:47:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:49:32] 10SRE-swift-storage, 10Data-Persistence, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (10Gehel) [15:50:26] !log sukhe@alert1001:~$ sudo systemctl reload icinga.service [15:50:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:46] (03PS3) 10BCornwall: Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) [15:51:51] (03CR) 10BCornwall: Update dependencies to match Bookworm versions (032 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [15:52:05] (03CR) 10CI reject: [V: 04-1] Update dependencies to match Bookworm versions [software/acme-chief] - 10https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [15:52:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:52:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [15:54:35] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [15:55:18] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10BTullis) Welcome @OSefu-WMF ! I'd be very grateful if you could do me a small favour please. >>! In T344257#9099732, @Clemen... 
[15:55:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:57:10] (03PS1) 10Ssingh: site.pp: use correct hostname for doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949993 [15:58:52] (03PS2) 10Ssingh: site.pp: use correct hostname for doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949993 [15:59:44] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10mpopov) @Clement_Goubert: Thank you! For my own future reference and @OSefu-WMF's clarification – do you mean adding the publi... [16:00:04] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:14] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host doh3003.wikimedia.org with OS bullseye [16:00:14] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host doh3003.wikimedia.org [16:01:49] (03PS1) 10Muehlenhoff: Fix doh300[34] entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949994 [16:02:14] (03CR) 10Ssingh: [C: 03+1] Fix doh300[34] entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949994 (owner: 10Muehlenhoff) [16:02:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/949993 (owner: 10Ssingh) [16:02:30] moritzm: merge any and I will abandon the other :) [16:02:37] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2002.codfw.wmnet with OS bookworm [16:02:39] (03Abandoned) 10Muehlenhoff: Fix doh300[34] entry in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/949994 (owner: 10Muehlenhoff) [16:02:54] sukhe: go ahead with a merge, I just abandoned mine :-) 
[16:03:05] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [16:03:32] (03CR) 10Ssingh: [C: 03+2] site.pp: use correct hostname for doh300[34] [puppet] - 10https://gerrit.wikimedia.org/r/949993 (owner: 10Ssingh) [16:09:55] (03PS1) 10Herron: Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/950007 [16:11:23] !log merging Puppet change 949934 - Remove all mentions of old-esams, replace with new esams [16:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:26] (03CR) 10Ayounsi: [C: 03+2] Remove all mentions of old-esams, replace with new esams [puppet] - 10https://gerrit.wikimedia.org/r/949934 (https://phabricator.wikimedia.org/T329219) (owner: 10Ayounsi) [16:12:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:11] RECOVERY - BGP status on cr1-esams is OK: BGP OK - up: 461, down: 18, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:13:49] (03PS2) 10Herron: Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/950007 (https://phabricator.wikimedia.org/T343987) [16:14:14] !log force agent run on A:lvs and A:esams [16:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:48] (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:14:52] (03CR) 10Herron: [C: 03+2] Revert "thanos-fe: switch to cfssl" [puppet] - 10https://gerrit.wikimedia.org/r/950007 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron) [16:16:09] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL 
- degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:09] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:17:13] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:17:18] ulsfo? [16:18:09] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10colewhite) [16:18:50] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10colewhite) Confirmed Omari's ssh key via Slack DM. `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIP3p3IF96m0/MLPgxWxgEbo6QyGZEMc8fj6bn3... [16:19:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:19:39] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMDE and LDAP/NDA for mareikeheuer - https://phabricator.wikimedia.org/T344341 (10KFrancis) The NDA has been signed. Please proceed with next steps. Thank you! [16:20:15] sukhe: 198.35.26.7 64605 9 7 0 16 2:56 Establ [16:20:21] so bgp bounced 3min ago [16:20:30] dns4003 [16:20:33] ok [16:21:19] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:21:28] looking [16:21:42] Error: 'asw2-esams' is not a valid parent for host 'ganeti3001' (file '/etc/icinga/objects/puppet_hosts.cfg', line 17459)! 
[16:21:45] Error: 'asw2-esams' is not a valid parent for host 'ganeti3002' (file '/etc/icinga/objects/puppet_hosts.cfg', line 17476)! [16:21:48] Error: 'asw2-esams' is not a valid parent for host 'ganeti3003' (file '/etc/icinga/objects/puppet_hosts.cfg', line 17493)! [16:22:07] er, ganeti3001/2/3 are not decommed yet? [16:22:40] yep [16:22:41] oh, that's what you were chatting about before with moritzm? [16:22:41] decommed [16:22:45] no, that was doh [16:23:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:23:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:23:29] probably a Puppet run race condition then, let's check [16:24:19] https://puppetboard.wikimedia.org/report/alert1001.wikimedia.org/87b4e07ef5d462d28975f160ac7f1fcaeb48c9d5 [16:24:23] removals here [16:24:23] PROBLEM - Check systemd state on ml-serve2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:13] sukhe: is there a way to explore the puppet resources to know if there is a "stuck" ganeti one?
[16:26:17] PROBLEM - Check systemd state on kubernetes2007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:53] XioNoX: checking [16:27:00] I think the ferm failures above are also somewhat related [16:27:04] as in from this change [16:27:25] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:27:36] sukhe: I was looking at ferm, it says "Aug 17 16:19:55 ml-serve2005 ferm[1340493]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?" [16:28:27] XioNoX: fixed the mlserve one [16:28:30] just doing a "ml-serve2005:~$ sudo service ferm start" seems to have solved it [16:28:33] eh [16:28:33] yeah [16:28:37] sukhe: what did you do? [16:28:50] ran agent again, we have been seeing some puppet race conditions with ferm in other hosts [16:28:56] I see [16:29:03] RECOVERY - Check systemd state on ml-serve2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:08] XioNoX: host_name ganeti3001 [16:29:08] hostgroups ganeti_esams,asw2-esams [16:29:14] just not seeing where it is coming from though [16:29:17] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 6.730 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:29:25] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:29:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:29:57] sukhe: fixed ferm on kubernetes2007 [16:30:03] thanks [16:30:46] jbond, you're still around? [16:30:55] RECOVERY - Check systemd state on kubernetes2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:47] address 10.20.0.31 [16:34:00] didn't we purge 10.20.0.0/23 from everywhere? [16:34:07] so this is related [16:36:05] er, I did an icinga restart instead of reload [16:36:10] I hope I didn't break it [16:36:34] there might be more alerts [16:36:44] from external monitoring, but otherwise should be fine [16:37:14] (03CR) 10Btullis: datahub: Enable OIDC to idp_test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [16:37:24] sukhe: I'll remove them manually from the config and see if puppet tries to add them back [16:37:35] I did the same yesterday for something else [16:38:09] yeah that's one option, though my concern is that if it's there, that might be a symptom of some other place where the older config settings are left [16:38:12] but yeah go for it [16:38:48] sukhe: it's up now [16:38:51] running puppet [16:38:55] cool :) [16:39:17] do it on alert2001 as well then just to be sure [16:40:11] waiting for puppet to finish [16:40:41] sukhe: wow [16:40:49] (03CR) 10Btullis: datahub: Enable OIDC to idp_test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [16:40:50] it added so many checks [16:40:51] * sukhe holds breath [16:41:06] alert1001?
[16:41:54] er, it added back what you changed basically [16:42:02] so it removed a bunch of ganeti3001/2/3 checks [16:42:08] like service checks [16:42:27] yeah that's quite a lot of them [16:42:44] + host_name ganeti3002 [16:42:53] and it re-added them? [16:42:57] yep [16:43:00] same error is back [16:43:08] Error: 'asw2-esams' is not a valid parent for host 'ganeti3003' (file '/etc/icinga/objects/puppet_hosts.cfg', line 17493)! [16:43:13] let's look again [16:43:36] so we can re-add a dummy asw2-esams [16:43:42] as a workaround [16:44:16] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:44:18] by dummy you mean just in Puppet but not in netbox? [16:44:25] sukhe: yeah [16:44:29] cwhite, anyone from o11y can help us out with icinga? [16:44:46] * cwhite looking [16:45:04] cwhite: tl;dr; ganeti3001/2/3 are decom, but puppet still adds them to /etc/icinga/objects/puppet_hosts.cfg [16:45:28] and it did remove them previously but not from this file I guess [16:45:31] which breaks icinga as the switch (parent) they depend on is gone [16:46:07] ok [16:46:24] cwhite: https://puppetboard.wikimedia.org/report/alert1001.wikimedia.org/87b4e07ef5d462d28975f160ac7f1fcaeb48c9d5 [16:46:27] they were removed here [16:46:36] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:46:51] the parent, yeah, but dunno for the ganeti hosts [16:47:33] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk2002.codfw.wmnet [16:47:48] file { '/etc/icinga/objects/puppet_hosts.cfg': [16:47:49] content => generate('/usr/local/bin/naggen2', '--type', 'hosts'), [16:47:52] (ProbeDown)
resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:08] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts flink-zk2002.codfw.wmnet [16:48:21] maybe running naggen2 manually? just speculating [16:48:35] sukhe: the puppet certs for these hosts are still alive [16:48:44] oh [16:48:48] PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:48:51] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk2003.codfw.wmnet [16:48:52] !log bking@cumin1001 START - Cookbook sre.dns.netbox [16:49:34] https://phabricator.wikimedia.org/T344363#9098423 [16:49:37] probably need to cert destroy these hosts. that should purge their entries from puppet [16:49:43] Host steps raised exception: Cumin execution failed (exit_code=2) [16:49:48] RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:50:00] seems like the cookbook failed, but not sure why [16:50:19] ok, now I see it https://puppetboard.wikimedia.org/node/ganeti3001.esams.wmnet [16:50:37] sukhe: I think the hosts were already unracked before decom [16:50:46] cwhite: thanks! [16:51:52] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2003.codfw.wmnet - bking@cumin1001" [16:52:12] we can do sudo puppet node deactivate ganeti3001.esams.wmnet [16:52:23] cwhite: thanks indeed for the pointer! 
[16:52:32] sukhe: about to do `puppet node clean ganeti3001.esams.wmnet` [16:52:33] that should remove them from puppetdb [16:52:37] dunno the difference with deactivate [16:52:48] yeah clean and then deactivate basically [16:52:53] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk2003.codfw.wmnet - bking@cumin1001" [16:52:53] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:52:53] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk2003.codfw.wmnet on all recursors [16:52:57] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk2003.codfw.wmnet on all recursors [16:53:01] worth a shot, should I try it? [16:53:15] I'm on it [16:53:18] ok thanks [16:53:21] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2003.codfw.wmnet - bking@cumin1001" [16:53:28] https://wikitech.wikimedia.org/wiki/Puppet#PuppetDB suggests clean and deactivate [16:53:32] in that order apparently [16:53:52] https://www.irccloud.com/pastebin/uDZu1Exm/ [16:54:00] ok [16:54:02] trying a run now [16:54:07] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk2003.codfw.wmnet - bking@cumin1001" [16:54:07] before you do the others [16:54:13] should be a safe operation but yeah good to verify [16:54:17] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2003.codfw.wmnet with OS bookworm [16:54:22] sukhe: too late :) [16:54:24] ha [16:54:28] I did it on all of them [16:54:30] all good :) [16:54:33] they're gone hosts anyway [16:54:36] yeah [16:54:44] XioNoX: let's move to -sre perhaps I guess [16:54:50] PROBLEM - Check systemd state on
an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:05] (03Abandoned) 10Ssingh: 10.in-addr.arpa: remove include for netbox/0.21.10.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/949972 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [16:59:25] (03CR) 10Ssingh: "Do not merge before Monday Aug 21" [dns] - 10https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [17:00:06] bd808: gettimeofday() says it's time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1700) [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1700) [17:01:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:11] I have updates to deploy for shellbox-syntaxhighlight today. Because the shellbox services share a helm chart and some code I will be redeploying all 5 shellbox* services. 
[17:03:22] (03CR) 10BryanDavis: [C: 03+2] shellbox: Bump to 2023-08-15-040901 [deployment-charts] - 10https://gerrit.wikimedia.org/r/949548 (https://phabricator.wikimedia.org/T335460) (owner: 10BryanDavis) [17:04:12] (03Merged) 10jenkins-bot: shellbox: Bump to 2023-08-15-040901 [deployment-charts] - 10https://gerrit.wikimedia.org/r/949548 (https://phabricator.wikimedia.org/T335460) (owner: 10BryanDavis) [17:04:21] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [17:08:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:03] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [17:08:58] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [17:09:05] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [17:09:30] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [17:09:35] RECOVERY - BGP status on cr1-esams is OK: BGP OK - up: 461, down: 18, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:09:36] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [17:10:09] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [17:10:16] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:10:45] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:10:52] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [17:11:56] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [17:11:59] PROBLEM - Check systemd state on maps2006 is CRITICAL: 
CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:20] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:15:25] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:15:32] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:16:22] (03PS5) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [17:16:52] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:16:59] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:16:59] (03PS6) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [17:17:40] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:17:47] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:18:35] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:18:41] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:19:41] (03CR) 10Eevans: [C: 03+1] deployment_server: add new service geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400) (owner: 10Hnowlan) [17:19:45] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [17:20:25] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [17:21:27] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [17:21:33] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:22:16] !log bd808@deploy1002 helmfile 
[eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:22:22] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [17:23:11] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [17:23:17] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:23:55] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:24:02] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [17:24:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:00] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [17:32:29] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:15] (03PS1) 10Zabe: admin: New SSH key for zabe [puppet] - 10https://gerrit.wikimedia.org/r/949999 [17:37:54] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum1001.eqiad.wmnet with OS bookworm [17:39:20] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum1002.eqiad.wmnet with OS bookworm [17:39:43] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum2001.codfw.wmnet with OS bookworm [17:39:51] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum2002.codfw.wmnet with OS bookworm [17:41:13] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:41:30] ^ expected [17:41:51] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 4 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:41:53] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:42:07] ^ durum hosts that brett is reimaging, so all good [17:42:13] * brett waves [17:43:09] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-puppet-ca-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:21] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:44:01] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:44:09] PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100% [17:44:29] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:44:58] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS bookworm [17:45:10] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum4002.ulsfo.wmnet with OS bookworm [17:45:23] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum5001.eqsin.wmnet with OS bookworm [17:45:30] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum5002.eqsin.wmnet with OS bookworm [17:45:39] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bookworm [17:45:40] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm [17:45:59] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:17] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:46:19] This carnage is me mercilessly killing durum hosts [17:46:31] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:46:39] (JobUnavailable) firing: Reduced availability for job bird in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:47:07] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage [17:47:53] Sorry for the spam. 
But think of all the wormies that will be crawling around when this is finished [17:49:28] (03PS1) 10Ssingh: site: reimage ncredir300[34] to proper role [puppet] - 10https://gerrit.wikimedia.org/r/950000 (https://phabricator.wikimedia.org/T344355) [17:49:31] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:49:37] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:49:43] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:49:47] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:49:57] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:49:57] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:50:02] ^Ignore! 
[17:50:15] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage [17:51:39] (JobUnavailable) resolved: Reduced availability for job bird in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:52:07] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [17:55:17] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [17:56:16] (03CR) 10Ssingh: [C: 03+2] site: reimage ncredir300[34] to proper role [puppet] - 10https://gerrit.wikimedia.org/r/950000 (https://phabricator.wikimedia.org/T344355) (owner: 10Ssingh) [17:56:55] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage [17:57:07] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage [17:57:14] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3003.esams.wmnet with OS bullseye [18:00:05] brennen and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T1800). 
[18:00:11] o/ [18:00:36] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950002 (https://phabricator.wikimedia.org/T343724) [18:00:38] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950002 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot) [18:01:29] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950002 (https://phabricator.wikimedia.org/T343724) (owner: 10TrainBranchBot) [18:02:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage [18:04:43] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage [18:07:23] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage [18:07:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:55] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [18:07:57] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [18:08:59] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.22 refs T343724 [18:09:02] T343724: 1.41.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T343724 [18:09:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:10:34] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage [18:12:32] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage [18:12:33] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk2003.codfw.wmnet with OS bookworm [18:12:33] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk2003.codfw.wmnet [18:12:36] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [18:13:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1001.eqiad.wmnet with OS bookworm [18:14:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:15:11] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [18:15:48] * Krinkle debugging on mw1439 (jobrunner) [18:16:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:06] RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [18:17:00] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:17:42] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:17:45] !log sukhe@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3003.esams.wmnet with reason: host reimage [18:18:32] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:21:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:21:05] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2002.codfw.wmnet with OS bookworm [18:22:16] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 111, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:18] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:22:50] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:32] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2001.codfw.wmnet with OS bookworm [18:26:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4002.ulsfo.wmnet with OS bookworm [18:28:22] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [18:29:59] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage [18:30:01] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5002.eqsin.wmnet with reason: host reimage [18:33:16] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage [18:33:32] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:33:37] eh? [18:33:42] RECOVERY - BFD status on cr2-eqiad is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:35:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5002.eqsin.wmnet with reason: host reimage [18:35:38] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1002.eqiad.wmnet with OS bookworm [18:37:33] (03PS1) 10Ssingh: hiera: update acme-chief authorized_hosts for ncredir300[34] [puppet] - 10https://gerrit.wikimedia.org/r/950005 [18:39:32] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:40:52] (03CR) 10BCornwall: [C: 03+1] hiera: update acme-chief authorized_hosts for ncredir300[34] [puppet] - 10https://gerrit.wikimedia.org/r/950005 (owner: 10Ssingh) [18:41:00] (03CR) 10Ssingh: [C: 03+2] hiera: update acme-chief authorized_hosts for ncredir300[34] [puppet] - 10https://gerrit.wikimedia.org/r/950005 (owner: 10Ssingh) [18:41:26] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:41:42] !log force agent run on A:acmechief for CR 950005 [18:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[18:43:35] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[18:43:42] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:47:24] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:47:33] ^ resolving
[18:47:46] (CR) Eevans: [C: +2] restbase: set legacy ssl port & optional encryption to false [puppet] - https://gerrit.wikimedia.org/r/949587 (https://phabricator.wikimedia.org/T339298) (owner: Eevans)
[18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:49:18] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:49:20] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6001.drmrs.wmnet with OS bookworm
[18:51:12] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[18:51:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir3004.esams.wmnet with OS bullseye
[18:53:06] RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[18:53:10] !log Rolling Cassandra restart codfw/b — T339298
[18:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:13] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[18:54:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3003.esams.wmnet with OS bullseye
[18:54:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm
[18:55:58] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:56:06] (PS1) Ssingh: conf-tool/esams: add ncredir300[34] [puppet] - https://gerrit.wikimedia.org/r/950026 (https://phabricator.wikimedia.org/T344355)
[18:56:16] PROBLEM - cassandra-a SSL 10.192.16.82:7001 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[18:57:10] that's me ^^^
[18:57:24] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:58:06] I was so sure that would come under the threshold of an alert 🤨
[18:58:50] PROBLEM - cassandra-b SSL 10.192.16.83:7001 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[18:58:50] (PS1) Gehel: query_service: fix glob expansion in blazegraph systemd unit [puppet] - https://gerrit.wikimedia.org/r/950027 (https://phabricator.wikimedia.org/T342361)
[18:59:18] (CR) CI reject: [V: -1] query_service: fix glob expansion in blazegraph systemd unit [puppet] - https://gerrit.wikimedia.org/r/950027 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[19:00:04] oh! 🤦‍♂️
[19:00:42] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.16.82:7001 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans SSL port has moved https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:00:42] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.16.83:7001 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans SSL port has moved https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:00:42] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.16.84:7001 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans SSL port has moved https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:01:17] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host durum4001.ulsfo.wmnet with OS bookworm
[19:01:39] (PS2) Gehel: query_service: fix glob expansion in blazegraph systemd unit [puppet] - https://gerrit.wikimedia.org/r/950027 (https://phabricator.wikimedia.org/T342361)
[19:02:30] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS bookworm
[19:02:45] (CR) Ryan Kemper: [C: +1] query_service: fix glob expansion in blazegraph systemd unit [puppet] - https://gerrit.wikimedia.org/r/950027 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[19:03:57] (CR) Gehel: [C: +2] query_service: fix glob expansion in blazegraph systemd unit [puppet] - https://gerrit.wikimedia.org/r/950027 (https://phabricator.wikimedia.org/T342361) (owner: Gehel)
[19:07:34] (PS1) Eevans: restbase: Use port 7000 for ssl monitoring checks [puppet] - https://gerrit.wikimedia.org/r/950028 (https://phabricator.wikimedia.org/T339298)
[19:07:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage
[19:09:29] (CR) Eevans: [C: +2] restbase: Use port 7000 for ssl monitoring checks [puppet] - https://gerrit.wikimedia.org/r/950028 (https://phabricator.wikimedia.org/T339298) (owner: Eevans)
[19:11:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir3004.esams.wmnet with reason: host reimage
[19:13:29] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:19:09] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:20:02] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5001.eqsin.wmnet with OS bookworm
[19:21:43] (PS1) Gehel: Revert "query_service: fix glob expansion in blazegraph systemd unit" [puppet] - https://gerrit.wikimedia.org/r/950008
[19:22:09] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5002.eqsin.wmnet with OS bookworm
[19:22:25] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[19:22:41] (CR) Ryan Kemper: [C: +1] Revert "query_service: fix glob expansion in blazegraph systemd unit" [puppet] - https://gerrit.wikimedia.org/r/950008 (owner: Gehel)
[19:22:48] (CR) Gehel: [C: +2] Revert "query_service: fix glob expansion in blazegraph systemd unit" [puppet] - https://gerrit.wikimedia.org/r/950008 (owner: Gehel)
[19:23:18] (PS1) Gehel: Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - https://gerrit.wikimedia.org/r/950009
[19:23:39] (CR) Ryan Kemper: [C: +1] Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - https://gerrit.wikimedia.org/r/950009 (owner: Gehel)
[19:24:01] (CR) Gehel: [C: +2] Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - https://gerrit.wikimedia.org/r/950009 (owner: Gehel)
[19:24:27] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:24:48] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[19:25:36] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[19:35:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir3004.esams.wmnet with OS bullseye
[19:36:47] (CR) Thcipriani: "The current default for Gerrit is 64. So 8 is probably OK 😊" [puppet] - https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: Jelto)
[19:40:10] PROBLEM - cassandra-a SSL 10.64.48.234:7000 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[19:51:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:42] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:54:48] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:56:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:05] brennen and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230817T2000).
[20:00:06] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:03:55] !log Rolling Cassandra restart codfw/c (RESTBase cluster) — T339298
[20:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:09] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[20:06:00] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[20:08:02] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:32] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:11:12] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:12:16] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:12:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4001.ulsfo.wmnet with OS bookworm
[20:12:58] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[20:15:04] (ConfdResourceFailed) firing: (144) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:15:55] (PS1) Thcipriani: Add newline to README for backport training [mediawiki-config] - https://gerrit.wikimedia.org/r/950036
[20:19:45] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverses for Lumen transport esams eqiad - cmooney@cumin1001"
[20:20:00] !log Rolling Cassandra restart codfw/d (RESTBase cluster) — T339298
[20:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:05] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[20:20:24] (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/950036 (owner: Thcipriani)
[20:20:57] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverses for Lumen transport esams eqiad - cmooney@cumin1001"
[20:20:57] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:21:02] (Merged) jenkins-bot: Add newline to README for backport training [mediawiki-config] - https://gerrit.wikimedia.org/r/950036 (owner: Thcipriani)
[20:21:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:21:16] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:950036|Add newline to README for backport training]]
[20:22:55] !log thcipriani@deploy1002 thcipriani: Backport for [[gerrit:950036|Add newline to README for backport training]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:25:26] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:28:39] !log thcipriani@deploy1002 thcipriani: Continuing with sync
[20:30:53] !log Rolling Cassandra restart eqiad/a (RESTBase cluster) — T339298
[20:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:05] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[20:34:46] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:950036|Add newline to README for backport training]] (duration: 13m 29s)
[20:35:23] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (ATsay-WMF)
[20:35:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:36:14] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:38:20] (PS1) Thcipriani: Revert "Add newline to README for backport training" [mediawiki-config] - https://gerrit.wikimedia.org/r/950011
[20:38:33] (CR) Ssingh: [C: +2] conf-tool/esams: add ncredir300[34] [puppet] - https://gerrit.wikimedia.org/r/950026 (https://phabricator.wikimedia.org/T344355) (owner: Ssingh)
[20:38:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:39:06] (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/950011 (owner: Thcipriani)
[20:39:49] (Merged) jenkins-bot: Revert "Add newline to README for backport training" [mediawiki-config] - https://gerrit.wikimedia.org/r/950011 (owner: Thcipriani)
[20:40:05] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:950011|Revert "Add newline to README for backport training"]]
[20:40:13] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1010.eqiad.wmnet with OS bullseye
[20:40:22] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1011.eqiad.wmnet with OS bullseye
[20:40:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:41:32] !log thcipriani@deploy1002 thcipriani: Backport for [[gerrit:950011|Revert "Add newline to README for backport training"]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:41:38] !log restart pybal on lvs3010
[20:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:02] sukhe: are we about to trigger that thing where scap depools everything if pybal is down?
[20:43:09] thcipriani: nope :)
[20:43:13] that has been resolved so should be fine
[20:43:18] \o/
[20:43:33] cool, alright, I'll continue my sync then if all's fine
[20:43:41] thcipriani: please do
[20:43:52] <2
[20:43:54] er
[20:43:56] <3
[20:44:34] !log thcipriani@deploy1002 thcipriani: Continuing with sync
[20:44:38] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1010.eqiad.wmnet with OS bullseye
[20:46:08] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[20:46:19] ^ should resolve
[20:46:24] soonish, in progress
[20:46:53] !log restart pybal on lvs3008
[20:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:59] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[20:49:16] !log Rolling Cassandra restart eqiad/b (RESTBase cluster) — T339298
[20:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:19] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[20:50:48] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:950011|Revert "Add newline to README for backport training"]] (duration: 10m 43s)
[20:51:24] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[20:51:50] (PS7) Ssingh: knams migration: remove references to old esams [dns] - https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219)
[20:53:28] (CR) Ssingh: "New change is the removal of the comment before ncredir, as we reimaged ncredir300[34], so will be pooling that as well. Previously we dec" [dns] - https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[20:54:03] (CR) Ssingh: "Do not merge before Monday." [dns] - https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) (owner: Ssingh)
[20:55:26] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1011.eqiad.wmnet with reason: host reimage
[20:56:06] !log Rolling Cassandra restart eqiad/d (RESTBase cluster) — T339298
[20:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:13] T339298: Upgrade restbase cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339298
[20:56:50] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[20:58:00] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1011.eqiad.wmnet with reason: host reimage
[21:01:26] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[21:03:35] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:03:39] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:09:34] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:10:45] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:11:00] ops-codfw, serviceops-radar, Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (Jhancock.wm) replaced the system board and the controller. System still did not post. pulled out everything except 1 ram, 1 cpu, a psu. Booted and started adding back components. Found...
[21:11:24] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@1d60a29]: make wikibase ttl imports to hdfs world readable
[21:11:36] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@1d60a29]: make wikibase ttl imports to hdfs world readable (duration: 00m 11s)
[21:16:21] (PS1) Cathal Mooney: Correct IP for Arelion BGP peering esams. [homer/public] - https://gerrit.wikimedia.org/r/950042
[21:17:34] (CR) Cathal Mooney: [C: +2] Correct IP for Arelion BGP peering esams. [homer/public] - https://gerrit.wikimedia.org/r/950042 (owner: Cathal Mooney)
[21:18:05] (Merged) jenkins-bot: Correct IP for Arelion BGP peering esams. [homer/public] - https://gerrit.wikimedia.org/r/950042 (owner: Cathal Mooney)
[21:21:54] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:21:57] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:22:48] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:24:28] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:25:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:25:14] (CR) Jon Harald Søby: "Surely the Page and Index namespaces should be set in the ProofreadPage extension's ProofreadPage.namespaces.php instead of WMF config, n" [mediawiki-config] - https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[21:27:10] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:29:08] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:29:15] (CR) Jon Harald Søby: Some initial configurations for suwikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[21:34:01] (CR) Jon Harald Søby: Some initial configurations for suwikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/949183 (https://phabricator.wikimedia.org/T344314) (owner: Anzx)
[21:34:04] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 5/5 UP : 4 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:34:36] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1011.eqiad.wmnet with OS bullseye
[21:34:43] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:40:38] !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
[21:40:42] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124
[21:40:54] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 16s)
[21:41:20] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[21:46:32] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5003 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[21:46:39] (Abandoned) Cathal Mooney: Include reverse entries for new esams LVS IPv6 VIPs [dns] - https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) (owner: Cathal Mooney)
[21:51:50] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns5004 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[21:54:11] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:57:14] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6001 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[21:59:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:54] RECOVERY - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns6002 is OK: OK: ntp.service was restarted after /etc/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring
[22:03:22] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:47] (PS1) Cathal Mooney: Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045
[22:05:46] (CR) CI reject: [V: -1] Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045 (owner: Cathal Mooney)
[22:08:32] (PS2) Cathal Mooney: Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045
[22:09:26] (CR) CI reject: [V: -1] Reverse includes for new esams range [dns] - https://gerrit.wikimedia.org/r/950045 (owner: Cathal Mooney)
[22:16:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:47] (PS1) DDesouza: Undeploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/950046 (https://phabricator.wikimedia.org/T336092)
[22:20:22] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:08] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[22:29:02] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:30:58] SRE, SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (odimitrijevic) Approving group membership
[22:31:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[22:33:22] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:01:05] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[23:03:04] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:22] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:27:37] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)