[00:04:35] zabe: Updated the patch for T330968 that should address the issue you were facing. Forgot to check how IPs as the target would be considered.
[00:05:35] yup, will give it another try
[00:05:47] Thanks!
[00:14:14] !log Deployed patch for T330968
[00:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:32] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/930840
[00:39:38] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/930840 (owner: TrainBranchBot)
[00:40:16] SRE, LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (ssingh) Added to the `nda` LDAP group. Please re-open if there are any issues, thanks!
[00:40:21] SRE, LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (ssingh) In progress→Resolved
[00:43:02] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:03:34] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T339884 (phaultfinder)
[01:04:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/930840 (owner: TrainBranchBot)
[01:37:38] PROBLEM - dump of s6 in eqiad on backupmon1001 is CRITICAL: Last dump for s6 at eqiad (db1140) taken on 2023-06-20 00:00:12 is 71 GiB, but the previous one was 102 GiB, a change of -30.5 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:52:10] (PS2) KartikMistry: Enable Content and Section Translation for a 3rd group of 10 languages previously lacking MT [mediawiki-config] - https://gerrit.wikimedia.org/r/931260
(https://phabricator.wikimedia.org/T337834)
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T0200)
[02:00:14] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:20] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:32] PROBLEM - dump of s6 in codfw on backupmon1001 is CRITICAL: Last dump for s6 at codfw (db2141) taken on 2023-06-20 00:00:02 is 71 GiB, but the previous one was 102 GiB, a change of -30.5 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:58] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:09:02] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:38] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 -
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T0300)
[03:17:42] (CertAlmostExpired) firing: (6) Certificate for service miscweb1003:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:30:07] (ProbeDown) firing: Service text:80 has failed probes (http_text_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:32:28] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_80: Servers cp1085.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: textlb_80: Servers cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1079.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:33:08] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_80: Servers cp1075.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:34:42] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:35:07] (ProbeDown) resolved: (2) Service text:80 has failed probes (http_text_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:35:34] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy
https://wikitech.wikimedia.org/wiki/PyBal
[03:49:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:50:22] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:51:37] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:59:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[04:12:42] (CertAlmostExpired) firing: (6) Certificate for service miscweb1003:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:16:44] RECOVERY - dump of db_inventory in codfw on backupmon1001 is OK: Last dump for db_inventory at codfw (db2185) taken on 2023-06-20 03:55:22 (94 KiB, +5.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[04:17:42] (CertAlmostExpired) resolved: (6) Certificate for service miscweb1003:30443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:49:07] (ProbeDown) firing: Service text:80 has failed probes (http_text_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:12] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_80: Servers cp1085.eqiad.wmnet, cp1089.eqiad.wmnet, cp1075.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:53:46] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:59:07] (ProbeDown) resolved: (2) Service text:80 has failed probes (http_text_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:15:55] (Abandoned) Simon04: Enable the Wikibase REST API on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/921612 (https://phabricator.wikimedia.org/T337141) (owner: Simon04)
[05:31:49] (CR) Ayounsi: "Other than the mechanism itself (I don't know varnish enough to review it), I worry that this list becomes quite complex to manage." [puppet] - https://gerrit.wikimedia.org/r/930941 (https://phabricator.wikimedia.org/T337318) (owner: Jameel Kaisar)
[05:33:22] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 14860
[05:33:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 14860
[06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T0600)
[06:00:06] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T0600).
[06:06:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:08:10] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50134 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:12:06] (CR) Ayounsi: [C: +1] interface::alias: update define to get prefix len from netmask [puppet] - https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) (owner: Jbond)
[06:13:20] (CR) Ayounsi: [C: +1] "Note that I don't think the change will be taken into account before a host reboot. So the hosts should be rebooted or updated manually fo" [puppet] - https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) (owner: Jbond)
[06:14:11] (CR) Ayounsi: [C: +1] "+1 only based on the tests passing and the overall logic, but I don't know Ruby enough to fully review it." [puppet] - https://gerrit.wikimedia.org/r/931236 (https://phabricator.wikimedia.org/T336864) (owner: Jbond)
[06:26:16] (PS1) Marostegui: install_server: Reimage db1124 [puppet] - https://gerrit.wikimedia.org/r/931493 (https://phabricator.wikimedia.org/T339835)
[06:26:18] (PS1) Marostegui: db1124: Include comment about Bookworm [puppet] - https://gerrit.wikimedia.org/r/931494
[06:26:57] (CR) Marostegui: [C: +2] db1124: Include comment about Bookworm [puppet] - https://gerrit.wikimedia.org/r/931494 (owner: Marostegui)
[06:27:03] (CR) Marostegui: [C: +2] install_server: Reimage db1124 [puppet] - https://gerrit.wikimedia.org/r/931493 (https://phabricator.wikimedia.org/T339835) (owner: Marostegui)
[06:27:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:29:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1119.eqiad.wmnet with OS bookworm
[06:33:18] RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:34:57] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1119.eqiad.wmnet with OS bookworm
[06:38:38] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[06:48:03] (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41821/console" [puppet] - https://gerrit.wikimedia.org/r/931298 (https://phabricator.wikimedia.org/T339300) (owner: Elukey)
[06:48:28] (PS1) Marostegui: install_server: Do not format db1124 /srv [puppet] - https://gerrit.wikimedia.org/r/931497
[06:49:01] (CR) Marostegui: [C: +2] install_server: Do not format db1124 /srv [puppet] - https://gerrit.wikimedia.org/r/931497 (owner: Marostegui)
[06:52:36] (PS1) Elukey: role::cache::{text,upload}: move vk instances to PKI [puppet] - https://gerrit.wikimedia.org/r/931498 (https://phabricator.wikimedia.org/T337825)
[06:54:32] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41822/console" [puppet] - https://gerrit.wikimedia.org/r/931498 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[06:58:05] (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41823/console" [puppet] - https://gerrit.wikimedia.org/r/931498
(https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[06:58:28] ops-eqiad, DBA: Power drain + restart idrac for db1119 - https://phabricator.wikimedia.org/T339889 (Marostegui)
[06:58:39] ops-eqiad, DBA: Power drain + restart idrac for db1119 - https://phabricator.wikimedia.org/T339889 (Marostegui) p:Triage→Medium
[07:00:04] Amir1, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T0700).
[07:00:05] kart_, dcausse, and MatmaRex: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:25] * kart_ is here
[07:00:28] o/
[07:00:49] (PS1) Slyngshede: C:beta:autoupdate Move wmf-beta-autoupdate to files. [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480)
[07:00:57] hi
[07:01:38] (CR) CI reject: [V: -1] C:beta:autoupdate Move wmf-beta-autoupdate to files. [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[07:02:18] I'll self-deploy my patch..
[07:02:36] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/931260 (https://phabricator.wikimedia.org/T337834) (owner: KartikMistry)
[07:03:00] SRE, LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (MoritzMuehlenhoff) Resolved→Open >>! In T337126#8947666, @ssingh wrote: > Added to the `nda` LDAP group. Please re-open if there are any issues, thanks! cn=nda needs a tracking entry in data....
[07:03:26] (PS2) Slyngshede: C:beta:autoupdate Move wmf-beta-autoupdate to files.
[puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480)
[07:03:28] (Merged) jenkins-bot: Enable Content and Section Translation for a 3rd group of 10 languages previously lacking MT [mediawiki-config] - https://gerrit.wikimedia.org/r/931260 (https://phabricator.wikimedia.org/T337834) (owner: KartikMistry)
[07:04:03] !log kartik@deploy1002 Started scap: Backport for [[gerrit:931260|Enable Content and Section Translation for a 3rd group of 10 languages previously lacking MT (T337834)]]
[07:04:07] T337834: Enable MinT, Content and Section Translation for a 3rd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337834
[07:05:35] !log kartik@deploy1002 kartik: Backport for [[gerrit:931260|Enable Content and Section Translation for a 3rd group of 10 languages previously lacking MT (T337834)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:06:44] (PS1) Muehlenhoff: Retire legacy "raid" fact [puppet] - https://gerrit.wikimedia.org/r/931500 (https://phabricator.wikimedia.org/T313312)
[07:07:10] !log installing openssl security updates on buster
[07:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:28] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:931260|Enable Content and Section Translation for a 3rd group of 10 languages previously lacking MT (T337834)]] (duration: 10m 25s)
[07:14:32] T337834: Enable MinT, Content and Section Translation for a 3rd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337834
[07:15:00] I'm done with my patch :)
[07:15:54] ok, doing the same with mine, tho, it does not actually require a deploy, will just rebase the deployment server
[07:16:30] (CR) DCausse: [C: +2] token_count_router: infer the analyzer from the field (followup) [extensions/CirrusSearch]
(wmf/1.41.0-wmf.13) - https://gerrit.wikimedia.org/r/931271 (https://phabricator.wikimedia.org/T339810) (owner: DCausse)
[07:18:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1119.eqiad.wmnet with OS bookworm
[07:18:15] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-codfw
[07:19:08] ops-eqiad, DBA: Power drain + restart idrac for db1119 - https://phabricator.wikimedia.org/T339889 (Marostegui) Open→Resolved I have finally managed to get it back with a cold restart via install_console
[07:20:14] !log ariel@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1003.eqiad.wmnet with OS bullseye
[07:22:19] (PS3) Slyngshede: C:beta:autoupdate Move wmf-beta-autoupdate to files. [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480)
[07:22:43] (CR) CI reject: [V: -1] C:beta:autoupdate Move wmf-beta-autoupdate to files. [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[07:23:56] (PS4) Slyngshede: C:beta:autoupdate Move wmf-beta-autoupdate to files.
[puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480)
[07:26:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1119.eqiad.wmnet with reason: host reimage
[07:26:27] (PS2) MVernon: profile::thanos::swift: add machinetranslation user [puppet] - https://gerrit.wikimedia.org/r/931297 (https://phabricator.wikimedia.org/T335491)
[07:26:58] (CR) MVernon: profile::thanos::swift: add machinetranslation user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/931297 (https://phabricator.wikimedia.org/T335491) (owner: MVernon)
[07:29:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1119.eqiad.wmnet with reason: host reimage
[07:32:32] (CR) Slyngshede: "There is another patch that completely removed this script, but talking to people earlier indicated that it's still used." [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[07:35:00] (Merged) jenkins-bot: token_count_router: infer the analyzer from the field (followup) [extensions/CirrusSearch] (wmf/1.41.0-wmf.13) - https://gerrit.wikimedia.org/r/931271 (https://phabricator.wikimedia.org/T339810) (owner: DCausse)
[07:35:08] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review, User-jbond: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (SLyngshede-WMF) a:SLyngshede-WMF
[07:35:42] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review, User-jbond: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (SLyngshede-WMF) I've added patches for the remaining three Pyth...
[07:37:49] ok, I'm done with my patch
[07:40:28] (CR) Elukey: [V: +1 C: +2] role::ml_cache::storage: upgrade to Cassandra 4.x [puppet] - https://gerrit.wikimedia.org/r/931298 (https://phabricator.wikimedia.org/T339300) (owner: Elukey)
[07:40:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-codfw
[07:41:11] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/931500 (https://phabricator.wikimedia.org/T313312) (owner: Muehlenhoff)
[07:42:27] (CR) Vgutierrez: "I guess that PCC issues have been fixed since last week, please re-run PCC against the whole set of impacted hosts (one cp host per cluste" [puppet] - https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: Fabfur)
[07:46:56] (PS15) Muehlenhoff: ferm: Allow passing the port is a more structured way [puppet] - https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497)
[07:50:31] (CR) Muehlenhoff: ferm: Allow passing the port is a more structured way (4 comments) [puppet] - https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[07:56:47] (CR) Fabfur: [V: +1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41825/console" [puppet] - https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: Fabfur)
[07:57:24] (CR) Jbond: [C: -1] "see inline" [puppet] - https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: BCornwall)
[07:57:26] (CR) Vgutierrez: [C: +1] "nice, LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: Fabfur)
[08:05:58] (CR) Jbond: [C: -1] "see inline" [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[08:06:18] (CR) Jbond: [C: +2] wmflib: Add new function to convert from a netmask to cidr [puppet] - https://gerrit.wikimedia.org/r/931236 (https://phabricator.wikimedia.org/T336864) (owner: Jbond)
[08:06:58] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1003.eqiad.wmnet with OS bullseye
[08:10:04] (PS5) Slyngshede: C:beta:autoupdate Move wmf-beta-autoupdate to files. [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480)
[08:10:55] (CR) Slyngshede: C:beta:autoupdate Move wmf-beta-autoupdate to files. (4 comments) [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[08:12:24] (PS2) Elukey: role::cache::{text,upload}: move vk codfw instances to PKI [puppet] - https://gerrit.wikimedia.org/r/931498 (https://phabricator.wikimedia.org/T337825)
[08:13:23] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[08:14:20] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[08:15:30] (CR) Jbond: "oh i never sent this" [software/homer] - https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: Ayounsi)
[08:15:37] (PS6) Slyngshede: C:beta:autoupdate Move wmf-beta-autoupdate to files. [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480)
[08:16:23] (CR) Slyngshede: C:beta:autoupdate Move wmf-beta-autoupdate to files.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[08:17:22] (CR) Elukey: "John: o/ this is an interesting use case, since the cassandra instances are using a custom CA in every cluster, so in theory to be able to" [puppet] - https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:17:47] (PS1) Marostegui: packages_wmf.pp: Add bookworm support [puppet] - https://gerrit.wikimedia.org/r/931547 (https://phabricator.wikimedia.org/T339185)
[08:20:12] (CR) Marostegui: [C: +2] packages_wmf.pp: Add bookworm support [puppet] - https://gerrit.wikimedia.org/r/931547 (https://phabricator.wikimedia.org/T339185) (owner: Marostegui)
[08:28:56] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[08:37:14] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:elastic-eqiad
[08:37:30] !log ariel@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1003.eqiad.wmnet with OS bullseye
[08:38:36] (PS1) Marostegui: packages_wmf.pp: Remove stretch support [puppet] - https://gerrit.wikimedia.org/r/931549
[08:39:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1119.eqiad.wmnet with OS bookworm
[08:41:10] (CR) Majavah: [C: +1] packages_wmf.pp: Remove stretch support [puppet] - https://gerrit.wikimedia.org/r/931549 (owner: Marostegui)
[08:41:22] (CR) Marostegui: [C: +2] packages_wmf.pp: Remove stretch support [puppet] - https://gerrit.wikimedia.org/r/931549 (owner: Marostegui)
[08:42:43] (PS8) Jbond: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: Ayounsi)
[08:44:37] (CR) CI reject: [V:
-1] NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: Ayounsi)
[08:45:04] (PS1) Fabfur: [beta] Update wgCdnServersNoPurge [mediawiki-config] - https://gerrit.wikimedia.org/r/931550 (https://phabricator.wikimedia.org/T327742)
[08:45:14] (CR) Ilias Sarantopoulos: [C: +2] ml-services: deploy NLLB model [deployment-charts] - https://gerrit.wikimedia.org/r/931290 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[08:46:15] (Merged) jenkins-bot: ml-services: deploy NLLB model [deployment-charts] - https://gerrit.wikimedia.org/r/931290 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[08:48:31] (CR) Jbond: NetboxInventory: use GraphQL and save ~30s at each run (4 comments) [software/homer] - https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: Ayounsi)
[08:49:06] (CR) Jbond: [C: +1] C:beta:autoupdate Move wmf-beta-autoupdate to files.
[puppet] - https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[08:53:48] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[08:54:54] jbond: o/ thanks for the review --^ I left a comment for you in the change-log, lemme know your thoughts when you have time
[08:56:53] (PS1) Marostegui: core_test.pp: Remove MariaDB 10.1 [puppet] - https://gerrit.wikimedia.org/r/931553
[08:59:15] (CR) Marostegui: [C: +2] core_test.pp: Remove MariaDB 10.1 [puppet] - https://gerrit.wikimedia.org/r/931553 (owner: Marostegui)
[09:00:56] SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero)
[09:01:56] (CR) Jbond: [C: +1] cassandra: add initial support for PKI TLS certs to 4.x (1 comment) [puppet] - https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[09:02:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:elastic-eqiad
[09:02:21] (PS7) Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470)
[09:02:34] (CR) Elukey: cassandra: add initial support for PKI TLS certs to 4.x (2 comments) [puppet] - https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[09:02:51] (CR) CI reject: [V: -1] cassandra: add initial support for PKI TLS certs to 4.x [puppet] - https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[09:03:43] (PS8) Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - https://gerrit.wikimedia.org/r/931276
(https://phabricator.wikimedia.org/T288470)
[09:04:26] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[09:05:08] (CR) Elukey: cassandra: add initial support for PKI TLS certs to 4.x (1 comment) [puppet] - https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: Elukey)
[09:08:19] (PS1) Arturo Borrero Gonzalez: cloudservices: codfw1dev: enable LDAP backups [puppet] - https://gerrit.wikimedia.org/r/931558 (https://phabricator.wikimedia.org/T339894)
[09:12:15] (CR) Arturo Borrero Gonzalez: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/output/931558/41826/" [puppet] - https://gerrit.wikimedia.org/r/931558 (https://phabricator.wikimedia.org/T339894) (owner: Arturo Borrero Gonzalez)
[09:18:06] (PS1) Slyngshede: new_user() do not set CN. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/931559
[09:19:30] (CR) CI reject: [V: -1] new_user() do not set CN. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/931559 (owner: Slyngshede)
[09:20:26] (PS1) Slyngshede: ldapbackend: Capitalized CN and SN [software/bitu] - https://gerrit.wikimedia.org/r/931560
[09:20:52] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:23:01] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1003.eqiad.wmnet with OS bullseye
[09:23:29] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppetserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:50] (CR) Muehlenhoff: Add a cookbook to drain a Ganeti node (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: Muehlenhoff)
[09:24:15] (PS2) Slyngshede: new_user() do not set CN.
[software/bitu-ldap] - https://gerrit.wikimedia.org/r/931559
[09:24:46] (CR) Jcrespo: [C: +1] "This looks good to me. I don't have context as if it will match what is intended, but we can test a recovery afterwards, no issue- as far " [puppet] - https://gerrit.wikimedia.org/r/931558 (https://phabricator.wikimedia.org/T339894) (owner: Arturo Borrero Gonzalez)
[09:26:15] (PS1) Jbond: pki: add puppet intermediate [puppet] - https://gerrit.wikimedia.org/r/931562
[09:26:31] (CR) Jbond: [C: +2] pki: add puppet intermediate [puppet] - https://gerrit.wikimedia.org/r/931562 (owner: Jbond)
[09:28:24] (CR) Vgutierrez: [C: +1] [beta] Update wgCdnServersNoPurge (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/931550 (https://phabricator.wikimedia.org/T327742) (owner: Fabfur)
[09:30:43] (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/931560 (owner: Slyngshede)
[09:31:12] (CR) Slyngshede: [V: +2 C: +2] ldapbackend: Capitalized CN and SN [software/bitu] - https://gerrit.wikimedia.org/r/931560 (owner: Slyngshede)
[09:32:10] (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] cloudservices: codfw1dev: enable LDAP backups [puppet] - https://gerrit.wikimedia.org/r/931558 (https://phabricator.wikimedia.org/T339894) (owner: Arturo Borrero Gonzalez)
[09:33:05] (PS9) Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470)
[09:33:08] (PS5) Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470)
[09:33:31] (CR) Muehlenhoff: [C: +1] "LGTM" [software/bitu-ldap] - https://gerrit.wikimedia.org/r/931559 (owner: Slyngshede)
[09:34:18] (CR) Slyngshede: [C: +2] new_user() do not set CN.
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/931559 (owner: 10Slyngshede) [09:34:46] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41827/console" [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:35:41] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:37:18] (03PS1) 10Majavah: cloudlb: firewall: use new ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/931563 [09:37:20] (03PS1) 10Majavah: P:openstack: use dnsquery::resolve [puppet] - 10https://gerrit.wikimedia.org/r/931564 [09:37:52] (03CR) 10CI reject: [V: 04-1] cloudlb: firewall: use new ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/931563 (owner: 10Majavah) [09:38:37] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41828/console" [puppet] - 10https://gerrit.wikimedia.org/r/931563 (owner: 10Majavah) [09:38:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41829/console" [puppet] - 10https://gerrit.wikimedia.org/r/931564 (owner: 10Majavah) [09:39:48] (03PS2) 10Majavah: cloudlb: firewall: use new ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/931563 [09:39:50] (03PS2) 10Majavah: P:openstack: use dnsquery::resolve [puppet] - 10https://gerrit.wikimedia.org/r/931564 [09:40:54] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41831/console" [puppet] - 10https://gerrit.wikimedia.org/r/931563 (owner: 10Majavah) [09:41:04] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41830/console" [puppet] - 10https://gerrit.wikimedia.org/r/931564 (owner: 10Majavah) [09:44:07] 
(03PS10) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [09:44:09] (03PS6) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) [09:44:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:45:17] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41832/console" [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:48:00] (03PS1) 10Majavah: pcc: include CORE_DIFF in summaries [puppet] - 10https://gerrit.wikimedia.org/r/931566 [09:48:02] (03PS11) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [09:48:04] (03PS7) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) [09:49:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41833/console" [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:49:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:49:43] 10SRE, 10Wikidata, 10wdwb-tech, 10Shape Expressions (M2: Linking to EntitySchemas in statements), and 3 others: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 (10Arian_Bozorg) [09:50:20] jouncebot: nowandnext [09:50:20] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [09:50:20] In 0 hour(s) and 9 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T1000) [09:50:46] * Lucas_WMDE waits for 9 minutes to find out if that window is empty or not [09:51:54] (03PS12) 10Elukey: cassandra: add initial support for PKI TLS certs to 4.x [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) [09:51:56] (03PS8) 10Elukey: role::ml_cache::storage: enable PKI tls certs [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) [09:52:14] ACKNOWLEDGEMENT - dump of s6 in codfw on backupmon1001 is CRITICAL: Last dump for s6 at codfw (db2141) taken on 2023-06-20 00:00:02 is 71 GiB, but the previous one was 102 GiB, a change of -30.5 % Jcrespo Expected https://phabricator.wikimedia.org/P49455 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:52:14] ACKNOWLEDGEMENT - dump of s6 in eqiad on backupmon1001 is CRITICAL: Last dump for s6 at eqiad (db1140) taken on 2023-06-20 00:00:12 is 71 GiB, but the previous one was 102 GiB, a change of -30.5 % Jcrespo Expected https://phabricator.wikimedia.org/P49455 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:52:20] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:09] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41834/console" [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:54:24] (03PS3) 10Ladsgroup: Stop setting wgLegacyEncdoing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931306 (https://phabricator.wikimedia.org/T128150) [09:55:01] (03CR) 10Elukey: "@Jbond: I tweaked a bit this change following your suggestion, and then used profile::base::certificates' power to create the new bundle/t" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [09:55:41] (03PS1) 10Jbond: puppetserver: Add CA certificate [puppet] - 10https://gerrit.wikimedia.org/r/931567 [09:57:14] (03CR) 10Jbond: "lgtm the diff seems to be ordering" [puppet] - 10https://gerrit.wikimedia.org/r/931563 (owner: 10Majavah) [09:57:19] (03CR) 10Jbond: [C: 03+1] cloudlb: firewall: use new ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/931563 (owner: 10Majavah) [09:57:32] (03CR) 10Jbond: [C: 03+2] puppetserver: Add CA certificate [puppet] - 10https://gerrit.wikimedia.org/r/931567 (owner: 10Jbond) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T1000) [10:00:37] SREs: I’d like to deploy some security patches – do you need the infra window or can I take it over? 
[10:02:06] (PS1) Hnowlan: changeprop: reduce ThumbnailRender concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/931568 (https://phabricator.wikimedia.org/T338765) [10:02:53] (CR) EoghanGaffney: sre: add gitlab ci alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: Jelto) [10:03:27] alright, I’ll go ahead with deploying [10:04:42] Lucas_WMDE: once done, ping me [10:06:00] ok [10:06:13] > Last dump for s6 at eqiad (db1140) taken on 2023-06-20 00:00:12 is 71 GiB, but the previous one was 102 GiB, a change of -30.5 % [10:06:19] yikes [10:06:26] marostegui: that's moving to ES for wikitech content [10:06:39] fyi jynus as well [10:07:23] Lucas_WMDE: why yikes? [10:08:26] idk it sounded like data might have been lost [10:08:31] but I’m probably just missing some context :) [10:08:40] https://phabricator.wikimedia.org/P49455#200106 [10:08:43] * Lucas_WMDE syncing [10:08:46] (PS4) Btullis: Add support for upgrading datahub [deployment-charts] - https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [10:10:12] Lucas_WMDE: nah, wikitech had content of all of its revision stored in s6 core tables. We moved them to external storage [10:10:26] ah, I see [10:10:42] so, the way it’s been on all other wikis for years and years? ^^ [10:10:51] yup [10:10:57] yay for wikitech becoming less special \o/ [10:11:10] labswiki didn't use to be part of the database cluster [10:11:31] it has been like this for portions of basically every wiki (I just learned), like enwiki had 5M edits stored in text table.
[10:11:34] the client still isn't AFAIK [10:11:39] frwiki was 2m or something [10:12:12] yup, that's https://phabricator.wikimedia.org/T292707#8930327 [10:13:20] (PS1) Majavah: labswiki: Use ExternalStore by default [mediawiki-config] - https://gerrit.wikimedia.org/r/931570 [10:13:22] Amir1: ^ [10:13:40] cool [10:14:07] I'll review this soon [10:14:29] apparently I’ve already forgotten how to do deploys without `scap backport`, because I was just confused why the `scap sync-file` didn’t wait for me to confirm the change was ok on mwdebug 🤦 [10:14:35] ah well, there wasn’t a whole lot to test anyway [10:14:45] (and we still have the canaries) [10:15:33] (PS2) FNegri: cumin: Increase connect_timeout for slow servers [puppet] - https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [10:15:38] SRE, Traffic: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 (Vgutierrez) [[ https://grafana.wikimedia.org/goto/0JYX92u4z?orgId=1 | During the issue ]] text@esams never went higher than ~400 rps on port 80 per instance: {F37110208} [[ https:...
[10:16:12] !log deployed patches for T339111 [10:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:17:28] Amir1: I’m done [10:17:35] awesome [10:17:44] (03PS4) 10Ladsgroup: Stop setting wgLegacyEncdoing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931306 (https://phabricator.wikimedia.org/T128150) [10:17:56] RECOVERY - Check systemd state on analytics1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:57] (03CR) 10Ladsgroup: [C: 03+2] Stop setting wgLegacyEncdoing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931306 (https://phabricator.wikimedia.org/T128150) (owner: 10Ladsgroup) [10:18:19] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [10:18:52] (03Merged) 10jenkins-bot: Stop setting wgLegacyEncdoing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931306 (https://phabricator.wikimedia.org/T128150) (owner: 10Ladsgroup) [10:18:57] 10SRE, 10Traffic: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 (10Vgutierrez) pybal on lvs3005 and lvs3007 didn't report any healthcheck failures during the issue (besides the expected one for cp3050/cp3051 under maintenance at that moment) [10:21:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:22:31] !log 
ladsgroup@deploy1002 Started scap: Backport for [[gerrit:931306|Stop setting wgLegacyEncdoing (T128150 T128151)]] [10:22:37] T128151: Migrate all old DB rows from windows-1252 to UTF-8 on enwiki - https://phabricator.wikimedia.org/T128151 [10:22:38] T128150: Stop needing to use wgLegacyEncoding in Wikimedia cluster production - https://phabricator.wikimedia.org/T128150 [10:23:54] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:931306|Stop setting wgLegacyEncdoing (T128150 T128151)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [10:27:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:29:25] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: reduce ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/931568 (https://phabricator.wikimedia.org/T338765) (owner: 10Hnowlan) [10:29:57] (03PS3) 10FNegri: cumin: Increase connect_timeout for slow servers [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [10:30:37] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:931306|Stop setting wgLegacyEncdoing (T128150 T128151)]] (duration: 08m 06s) [10:30:42] T128151: Migrate all old DB rows from windows-1252 to UTF-8 on enwiki - https://phabricator.wikimedia.org/T128151 [10:30:43] T128150: Stop needing to use wgLegacyEncoding in Wikimedia cluster production - https://phabricator.wikimedia.org/T128150 [10:31:02] 10SRE, 10Traffic: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 (10Vgutierrez) for both IPv4 and IPv6 the alert reports "context deadline exceeded": ` 
target=http://[91.198.174.192]:80/wiki/Special:BlankPage msg="Error for HTTP request" err="Get \... [10:32:31] (03CR) 10Hnowlan: [C: 03+2] changeprop: reduce ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/931568 (https://phabricator.wikimedia.org/T338765) (owner: 10Hnowlan) [10:33:31] (03Merged) 10jenkins-bot: changeprop: reduce ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/931568 (https://phabricator.wikimedia.org/T338765) (owner: 10Hnowlan) [10:34:32] hnowlan: we could capture jobs inserted with kafkacat and see what's causing this storm [10:35:42] (03PS16) 10Muehlenhoff: ferm: Allow passing the port in a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) [10:36:24] (03PS1) 10Ilias Sarantopoulos: ml-services: fix nllb default input parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/931575 (https://phabricator.wikimedia.org/T333861) [10:36:38] Amir1: yeah, good point. I'm fairly certain it's just someone doing some kind of bulk upload. 
In general it seems we'd be vulnerable to this by anyone doing that [10:36:54] yeah [10:38:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931292 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [10:38:25] (03CR) 10Jbond: cassandra: add initial support for PKI TLS certs to 4.x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [10:38:38] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:39:06] (03CR) 10Muehlenhoff: [C: 03+2] ferm: Allow passing the port in a more structured way [puppet] - 10https://gerrit.wikimedia.org/r/930656 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:39:54] (03CR) 10Jbond: [C: 03+2] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/931566 (owner: 10Majavah) [10:40:09] (03CR) 10Hashar: [C: 03+1] "PCC does not necessarily gives much information at https://puppet-compiler.wmflabs.org/output/927674/1979/" [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [10:40:20] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: fix nllb default input parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/931575 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [10:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [10:41:13] (03Merged) 10jenkins-bot: ml-services: fix nllb default input parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/931575 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [10:43:41] (03CR) 10Hashar: [C: 03+1] "The script 
is running on `deployment-deploy03`, once done I will check the Jenkins jobs are still properly working." [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [10:44:36] (03PS1) 10Majavah: cr-labs: Permit bacula backup traffic [homer/public] - 10https://gerrit.wikimedia.org/r/931576 (https://phabricator.wikimedia.org/T339894) [10:46:44] (03PS2) 10Majavah: cr-labs: Permit bacula backup traffic [homer/public] - 10https://gerrit.wikimedia.org/r/931576 (https://phabricator.wikimedia.org/T339894) [10:48:28] (03PS3) 10Majavah: cr-labs: Permit bacula backup traffic [homer/public] - 10https://gerrit.wikimedia.org/r/931576 (https://phabricator.wikimedia.org/T338132) [10:50:10] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] registry: Add nginx logs to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/930719 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney) [10:50:15] (03PS1) 10Muehlenhoff: PCC: Pass ports without ferm-specific service constants [puppet] - 10https://gerrit.wikimedia.org/r/931578 [10:50:31] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: extend ldap-codfw1dev certs for internal names [puppet] - 10https://gerrit.wikimedia.org/r/931579 (https://phabricator.wikimedia.org/T339905) [10:50:39] (03CR) 10CI reject: [V: 04-1] PCC: Pass ports without ferm-specific service constants [puppet] - 10https://gerrit.wikimedia.org/r/931578 (owner: 10Muehlenhoff) [10:52:35] (03PS1) 10Muehlenhoff: noc: Pass ports without ferm-specific service constants [puppet] - 10https://gerrit.wikimedia.org/r/931581 [10:53:09] (03PS2) 10Muehlenhoff: PCC: Pass ports without ferm-specific service constants [puppet] - 10https://gerrit.wikimedia.org/r/931578 [10:54:02] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [10:54:58] (03CR) 10CI reject: [V: 04-1] noc: Pass ports without ferm-specific service constants [puppet] - 
10https://gerrit.wikimedia.org/r/931581 (owner: 10Muehlenhoff) [10:57:02] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppetserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:02] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:59:26] (03CR) 10Vgutierrez: [C: 03+1] "acme-chief will be able to issue this one as wikimedia.cloud is hosted by our public DNS infrastructure but it looks weird to have a publi" [puppet] - 10https://gerrit.wikimedia.org/r/931579 (https://phabricator.wikimedia.org/T339905) (owner: 10Arturo Borrero Gonzalez) [10:59:29] (03PS2) 10Muehlenhoff: noc: Pass ports without ferm-specific service constants [puppet] - 10https://gerrit.wikimedia.org/r/931581 [11:00:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: extend ldap-codfw1dev certs for internal names [puppet] - 10https://gerrit.wikimedia.org/r/931579 (https://phabricator.wikimedia.org/T339905) (owner: 10Arturo Borrero Gonzalez) [11:01:22] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (cloudservices2004-dev), Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:01:34] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:49] (03CR) 10Fabfur: [V: 03+1 C: 03+2] hiera: Consolidate http redirection directive across all DCs [puppet] - 10https://gerrit.wikimedia.org/r/929989 (https://phabricator.wikimedia.org/T323557) (owner: 10Fabfur) [11:05:01] (03PS1) 10Muehlenhoff: Tighten type [puppet] - 10https://gerrit.wikimedia.org/r/931582 [11:05:09] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41835/console" [puppet] - 10https://gerrit.wikimedia.org/r/931280 
(https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [11:07:31] (03CR) 10Hashar: [C: 03+1] "Of course I broke it https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/" [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [11:09:47] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931582 (owner: 10Muehlenhoff) [11:10:07] (03CR) 10MVernon: profile::thanos::swift: add machinetranslation user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931297 (https://phabricator.wikimedia.org/T335491) (owner: 10MVernon) [11:10:15] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:10:32] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [11:12:10] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppetserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:01] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:13:39] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:14:24] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:15:05] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:15:10] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:36] (03CR) 10Hashar: [C: 03+1] "Well after a couple more runs 
https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ passes just fine :]" [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [11:22:10] 10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10ssingh) >>! In T337126#8947842, @MoritzMuehlenhoff wrote: >>>! In T337126#8947666, @ssingh wrote: >> Added to the `nda` LDAP group. Please re-open if there are any issues, thanks! > > cn=nda needs a... [11:22:29] (03PS1) 10Majavah: cloudnfs: refactor configuration [puppet] - 10https://gerrit.wikimedia.org/r/931584 [11:22:55] (03CR) 10CI reject: [V: 04-1] cloudnfs: refactor configuration [puppet] - 10https://gerrit.wikimedia.org/r/931584 (owner: 10Majavah) [11:23:38] (03PS2) 10Majavah: cloudnfs: refactor configuration [puppet] - 10https://gerrit.wikimedia.org/r/931584 [11:24:05] (03CR) 10CI reject: [V: 04-1] cloudnfs: refactor configuration [puppet] - 10https://gerrit.wikimedia.org/r/931584 (owner: 10Majavah) [11:24:31] (03PS1) 10Ilias Sarantopoulos: ml-services: correctly override values in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/931585 (https://phabricator.wikimedia.org/T334583) [11:24:56] (03PS3) 10Majavah: cloudnfs: refactor configuration [puppet] - 10https://gerrit.wikimedia.org/r/931584 [11:25:59] 10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10ssingh) [Awaiting user input on email] [11:26:11] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided) [11:27:43] !log jnuche@deploy1002 deploy aborted: (no justification provided) (duration: 01m 32s) [11:29:25] (03PS4) 10Majavah: cloudnfs: refactor configuration [puppet] - 10https://gerrit.wikimedia.org/r/931584 [11:30:28] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 DIFF 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41837/console" [puppet] - 10https://gerrit.wikimedia.org/r/931584 (owner: 10Majavah) [11:33:03] (03CR) 10Klausman: [C: 03+1] ml-services: correctly override values in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/931585 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [11:34:00] (03PS5) 10Majavah: cloudnfs: refactor configuration [puppet] - 10https://gerrit.wikimedia.org/r/931584 [11:35:25] (03CR) 10Ladsgroup: [C: 03+1] "Now that it has access, it can move forward but let's coordinate before deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931570 (owner: 10Majavah) [11:35:44] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3 CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41838/console" [puppet] - 10https://gerrit.wikimedia.org/r/931584 (owner: 10Majavah) [11:36:50] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): investigate using cfssl to provide a itermediate certificate for puppetserver - https://phabricator.wikimedia.org/T339913 (10jbond) [11:37:45] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [11:38:32] (03CR) 10Muehlenhoff: [C: 03+2] Tighten type [puppet] - 10https://gerrit.wikimedia.org/r/931582 (owner: 10Muehlenhoff) [11:39:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cr-labs: Permit bacula backup traffic [homer/public] - 10https://gerrit.wikimedia.org/r/931576 (https://phabricator.wikimedia.org/T338132) (owner: 10Majavah) [11:45:00] 10SRE, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): investigate using cfssl to provide a itermediate certificate for puppetserver - https://phabricator.wikimedia.org/T339913 (10jbond) p:05Triage→03Medium [11:45:58] (03Abandoned) 10Jbond: puppetserver: Add CA 
certificate [puppet] - 10https://gerrit.wikimedia.org/r/931567 (owner: 10Jbond) [11:47:36] (03PS1) 10Jbond: root_pki: add support for creating rsa intermediates [puppet] - 10https://gerrit.wikimedia.org/r/931586 (https://phabricator.wikimedia.org/T339913) [11:48:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41839/console" [puppet] - 10https://gerrit.wikimedia.org/r/931586 (https://phabricator.wikimedia.org/T339913) (owner: 10Jbond) [11:48:39] (03PS2) 10Jbond: root_pki: add support for creating rsa intermediates [puppet] - 10https://gerrit.wikimedia.org/r/931586 (https://phabricator.wikimedia.org/T339913) [11:56:59] (03PS5) 10Btullis: Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [11:59:39] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: refresh A for ns1.openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/931589 (https://phabricator.wikimedia.org/T307357) [11:59:53] hnowlan: yup, PDF is being uploaded P49456 [12:00:21] thankfully the enqueueNextpage seems to be working [12:01:07] Amir1: ah, good! 
[12:02:30] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: cleanup for ns.openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/931590 (https://phabricator.wikimedia.org/T307357) [12:02:34] haha, reducing job concurrency reducing jobs being inserted as well (since there is less jobs to queue more jobs) https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-job=ThumbnailRender&from=now-2d&to=now&viewPanel=1 [12:02:52] (03CR) 10Jbond: "puppet side looks good to me but left some comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/931584 (owner: 10Majavah) [12:04:04] (03CR) 10Jbond: [C: 04-1] "adding back the -1 from taavi see comments" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [12:06:13] (03PS3) 10Jbond: root_pki: add support for creating rsa intermediates [puppet] - 10https://gerrit.wikimedia.org/r/931586 (https://phabricator.wikimedia.org/T339913) [12:07:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/924498 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [12:07:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41841/console" [puppet] - 10https://gerrit.wikimedia.org/r/931586 (https://phabricator.wikimedia.org/T339913) (owner: 10Jbond) [12:07:42] (03CR) 10Jbond: [V: 03+1 C: 03+2] root_pki: add support for creating rsa intermediates [puppet] - 10https://gerrit.wikimedia.org/r/931586 (https://phabricator.wikimedia.org/T339913) (owner: 10Jbond) [12:08:37] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: cleanup for ns.openstack.codfw1dev.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/931590 (https://phabricator.wikimedia.org/T307357) [12:08:39] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: refresh A for ns1.openstack.codfw1dev.wikimediacloud.org 
[dns] - 10https://gerrit.wikimedia.org/r/931589 (https://phabricator.wikimedia.org/T307357) [12:09:56] (03CR) 10Elukey: [C: 03+1] ml-services: correctly override values in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/931585 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [12:10:14] 10SRE, 10Data-Engineering, 10Traffic: Webrequest x_analtics `wprov` value is incorrectly formatted - https://phabricator.wikimedia.org/T339910 (10JAllemandou) [12:10:37] (03CR) 10FNegri: [V: 03+1] cumin: Increase connect_timeout for slow servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [12:11:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] profile::thanos::swift: add machinetranslation user [puppet] - 10https://gerrit.wikimedia.org/r/931297 (https://phabricator.wikimedia.org/T335491) (owner: 10MVernon) [12:12:17] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) [12:12:33] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) 05In progress→03Resolved [12:12:43] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [12:18:45] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (10aborrero) Run: ` update domains set master="185.15.57.25:5354 185.15.57.26:5354 172.20.5.8:5354 172.20.5.9:5354"; ` In both serv... 
[12:20:40] (03PS1) 10Hnowlan: thumbor: don't set x-forwarded-for at haproxy level [deployment-charts] - 10https://gerrit.wikimedia.org/r/931592 (https://phabricator.wikimedia.org/T339863) [12:20:52] (03CR) 10Klausman: [C: 03+2] changeprop: set wiki_id match config for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [12:21:55] (03Merged) 10jenkins-bot: changeprop: set wiki_id match config for outlink stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/930610 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [12:23:23] (03PS1) 10Majavah: P:toolforge::bastion: install envvars-cli [puppet] - 10https://gerrit.wikimedia.org/r/931593 [12:24:30] (03CR) 10David Caro: [C: 03+2] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/931593 (owner: 10Majavah) [12:25:45] !log klausman@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [12:26:04] !log klausman@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [12:26:21] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41842/console" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [12:29:28] (03CR) 10Matthias Mullie: [ImageSuggestions] Process suggestions via job queue rather than sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [12:30:07] (03PS1) 10AOkoth: vrts: otrs1001 decom [puppet] - 10https://gerrit.wikimedia.org/r/931597 (https://phabricator.wikimedia.org/T339253) [12:31:56] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: correctly override values in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/931585 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [12:32:14] 10SRE, 10Traffic: port 80 
paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 (10Vgutierrez) pybal timeout for ProxyFetch is set to 5s while prometheus blackbox http probe timeouts at 3s. this could explain the gap mentioned on https://phabricator.wikimedia.org... [12:32:30] !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [12:32:45] (03Merged) 10jenkins-bot: ml-services: correctly override values in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/931585 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [12:32:49] !log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [12:37:00] !log depooling cp3050 - T339898 [12:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:04] T339898: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 [12:37:43] (03CR) 10AOkoth: [C: 03+2] vrts: otrs1001 decom [puppet] - 10https://gerrit.wikimedia.org/r/931597 (https://phabricator.wikimedia.org/T339253) (owner: 10AOkoth) [12:38:33] (03CR) 10Ladsgroup: [ImageSuggestions] Process suggestions via job queue rather than sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [12:38:38] (03PS2) 10Ladsgroup: [ImageSuggestions] Process suggestions via job queue rather than sync [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [12:38:42] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] [ImageSuggestions] Process suggestions via job queue rather than sync [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [12:40:40] (03PS1) 10Slyngshede: idm.wikimedia.org: Failover test [dns] - 10https://gerrit.wikimedia.org/r/931599 (https://phabricator.wikimedia.org/T338008) 
[12:41:20] (03PS2) 10Slyngshede: idm.wikimedia.org: Failover test [dns] - 10https://gerrit.wikimedia.org/r/931599 (https://phabricator.wikimedia.org/T338008) [12:43:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/930763 (owner: 10Slyngshede) [12:44:55] (03CR) 10Slyngshede: [C: 03+2] P:IDM Failover Redis to CODFW. [puppet] - 10https://gerrit.wikimedia.org/r/930763 (owner: 10Slyngshede) [12:45:10] (03PS1) 10EoghanGaffney: apt: Add jenkins packages to bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/931600 (https://phabricator.wikimedia.org/T334435) [12:45:30] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "Suggestion: Instead of incrementing jobNumber, you can basically give it lastPageId and make the batch go up to the next 1000. So it would" [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [12:46:47] !log aokoth@cumin1001 START - Cookbook sre.hosts.decommission for hosts otrs1001.eqiad.wmnet [12:47:12] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] [ImageSuggestions] Process suggestions via job queue rather than sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924877 (https://phabricator.wikimedia.org/T322872) (owner: 10Matthias Mullie) [12:47:33] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts otrs1001.eqiad.wmnet [12:50:20] (03CR) 10Ottomata: [V: 03+1 C: 03+2] "In case yall missed this one, spark_submit executable is now parameterized, and will need changed when upgrading to spark 3." 
[puppet] - 10https://gerrit.wikimedia.org/r/930713 (https://phabricator.wikimedia.org/T335308) (owner: 10Ottomata) [12:51:02] PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: rq-bitu.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:22] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: disk replacement for an-worker1110.eqiad.wmnet - https://phabricator.wikimedia.org/T336930 (10Jclark-ctr) @BTullis replaced slot 4 hdd with comparable dell 4tb sata drive [12:53:21] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: disk replacement for an-worker1110.eqiad.wmnet - https://phabricator.wikimedia.org/T336930 (10Jclark-ctr) 05Open→03Resolved [12:54:09] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: disk replacement for an-worker1110.eqiad.wmnet - https://phabricator.wikimedia.org/T336930 (10BTullis) Great! Many thanks @Jclark-ctr [12:57:02] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10MoritzMuehlenhoff) [12:57:06] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts parse1002.eqiad.wmnet [12:58:08] !log jclark@cumin1001 START - Cookbook sre.hosts.reboot-single for host parse1002.eqiad.wmnet [12:59:12] (03PS6) 10Btullis: Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T1300). [13:00:06] albertoleoncio and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T1300) [13:00:24] hi [13:00:31] * TheresNoTime will be in a meeting for 30 minutes [13:00:34] hi. i can deploy today. [13:00:38] unless someone else wants to :) [13:00:40] Hi! [13:00:41] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10taavi) [13:00:46] urbanecm: please do [13:00:56] o/ [13:01:03] RECOVERY - Check systemd state on idm2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:23] (03PS4) 10Urbanecm: Enable Extension:Translate on pt.wikisource.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930189 (https://phabricator.wikimedia.org/T339139) (owner: 10Albertoleoncio) [13:01:28] (03CR) 10Urbanecm: [C: 03+2] Enable Extension:Translate on pt.wikisource.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930189 (https://phabricator.wikimedia.org/T339139) (owner: 10Albertoleoncio) [13:02:16] (03Merged) 10jenkins-bot: Enable Extension:Translate on pt.wikisource.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930189 (https://phabricator.wikimedia.org/T339139) (owner: 10Albertoleoncio) [13:04:02] !log Start foreachwikiindblist 'group2 & s1' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all on a tmux in mwmaint1002 (T315510) [13:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:06] !log aokoth@cumin1001 START - Cookbook sre.hosts.decommission for hosts otrs1001.eqiad.wmnet [13:04:06] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [13:04:23] MatmaRex: script running for enwiki/s1 now. 
[13:04:49] thanks [13:04:50] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:930189|Enable Extension:Translate on pt.wikisource.org (T339139)]] [13:04:53] T339139: Enable Extension:Translate on pt.wikisource.org - https://phabricator.wikimedia.org/T339139 [13:05:17] !log Create ext:Translate tables on ptwikisource (T339139) [13:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:12] !log urbanecm@deploy1002 albertoleoncio and urbanecm: Backport for [[gerrit:930189|Enable Extension:Translate on pt.wikisource.org (T339139)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:06:26] albertoleoncio: your patch's at mwdebug1001 now. translate should be available there. [13:06:29] can you check? [13:07:22] (03PS1) 10Bking: wdqs: restore permissions change for wdqs files [cookbooks] - 10https://gerrit.wikimedia.org/r/931603 (https://phabricator.wikimedia.org/T339368) [13:07:33] Looking on it [13:08:29] (03CR) 10Gehel: [C: 03+1] "LGTM (minus the whitespace issue, see inline comment)" [cookbooks] - 10https://gerrit.wikimedia.org/r/931603 (https://phabricator.wikimedia.org/T339368) (owner: 10Bking) [13:08:30] Seems ok! 
=D [13:08:37] agreed, proceeding :) [13:10:11] (03CR) 10CI reject: [V: 04-1] wdqs: restore permissions change for wdqs files [cookbooks] - 10https://gerrit.wikimedia.org/r/931603 (https://phabricator.wikimedia.org/T339368) (owner: 10Bking) [13:10:11] !log aokoth@cumin1001 START - Cookbook sre.dns.netbox [13:10:23] 10SRE, 10ops-eqiad, 10DBA: Power drain + restart idrac for db1119 - https://phabricator.wikimedia.org/T339889 (10Jclark-ctr) @Marostegui Thank you for resolving ticket I am on site right now if you wanted me to look at this ping me on irc if needed [13:12:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931581 (owner: 10Muehlenhoff) [13:13:12] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10Jclark-ctr) 05Open→03Resolved Resolving ticket no additional alerts [13:13:42] !log aokoth@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: otrs1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1001" [13:14:01] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:930189|Enable Extension:Translate on pt.wikisource.org (T339139)]] (duration: 09m 11s) [13:14:05] T339139: Enable Extension:Translate on pt.wikisource.org - https://phabricator.wikimedia.org/T339139 [13:14:09] albertoleoncio: and, deployed. [13:14:12] anything else? [13:15:11] That's all for now. Thanks!! 
[13:15:14] !log aokoth@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: otrs1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1001" [13:15:15] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:15:15] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts otrs1001.eqiad.wmnet [13:15:24] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: Decom otrs1001 - https://phabricator.wikimedia.org/T339253 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by aokoth@cumin1001 for hosts: `otrs1001.eqiad.wmnet` - otrs1001.eqiad.wmnet (**PASS**) - Downtimed host on Icinga... [13:16:01] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: Decom otrs1001 - https://phabricator.wikimedia.org/T339253 (10Arnoldokoth) 05Open→03Resolved [13:16:07] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: upgrade/replace VRTS (formerly OTRS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10Arnoldokoth) [13:16:49] (03PS2) 10Bking: wdqs: restore permissions change for wdqs files [cookbooks] - 10https://gerrit.wikimedia.org/r/931603 (https://phabricator.wikimedia.org/T339368) [13:17:15] 10SRE, 10ops-eqiad, 10DC-Ops: Relabel: puppetserver1005 to puppetserver1001 - https://phabricator.wikimedia.org/T338326 (10Jclark-ctr) 05Open→03Resolved Relabled Servers with puppetserver1001 [13:17:36] (03CR) 10Gehel: [C: 03+1] wdqs: restore permissions change for wdqs files (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/931603 (https://phabricator.wikimedia.org/T339368) (owner: 10Bking) [13:18:24] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: upgrade/replace VRTS (formerly OTRS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10Arnoldokoth) 05Open→03Resolved [13:18:28] 10SRE, 
10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Arnoldokoth) [13:18:28] !log installing python2.7 security updates [13:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:13] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:22:57] !log repooling cp3050 - T339898 [13:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:00] T339898: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 [13:23:54] (03PS1) 10Ayounsi: Remove trusted-space [homer/public] - 10https://gerrit.wikimedia.org/r/931609 [13:27:21] (03PS1) 10AOkoth: vrts: post decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) [13:27:34] (03CR) 10Hashar: [C: 04-1] "I think I am the one that mislead you on that front, sorry for the extra work :-\" [puppet] - 10https://gerrit.wikimedia.org/r/931499 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [13:27:44] (03PS2) 10AOkoth: vrts: post decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) [13:29:10] (03PS1) 10Hnowlan: api-gateway: open access to device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/931612 [13:31:52] (03PS1) 10Ssingh: admin: add dreamyjazz to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/931613 (https://phabricator.wikimedia.org/T337126) [13:32:26] (03CR) 10EoghanGaffney: [C: 03+1] vrts: post decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) (owner: 10AOkoth) [13:33:03] (03CR) 10Bking: [C: 03+2] wdqs: restore permissions change for wdqs files [cookbooks] - 10https://gerrit.wikimedia.org/r/931603 (https://phabricator.wikimedia.org/T339368) 
(owner: 10Bking) [13:35:04] (03PS1) 10Jbond: puppetserver::ca: puppetise the CA [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) [13:35:27] (03CR) 10Ssingh: [C: 03+2] admin: add dreamyjazz to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/931613 (https://phabricator.wikimedia.org/T337126) (owner: 10Ssingh) [13:36:15] (03PS1) 10Muehlenhoff: Add a type to the protocol used in a Ferm service [puppet] - 10https://gerrit.wikimedia.org/r/931617 (https://phabricator.wikimedia.org/T336497) [13:37:29] (03CR) 10CI reject: [V: 04-1] puppetserver::ca: puppetise the CA [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) (owner: 10Jbond) [13:37:40] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339816 (10Jclark-ctr) a:03Jclark-ctr [13:38:18] (03CR) 10CI reject: [V: 04-1] Add a type to the protocol used in a Ferm service [puppet] - 10https://gerrit.wikimedia.org/r/931617 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:38:46] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T339816 (10Jclark-ctr) 05Open→03Resolved robooted idrac , Changed cable , Moved to new port no change. moved to new port on Msw found all port was failing rebooted msw all servers returned with link [13:39:36] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10ssingh) 05Open→03Resolved Marking this as resolved again; thanks for @MoritzMuehlenhoff for pointing out that nda needed an ldap_only_users entry as well. [13:42:45] 10SRE, 10Wikimedia-Mailing-lists: Removal of email address from mailman removed all subscriptions - https://phabricator.wikimedia.org/T339341 (10Ladsgroup) Can you unsub yourself from ops and stewards? 
I'll put you back [13:45:37] (03CR) 10Effie Mouzeli: [C: 03+1] "Nothing to lose tho have a go" [deployment-charts] - 10https://gerrit.wikimedia.org/r/931592 (https://phabricator.wikimedia.org/T339863) (owner: 10Hnowlan) [13:47:26] 10SRE, 10ops-eqiad, 10DBA: Power drain + restart idrac for db1119 - https://phabricator.wikimedia.org/T339889 (10Marostegui) Thank you, everything is looking good so far :) [13:49:01] (03PS2) 10Jbond: puppetserver::ca: puppetise the CA [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) [13:49:28] (03CR) 10Hnowlan: [C: 03+2] api-gateway: open access to device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/931612 (owner: 10Hnowlan) [13:50:16] (03Merged) 10jenkins-bot: api-gateway: open access to device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/931612 (owner: 10Hnowlan) [13:53:31] (03PS3) 10Jbond: puppetserver::ca: puppetise the CA [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) [13:54:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41845/console" [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) (owner: 10Jbond) [13:57:22] (03PS4) 10Jbond: puppetserver::ca: puppetise the CA [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) [13:57:56] (03CR) 10CI reject: [V: 04-1] puppetserver::ca: puppetise the CA [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) (owner: 10Jbond) [13:58:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41846/console" [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) (owner: 10Jbond) [14:01:37] (03CR) 10Muehlenhoff: "The patch is fine as-is, but I 
think moving to a new OS would be a good opportunity to move to more fine-grained scheme for the CI package" [puppet] - 10https://gerrit.wikimedia.org/r/931600 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [14:02:42] (03PS7) 10Btullis: Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [14:03:30] !log fetch HAProxy 2.6.14 on thirdparty/haproxy26 for bullseye (apt.wm.o) [14:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:53] (03PS1) 10Ssingh: P:dns::recursor: remove obsolete buster conditional [puppet] - 10https://gerrit.wikimedia.org/r/931619 [14:06:42] !log test HAProxy 2.6.14 on cp4044 and cp4051 [14:06:47] (03CR) 10Majavah: [C: 04-1] P:dns::recursor: remove obsolete buster conditional (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931619 (owner: 10Ssingh) [14:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:01] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41848/console" [puppet] - 10https://gerrit.wikimedia.org/r/931619 (owner: 10Ssingh) [14:07:27] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4051.ulsfo.wmnet} and A:cp [14:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:53] (03PS8) 10Btullis: Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [14:08:31] taavi: I guess I should start marking patches WIP :) [14:09:01] (03CR) 10Vgutierrez: [C: 03+1] 
role::cache::{text,upload}: move vk codfw instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/931498 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [14:09:11] :P just happened to spot it there and remembered my work on moving to the built-in exporter a while back [14:09:47] yes I realized when removing this that this is not applied anywhere else probably but I wanted that to be a separate patch [14:11:04] (03PS5) 10Jbond: puppetserver::ca: puppetise the CA [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) [14:11:07] (03CR) 10Ssingh: [V: 03+1] P:dns::recursor: remove obsolete buster conditional (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931619 (owner: 10Ssingh) [14:11:16] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4051.ulsfo.wmnet} and A:cp [14:11:47] (03CR) 10Ssingh: [V: 03+2 C: 03+2] P:dns::recursor: remove obsolete buster conditional [puppet] - 10https://gerrit.wikimedia.org/r/931619 (owner: 10Ssingh) [14:13:43] (03PS6) 10Jbond: puppetserver::ca: puppetise the CA [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) [14:14:11] (03PS2) 10Muehlenhoff: Add a type to the protocol used in a Ferm service [puppet] - 10https://gerrit.wikimedia.org/r/931617 (https://phabricator.wikimedia.org/T336497) [14:14:24] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes1*.eqiad.wmnet [14:15:06] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes101[0-9].eqiad.wmnet [14:15:13] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes102[0-9].eqiad.wmnet [14:15:30] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes201[0-9].codfw.wmnet [14:15:37] !log 
hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes202[0-9].codfw.wmnet [14:15:56] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes10[12][0-9].eqiad.wmnet [14:16:06] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes20[12][0-9].codfw.wmnet [14:16:49] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: Add private repo configurations [puppet] - 10https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:17:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:53] (03CR) 10JHathaway: [C: 03+2] "great thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/928665 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [14:17:56] (03CR) 10Jbond: [C: 03+2] puppetserver::ca: puppetise the CA [puppet] - 10https://gerrit.wikimedia.org/r/931615 (https://phabricator.wikimedia.org/T339913) (owner: 10Jbond) [14:18:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/931600 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [14:18:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host parse1002.eqiad.wmnet [14:18:17] (03PS1) 10Ssingh: P:prometheus::pdns_rec_exporter: remove obsolete profile [puppet] - 10https://gerrit.wikimedia.org/r/931620 [14:19:22] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41849/console" [puppet] - 10https://gerrit.wikimedia.org/r/931620 (owner: 10Ssingh) [14:20:03] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations: custom partman 
recipe dumpsdata100X-no-data-format.cfg causes installer to hang at partitioning menu - https://phabricator.wikimedia.org/T339929 (10ArielGlenn) [14:20:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931617 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:21:59] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:prometheus::pdns_rec_exporter: remove obsolete profile [puppet] - 10https://gerrit.wikimedia.org/r/931620 (owner: 10Ssingh) [14:22:28] jbond: I merged your snake oil :) [14:22:35] as an fyi mostly [14:23:23] (03PS9) 10Btullis: Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [14:24:05] (03CR) 10Jforrester: "🎉" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931306 (https://phabricator.wikimedia.org/T128150) (owner: 10Ladsgroup) [14:24:14] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:24:28] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:24:56] (03PS1) 10Jbond: puppetdb: change to alternate port [puppet] - 10https://gerrit.wikimedia.org/r/931621 (https://phabricator.wikimedia.org/T339913) [14:25:37] (03PS10) 10Btullis: Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [14:25:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931617 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:25:55] (03CR) 10Jbond: [C: 03+2] puppetdb: change to alternate port [puppet] - 10https://gerrit.wikimedia.org/r/931621 (https://phabricator.wikimedia.org/T339913) (owner: 10Jbond) [14:26:14] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:26:29] (03PS1) 10Snwachukwu: Migrate refine sanitize to spark3. 
[puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) [14:26:40] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:26:56] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:27:29] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:27:43] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM." [homer/public] - 10https://gerrit.wikimedia.org/r/931576 (https://phabricator.wikimedia.org/T338132) (owner: 10Majavah) [14:28:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cr-labs: Permit bacula backup traffic [homer/public] - 10https://gerrit.wikimedia.org/r/931576 (https://phabricator.wikimedia.org/T338132) (owner: 10Majavah) [14:29:22] (03CR) 10MVernon: [V: 03+2 C: 03+2] thanos: add machinetranslation user [labs/private] - 10https://gerrit.wikimedia.org/r/931296 (https://phabricator.wikimedia.org/T335491) (owner: 10MVernon) [14:29:33] (03CR) 10MVernon: [C: 03+2] profile::thanos::swift: add machinetranslation user [puppet] - 10https://gerrit.wikimedia.org/r/931297 (https://phabricator.wikimedia.org/T335491) (owner: 10MVernon) [14:30:15] (03PS11) 10Btullis: Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) [14:31:22] (03CR) 10Snwachukwu: [C: 03+1] "I only removed the temporary assignment of $spark-submit=/usr/bin/spark2-submit. I prefer we keep the parametization of spark-submit in sp" [puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [14:33:02] (03CR) 10Ottomata: Test Refine_sanitize migration to spark3. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931284 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [14:36:19] !log homer run for CR eqiad/codfw to allow bacula traffic in from cloud-hosts (T338132, T339894) [14:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:25] T338132: cloudcontrol: review connectivity with backup system - https://phabricator.wikimedia.org/T338132 [14:36:25] T339894: cloudservices: codfw1dev: fix backups - https://phabricator.wikimedia.org/T339894 [14:36:50] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [14:36:54] 10SRE, 10Thumbor: Image 429 errors for most images on private wikis - https://phabricator.wikimedia.org/T338765 (10hnowlan) This is in part caused by T339863. Kubernetes hosts are being rate limited incorrectly - however, this is a symptom. The real cause here appears to be that we do not store generated resul... 
[14:37:18] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_machinetranslation:prod.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:38:38] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:39:14] (03CR) 10Fabfur: [C: 03+2] [beta] Update wgCdnServersNoPurge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931550 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur) [14:39:24] (03PS1) 10Vgutierrez: mtail: Track locally processed requests in cache::haproxy [puppet] - 10https://gerrit.wikimedia.org/r/931625 (https://phabricator.wikimedia.org/T339898) [14:39:36] (03CR) 10Fabfur: [C: 03+2] [beta] Update wgCdnServersNoPurge (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931550 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur) [14:40:08] (03Merged) 10jenkins-bot: [beta] Update wgCdnServersNoPurge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931550 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur) [14:40:20] 10SRE, 10Wikimedia-Mailing-lists: Removal of email address from mailman removed all subscriptions - https://phabricator.wikimedia.org/T339341 (10hoo) >>! In T339341#8948969, @Ladsgroup wrote: > Can you unsub yourself from ops and stewards? I'll put you back No, doesn't work. 
[14:40:24] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:22] (03PS1) 10Ssingh: hiera: remove use_linux510_on_buster for cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/931626 [14:42:28] jouncebot: nowandnext [14:42:28] No deployments scheduled for the next 1 hour(s) and 17 minute(s) [14:42:28] In 1 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T1600) [14:42:30] (03PS2) 10Vgutierrez: mtail: Track locally processed requests in cache::haproxy [puppet] - 10https://gerrit.wikimedia.org/r/931625 (https://phabricator.wikimedia.org/T339898) [14:42:36] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [14:43:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:43:50] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-April-June), 10Patch-For-Review: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10MatthewVernon) Account in thanos should be ready now. 
[14:44:07] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/931626/41851/" [puppet] - 10https://gerrit.wikimedia.org/r/931626 (owner: 10Ssingh) [14:46:15] (03CR) 10Ssingh: [C: 03+1] mtail: Track locally processed requests in cache::haproxy [puppet] - 10https://gerrit.wikimedia.org/r/931625 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [14:48:22] (03CR) 10Fabfur: [C: 03+1] mtail: Track locally processed requests in cache::haproxy [puppet] - 10https://gerrit.wikimedia.org/r/931625 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [14:49:53] (03PS1) 10Jbond: puppetserver: increase memory in production [puppet] - 10https://gerrit.wikimedia.org/r/931627 [14:50:09] (03CR) 10Ottomata: "Would it be worth doing this in two patches, first with test/refine_santize.pp, so that we can ensure it works there?" [puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [14:50:18] (03CR) 10CI reject: [V: 04-1] puppetserver: increase memory in production [puppet] - 10https://gerrit.wikimedia.org/r/931627 (owner: 10Jbond) [14:53:44] (03PS3) 10Vgutierrez: haproxy,mtail: Track locally processed requests in cache::haproxy [puppet] - 10https://gerrit.wikimedia.org/r/931625 (https://phabricator.wikimedia.org/T339898) [14:55:09] (03CR) 10Hashar: [C: 03+1] "That is indeed required to upgrade the release* hosts :]" [puppet] - 10https://gerrit.wikimedia.org/r/931600 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney) [14:55:09] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [14:55:13] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [14:56:50] (03PS2) 10Jbond: puppetserver: increase memory in production [puppet] - 10https://gerrit.wikimedia.org/r/931627 [14:58:27] (03PS2) 10Snwachukwu: Migrate test/refine_sanitize to spark3. 
[puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) [14:58:33] (03CR) 10Jbond: [C: 03+2] puppetserver: increase memory in production [puppet] - 10https://gerrit.wikimedia.org/r/931627 (owner: 10Jbond) [14:58:58] (03CR) 10Snwachukwu: Migrate test/refine_sanitize to spark3. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [15:00:11] (03CR) 10Snwachukwu: [C: 03+1] Migrate test/refine_sanitize to spark3. [puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [15:00:27] (03CR) 10Jbond: [C: 03+1] "authorised via gchat" [puppet] - 10https://gerrit.wikimedia.org/r/930859 (https://phabricator.wikimedia.org/T336769) (owner: 10CDanis) [15:08:21] (03CR) 10Dzahn: vrts: post decom cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) (owner: 10AOkoth) [15:09:44] (03CR) 10Dzahn: [C: 03+1] noc: Pass ports without ferm-specific service constants [puppet] - 10https://gerrit.wikimedia.org/r/931581 (owner: 10Muehlenhoff) [15:09:55] (03CR) 10Klausman: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/931630 (owner: 10Klausman) [15:10:14] (03CR) 10Elukey: [C: 03+1] Also filter metawiki from stream for outlink LW service [deployment-charts] - 10https://gerrit.wikimedia.org/r/931630 (owner: 10Klausman) [15:10:54] (03CR) 10Klausman: [C: 03+2] Also filter metawiki from stream for outlink LW service [deployment-charts] - 10https://gerrit.wikimedia.org/r/931630 (owner: 10Klausman) [15:12:36] (03Merged) 10jenkins-bot: Also filter metawiki from stream for outlink LW service [deployment-charts] - 10https://gerrit.wikimedia.org/r/931630 (owner: 10Klausman) [15:13:01] (03PS3) 10AOkoth: vrts: post decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) [15:13:15] (03PS4) 10AOkoth: vrts: post decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) [15:13:20] !log klausman@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [15:13:30] (03CR) 10AOkoth: vrts: post decom cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) (owner: 10AOkoth) [15:13:36] !log klausman@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [15:13:53] !log klausman@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [15:14:14] !log klausman@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [15:14:40] (03CR) 10Dzahn: [C: 03+1] vrts: post decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) (owner: 10AOkoth) [15:18:56] (03PS1) 10Muehlenhoff: Kerberos: Pass firewall settings in tool-agnostic form [puppet] - 10https://gerrit.wikimedia.org/r/931633 [15:20:23] (03CR) 10Muehlenhoff: vrts: post decom cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) (owner: 10AOkoth) [15:22:15] (03CR) 
10Ssingh: "Nice catch!" [puppet] - 10https://gerrit.wikimedia.org/r/931625 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:22:19] (03CR) 10Ssingh: [C: 03+1] haproxy,mtail: Track locally processed requests in cache::haproxy [puppet] - 10https://gerrit.wikimedia.org/r/931625 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [15:23:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931633 (owner: 10Muehlenhoff) [15:25:01] !log installing unbound security updates [15:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:14] (03CR) 10Ottomata: Migrate test/refine_sanitize to spark3. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [15:31:16] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10bd808) [[https://github.com/wikimedia/mediawiki-extensions-LdapAuthentication/blob/master/includes/LdapPrimaryAuthenticationProvider.php#L12... 
[15:32:06] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppetserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:36:46] 10SRE, 10SRE-Access-Requests: Please add Abstract Wiki team members to `deployment` and `deploy-service` prod SRE groups - https://phabricator.wikimedia.org/T339936 (10Jdforrester-WMF) [15:37:00] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team: Please add Abstract Wiki team members to `deployment` and `deploy-service` prod SRE groups - https://phabricator.wikimedia.org/T339936 (10Jdforrester-WMF) [15:37:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:41:00] (03PS17) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) [15:41:02] (03PS2) 10Ahmon Dancy: releases: switch releases to use git::clone checkout method [puppet] - 10https://gerrit.wikimedia.org/r/931256 (https://phabricator.wikimedia.org/T290260) (owner: 10Jbond) [15:41:49] (03CR) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [15:42:09] (03CR) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927750 
(https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [15:42:20] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [15:42:44] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:13] (03PS1) 10Jcrespo: backup: Revert the addition of codfw-dev cloud hosts to the ignore list [puppet] - 10https://gerrit.wikimedia.org/r/931634 (https://phabricator.wikimedia.org/T338132) [15:46:53] (03CR) 10Jcrespo: [C: 03+2] backup: Revert the addition of codfw-dev cloud hosts to the ignore list [puppet] - 10https://gerrit.wikimedia.org/r/931634 (https://phabricator.wikimedia.org/T338132) (owner: 10Jcrespo) [15:53:40] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [15:56:23] (03CR) 10Eevans: [C: 03+2] cassandra: use python3 as python [puppet] - 10https://gerrit.wikimedia.org/r/930738 (https://phabricator.wikimedia.org/T310980) (owner: 10Eevans) [15:57:28] (03PS3) 10Snwachukwu: Migrate refine_sanitize to spark3. 
[puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) [15:58:08] (03PS1) 10Btullis: Bump the version of airflow installed on the analytics_test instance [puppet] - 10https://gerrit.wikimedia.org/r/931637 (https://phabricator.wikimedia.org/T336286) [15:58:29] (03PS1) 10Ottomata: wgEventBusStreamNamesMap - Remove page_change stream name override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931638 (https://phabricator.wikimedia.org/T336817) [15:59:13] (03CR) 10CI reject: [V: 04-1] wgEventBusStreamNamesMap - Remove page_change stream name override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931638 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [16:00:04] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:01:43] (03PS2) 10Ottomata: wgEventBusStreamNamesMap - Remove page_change stream name override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931638 (https://phabricator.wikimedia.org/T336817) [16:03:16] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team: Please add Abstract Wiki team members to `deployment` and `deploy-service` prod SRE groups - https://phabricator.wikimedia.org/T339936 (10Jdforrester-WMF) [16:03:20] (03CR) 10Ottomata: [C: 03+2] wgEventBusStreamNamesMap - Remove page_change stream name override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931638 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [16:03:28] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 129 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:04:31] (03Merged) 10jenkins-bot: wgEventBusStreamNamesMap - Remove page_change stream name override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931638 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [16:05:01] (03CR) 10Joal: [C: 03+1] "LGTM! Thanks Sandra" [puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [16:05:17] (03PS1) 10Ottomata: wgEventStreams - remove unused 'rc' stream names for page_change related streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931642 (https://phabricator.wikimedia.org/T336817) [16:05:39] (03CR) 10Snwachukwu: [C: 03+1] Migrate refine_sanitize to spark3. 
[puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [16:08:37] (03Abandoned) 10BryanDavis: python: Replace --mount with --wsgi-file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/925097 (https://phabricator.wikimedia.org/T337897) (owner: 10BryanDavis) [16:09:49] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Setup Incomplete [16:09:51] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Setup Incomplete [16:09:57] (03CR) 10BryanDavis: "Copy-n-paste fail. This change set became https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/1" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/925097 (https://phabricator.wikimedia.org/T337897) (owner: 10BryanDavis) [16:14:50] !log sudo cumin 'A:cp' 'disable-puppet "merging CR 931626"' [16:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:23] (03CR) 10Ssingh: [C: 03+2] hiera: remove use_linux510_on_buster for cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/931626 (owner: 10Ssingh) [16:16:55] 10SRE, 10Domains: Mark Monitor administration panel (redirects for wikimedia.pl) - https://phabricator.wikimedia.org/T333827 (10BCornwall) [16:17:33] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: wgEventBusStreamNamesMap - Remove page_change stream name override - T336817 (duration: 07m 42s) [16:17:36] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [16:17:50] (03Abandoned) 10Arturo Borrero Gonzalez: cloudservices2005-dev: fix DNS address [puppet] - 10https://gerrit.wikimedia.org/r/930210 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [16:18:06] (03CR) 10Ottomata: [C: 03+2] Migrate refine_sanitize to spark3. 
[puppet] - 10https://gerrit.wikimedia.org/r/931623 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [16:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:18:42] (03CR) 10Ottomata: [C: 03+2] wgEventStreams - remove unused 'rc' stream names for page_change related streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931642 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [16:19:49] (03Merged) 10jenkins-bot: wgEventStreams - remove unused 'rc' stream names for page_change related streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931642 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [16:20:36] (03PS1) 10Jbond: puppetserver: Add code to create gitpuppet user [puppet] - 10https://gerrit.wikimedia.org/r/931645 [16:21:55] !log sudo cumin 'A:cp' 'enable-puppet "merging CR 931626"' [16:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:51] (03CR) 10CI reject: [V: 04-1] puppetserver: Add code to create gitpuppet user [puppet] - 10https://gerrit.wikimedia.org/r/931645 (owner: 10Jbond) [16:23:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:24:54] (03PS1) 10Ottomata: evenstreams - publicly expose mediawiki.page_change.v1 stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/931646 (https://phabricator.wikimedia.org/T336817) [16:25:56] (03PS2) 10Arturo Borrero Gonzalez: cloud: codfw1dev: fix labsldapconfig to use newer server [puppet] - 10https://gerrit.wikimedia.org/r/931287 
(https://phabricator.wikimedia.org/T338779) [16:25:58] (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: refresh ldap hosts [puppet] - 10https://gerrit.wikimedia.org/r/931291 (https://phabricator.wikimedia.org/T338779) [16:28:35] !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: wgEventStreams - remove unused rc stream names for page_change related streams - T336817 (duration: 07m 35s) [16:28:39] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [16:31:28] (03CR) 10Xcollazo: Bump the version of airflow installed on the analytics_test instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931637 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis) [16:31:39] (03CR) 10Jbond: [C: 03+2] new yubikey w/ ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/930859 (https://phabricator.wikimedia.org/T336769) (owner: 10CDanis) [16:33:22] (03CR) 10Ottomata: [C: 04-1] "-1 until we get a privacy review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/931646 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [16:33:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/931563 (owner: 10Majavah) [16:33:52] (03PS2) 10Jbond: puppetserver: Add code to create gitpuppet user [puppet] - 10https://gerrit.wikimedia.org/r/931645 [16:34:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41854/console" [puppet] - 10https://gerrit.wikimedia.org/r/931645 (owner: 10Jbond) [16:35:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/931564 (owner: 10Majavah) [16:36:01] (03CR) 10CI reject: [V: 04-1] puppetserver: Add code to create gitpuppet user [puppet] - 10https://gerrit.wikimedia.org/r/931645 (owner: 10Jbond) [16:42:18] 10SRE, 10DNS: Additional DNS entries for WikiLearn - https://phabricator.wikimedia.org/T339942 (10Ijon) [16:42:54] (03PS3) 10Jbond: puppetserver: Add code to create gitpuppet user [puppet] - 10https://gerrit.wikimedia.org/r/931645 [16:44:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41855/console" [puppet] - 10https://gerrit.wikimedia.org/r/931645 (owner: 10Jbond) [16:44:54] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:44:54] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [16:45:01] (03CR) 10CI reject: [V: 04-1] puppetserver: Add code to create gitpuppet user [puppet] - 10https://gerrit.wikimedia.org/r/931645 (owner: 10Jbond) [16:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:10] (03PS4) 10Jbond: puppetserver: Add code to create gitpuppet user [puppet] - 10https://gerrit.wikimedia.org/r/931645 [16:47:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41856/console" [puppet] - 10https://gerrit.wikimedia.org/r/931645 (owner: 10Jbond) [16:47:48] (03PS1) 10Ssingh: learn.wiki: update DNS records [dns] - 10https://gerrit.wikimedia.org/r/931649 (https://phabricator.wikimedia.org/T339942) [16:48:18] (03CR) 10CI reject: [V: 04-1] puppetserver: Add code to create gitpuppet user [puppet] - 10https://gerrit.wikimedia.org/r/931645 (owner: 10Jbond) [16:49:24] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on 
P{cp3053.esams.wmnet,cp3055.esams.wmnet,cp3057.esams.wmnet,cp3059.esams.wmnet,cp3061.esams.wmnet,cp3063.esams.wmnet,cp3065.esams.wmnet} and A:cp [16:49:54] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on P{cp3053.esams.wmnet,cp3055.esams.wmnet,cp3057.esams.wmnet,cp3059.esams.wmnet,cp3061.esams.wmnet,cp3063.esams.wmnet,cp3065.esams.wmnet} and A:cp [16:50:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:25] (03PS5) 10Jbond: puppetserver: Add code to create gitpuppet user [puppet] - 10https://gerrit.wikimedia.org/r/931645 [16:52:42] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:52:44] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [16:52:57] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:52:59] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [16:53:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41857/console" [puppet] - 10https://gerrit.wikimedia.org/r/931645 (owner: 10Jbond) [16:54:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: Add code to create gitpuppet user [puppet] - 10https://gerrit.wikimedia.org/r/931645 (owner: 10Jbond) [16:55:20] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp3053.esams.wmnet,cp3055.esams.wmnet,cp3057.esams.wmnet,cp3059.esams.wmnet,cp3061.esams.wmnet,cp3063.esams.wmnet,cp3065.esams.wmnet} and A:cp [16:57:27] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10wiki_willy) [16:57:34] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10wiki_willy) [16:59:57] 
!log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T1700) [17:00:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:27] (03PS1) 10Jbond: puppetserver::git: add post-merge to trigger g10k [puppet] - 10https://gerrit.wikimedia.org/r/931653 [17:03:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41858/console" [puppet] - 10https://gerrit.wikimedia.org/r/931653 (owner: 10Jbond) [17:03:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:24] (03CR) 10CI reject: [V: 04-1] puppetserver::git: add post-merge to trigger g10k [puppet] - 10https://gerrit.wikimedia.org/r/931653 (owner: 10Jbond) [17:06:59] (03PS1) 10Kimberly Sarabia: Turn off Zebra test for multiple wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931655 (https://phabricator.wikimedia.org/T337956) [17:08:48] (03PS1) 10Jgreen: Shift fundraising/frdata service cnames back to codfw. 
[dns] - 10https://gerrit.wikimedia.org/r/931656 (https://phabricator.wikimedia.org/T335446) [17:09:41] 10SRE, 10Infrastructure-Foundations: PuppetDB Netbox import script failing for cloudservices2004-dev - https://phabricator.wikimedia.org/T339953 (10cmooney) p:05Triage→03Low [17:12:30] (03PS2) 10Jbond: puppetserver::git: add post-merge to trigger g10k [puppet] - 10https://gerrit.wikimedia.org/r/931653 [17:13:48] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_esams [17:13:54] (03CR) 10Cathal Mooney: Add ferm rule to mark all server traffic as DSCP 0 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931263 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [17:14:12] (03PS3) 10Jbond: puppetserver::git: add post-merge to trigger g10k [puppet] - 10https://gerrit.wikimedia.org/r/931653 [17:15:38] (03CR) 10Dwisehaupt: [C: 03+2] "Shipit. Back in service." [dns] - 10https://gerrit.wikimedia.org/r/931656 (https://phabricator.wikimedia.org/T335446) (owner: 10Jgreen) [17:16:12] (03CR) 10Jbond: [C: 03+2] puppetserver::git: add post-merge to trigger g10k [puppet] - 10https://gerrit.wikimedia.org/r/931653 (owner: 10Jbond) [17:24:57] (03CR) 10Cathal Mooney: Add ferm rule to mark all server traffic as DSCP 0 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931263 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [17:29:38] (03CR) 10Jdlrobson: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931655 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia) [17:35:36] (03PS1) 10Jbond: puppetserver: use sudo to run g10k from post-merge [puppet] - 10https://gerrit.wikimedia.org/r/931659 (https://phabricator.wikimedia.org/T330490) [17:36:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41859/console" [puppet] - 10https://gerrit.wikimedia.org/r/931659 
(https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [17:38:13] (03CR) 10Ottomata: [C: 03+2] Remove dse mediawiki-page-content-change-enrichment and stream-enrichment-poc ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/927224 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [17:38:49] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: use sudo to run g10k from post-merge [puppet] - 10https://gerrit.wikimedia.org/r/931659 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [17:40:47] (03Merged) 10jenkins-bot: Remove dse mediawiki-page-content-change-enrichment and stream-enrichment-poc ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/927224 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [17:44:09] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:44:14] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:44:41] !log remove stream-enrichment-poc namespace and related resources from dse-k8s-eqiad - T325303 [17:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:45] T325303: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 [17:46:36] (03PS1) 10Jbond: puppetserver: fix hook sources [puppet] - 10https://gerrit.wikimedia.org/r/931661 [17:47:07] (03PS2) 10Jbond: puppetserver: fix hook sources [puppet] - 10https://gerrit.wikimedia.org/r/931661 (https://phabricator.wikimedia.org/T330490) [17:47:40] (03CR) 10Jbond: [C: 03+2] puppetserver: fix hook sources [puppet] - 10https://gerrit.wikimedia.org/r/931661 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [17:51:46] (03CR) 10BCornwall: [C: 03+1] learn.wiki: update DNS records [dns] - 10https://gerrit.wikimedia.org/r/931649 (https://phabricator.wikimedia.org/T339942) (owner: 10Ssingh) [17:52:08] (03CR) 10Ssingh: [C: 03+2] learn.wiki: update DNS records [dns] - 10https://gerrit.wikimedia.org/r/931649 
(https://phabricator.wikimedia.org/T339942) (owner: 10Ssingh) [17:52:13] (03PS2) 10Ssingh: learn.wiki: update DNS records [dns] - 10https://gerrit.wikimedia.org/r/931649 (https://phabricator.wikimedia.org/T339942) [17:52:22] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Additional DNS entries for WikiLearn - https://phabricator.wikimedia.org/T339942 (10BCornwall) 05Open→03In progress p:05Triage→03Low a:03ssingh [17:54:04] !log running authdns-update for T339942 [17:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:08] T339942: Additional DNS entries for WikiLearn - https://phabricator.wikimedia.org/T339942 [17:54:10] !log joal@deploy1002 Started deploy [analytics/refinery@181eac6]: Hotfix analytics deploy [analytics/refinery@181eac6] [18:00:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:32] !log joal@deploy1002 Finished deploy [analytics/refinery@181eac6]: Hotfix analytics deploy [analytics/refinery@181eac6] (duration: 06m 22s) [18:01:20] !log joal@deploy1002 Started deploy [analytics/refinery@181eac6] (thin): Hotfix analytics deploy THIN [analytics/refinery@181eac6] [18:01:24] !log joal@deploy1002 Finished deploy [analytics/refinery@181eac6] (thin): Hotfix analytics deploy THIN [analytics/refinery@181eac6] (duration: 00m 04s) [18:01:45] !log joal@deploy1002 Started deploy [analytics/refinery@181eac6] (hadoop-test): Hotfix analytics deploy TEST [analytics/refinery@181eac6] [18:03:37] !log joal@deploy1002 Finished deploy [analytics/refinery@181eac6] (hadoop-test): Hotfix analytics deploy TEST [analytics/refinery@181eac6] (duration: 01m 52s) [18:04:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:55] !log mforns@deploy1002 Started 
deploy [airflow-dags/analytics@d55173d]: (no justification provided) [18:13:07] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@d55173d]: (no justification provided) (duration: 00m 11s) [18:14:59] (03PS1) 10Jbond: puppetserver: add merge_cli [puppet] - 10https://gerrit.wikimedia.org/r/931673 (https://phabricator.wikimedia.org/T330490) [18:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41860/console" [puppet] - 10https://gerrit.wikimedia.org/r/931673 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:16:59] (03CR) 10CI reject: [V: 04-1] puppetserver: add merge_cli [puppet] - 10https://gerrit.wikimedia.org/r/931673 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:17:26] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [18:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:17:45] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:20:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:31] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team: Please add Abstract Wiki team members to `deployment` and `deploy-service` prod SRE groups - https://phabricator.wikimedia.org/T339936 (10ssingh) Hi @Jdforrester-WMF: I am the SRE on clinic duty and happy to 
fulfill this request. Some notes before we c... [18:24:39] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [18:24:49] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:25:16] (03CR) 10Jforrester: [C: 03+2] Change sampling unit & 2 other revisions to wikifunctions.ui stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931316 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [18:26:04] (03Merged) 10jenkins-bot: Change sampling unit & 2 other revisions to wikifunctions.ui stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931316 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [18:26:36] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:26:48] (03PS1) 10Ssingh: admin: update membership for deployment and deploy-service [puppet] - 10https://gerrit.wikimedia.org/r/931675 (https://phabricator.wikimedia.org/T339936) [18:28:34] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[18:28:51] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [18:28:57] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:30:34] (03PS2) 10Jbond: puppetserver: add merge_cli as a seperate module [puppet] - 10https://gerrit.wikimedia.org/r/931673 (https://phabricator.wikimedia.org/T330490) [18:32:49] (03CR) 10CI reject: [V: 04-1] puppetserver: add merge_cli as a seperate module [puppet] - 10https://gerrit.wikimedia.org/r/931673 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:33:23] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [18:35:49] (03CR) 10Eevans: cassandra: add initial support for PKI TLS certs to 4.x (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [18:37:05] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bullseye [18:37:11] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye [18:37:24] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [18:41:15] (03PS3) 10Jbond: puppetserver: add merge_cli as a seperate module [puppet] - 10https://gerrit.wikimedia.org/r/931673 (https://phabricator.wikimedia.org/T330490) [18:44:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41862/console" [puppet] - 10https://gerrit.wikimedia.org/r/931673 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:45:59] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: add merge_cli as a seperate module [puppet] - 10https://gerrit.wikimedia.org/r/931673 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:47:45] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [18:47:53] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:48:31] (03PS1) 10Dzahn: gerrit: backup /home on gerrit servers in Bacula [puppet] - 10https://gerrit.wikimedia.org/r/931680 [18:48:57] (03CR) 10CI reject: [V: 04-1] gerrit: backup /home on gerrit servers in Bacula [puppet] - 10https://gerrit.wikimedia.org/r/931680 (owner: 10Dzahn) [18:49:30] (03PS2) 10Dzahn: gerrit: backup /home on gerrit servers in Bacula [puppet] - 10https://gerrit.wikimedia.org/r/931680 (https://phabricator.wikimedia.org/T336427) [18:49:54] (03CR) 10CI reject: [V: 04-1] gerrit: backup /home on gerrit servers in Bacula [puppet] - 10https://gerrit.wikimedia.org/r/931680 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [18:50:11] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [18:50:16] !log otto@deploy1002 helmfile [staging] DONE 
helmfile.d/services/mw-page-content-change-enrich: apply [18:51:03] (03PS1) 10Jbond: merge_cli: fix source paths [puppet] - 10https://gerrit.wikimedia.org/r/931681 (https://phabricator.wikimedia.org/T330490) [18:52:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41863/console" [puppet] - 10https://gerrit.wikimedia.org/r/931681 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:52:33] (03CR) 10Jbond: [V: 03+1 C: 03+2] merge_cli: fix source paths [puppet] - 10https://gerrit.wikimedia.org/r/931681 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:54:04] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [18:54:12] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:55:56] (03PS4) 10Samtar: IS: Enable Phonos on test2wiki, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930614 (https://phabricator.wikimedia.org/T336763) [19:01:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:04:53] (03PS2) 10Btullis: Bump the version of airflow installed on the analytics_test instance [puppet] - 10https://gerrit.wikimedia.org/r/931637 (https://phabricator.wikimedia.org/T336286) [19:04:55] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [19:05:11] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [19:05:54] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[19:06:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:06:42] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [19:07:57] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:08:01] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:09:58] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [19:10:01] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [19:10:12] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1022.eqiad.wmnet with reason: host reimage [19:11:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:11:37] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [19:11:47] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [19:12:08] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [19:13:03] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS buster [19:13:07] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[19:13:19] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1022.eqiad.wmnet with reason: host reimage [19:15:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:12] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [19:16:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:16:21] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [19:18:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:39] !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [19:29:51] (03PS1) 10Kosta Harlan: Section images: Fix ve.scrollIntoView override [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931084 (https://phabricator.wikimedia.org/T339900) [19:30:37] !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - robh@cumin1001" [19:30:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1022.eqiad.wmnet with OS bullseye [19:30:44] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye completed: - dbproxy1022 (**... [19:33:19] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10RobH) Ok, I figured out the issue on the installations here. The first round of R450s mistakenly came with raid controllers (this has since been corrected in th... [19:33:23] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts parse1002.eqiad.wmnet [19:33:29] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10RobH) [19:33:43] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [19:33:49] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [19:35:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:37:53] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.312 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:40:00] (03PS1) 10Btullis: Add a datahub_kafka_jumbo connection to the analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/931683 (https://phabricator.wikimedia.org/T333004) [19:40:36] (03CR) 10CI reject: [V: 04-1] Add a datahub_kafka_jumbo connection to the analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/931683 (https://phabricator.wikimedia.org/T333004) (owner: 10Btullis) [19:42:07] (03PS2) 10Btullis: Add a datahub_kafka_jumbo connection to the analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/931683 (https://phabricator.wikimedia.org/T333004) [19:42:30] (03PS1) 10Gergő Tisza: Backport translations from master [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931685 (https://phabricator.wikimedia.org/T339225) [19:50:21] (03CR) 10Btullis: [C: 03+2] Add a datahub_kafka_jumbo connection to the analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/931683 (https://phabricator.wikimedia.org/T333004) (owner: 10Btullis) [19:55:28] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10RobH) >>! In T326346#8950646, @RobH wrote: > Ok, I figured out the issue on the installations here. The first round of R450s mistakenly came with raid controlle... 
[19:59:23] (03CR) 10Kosta Harlan: [C: 03+1] Backport translations from master [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931685 (https://phabricator.wikimedia.org/T339225) (owner: 10Gergő Tisza) [19:59:39] (03CR) 10CI reject: [V: 04-1] Section images: Fix ve.scrollIntoView override [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931084 (https://phabricator.wikimedia.org/T339900) (owner: 10Kosta Harlan) [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230620T2000). Please do the needful. [20:00:06] kimberly_sarabia and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:17] hello [20:01:09] * TheresNoTime can deploy :) [20:01:15] thanks [20:01:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931655 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia) [20:01:37] (03CR) 10Btullis: Bump the version of airflow installed on the analytics_test instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931637 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis) [20:02:15] (03Merged) 10jenkins-bot: Turn off Zebra test for multiple wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931655 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia) [20:02:32] !log samtar@deploy1002 Started scap: Backport for [[gerrit:931655|Turn off Zebra test for multiple wikis (T337956)]] [20:02:37] T337956: Turn off Zebra A/B Test - https://phabricator.wikimedia.org/T337956 [20:03:02] (03CR) 10Samtar: "recheck" [extensions/GrowthExperiments] 
(wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931084 (https://phabricator.wikimedia.org/T339900) (owner: 10Kosta Harlan) [20:03:48] o/ [20:03:58] !log samtar@deploy1002 ksarabia and samtar: Backport for [[gerrit:931655|Turn off Zebra test for multiple wikis (T337956)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:04:09] kimberly_sarabia: that's live on mwdebug, can you test? [20:04:13] (03CR) 10CI reject: [V: 04-1] Backport translations from master [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931685 (https://phabricator.wikimedia.org/T339225) (owner: 10Gergő Tisza) [20:05:01] tgr_: o/ looks like your patches are failing CI? [20:06:06] those look like the periodic parser integration test failures [20:06:27] TheresNoTime: sure one moment [20:06:49] I guess it's related to https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/929792 ? [20:07:04] kostajh: they did both fail on the same bit.. :/ [20:07:10] we could backport that to wmf.13 [20:07:54] nvm, I thought it was a test-only change [20:08:13] I don't want to mess with something that changes parser output [20:08:24] yeah :/ [20:08:39] James_F: ^ [20:08:52] kimberly_sarabia: (ack) [20:09:04] Argh. [20:09:27] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2021.codfw.wmnet with OS buster [20:10:00] Did Parsoid change their tests? [20:10:06] kimberly_sarabia: not sure if you meant to DM, but just noting the "lgtm" here :D [20:10:06] TheresNoTime: LGTM. Thanks! [20:10:08] And that shipped too early? [20:10:25] No I didn't mean to DM. sorry. thanks! [20:10:39] James_F: unsure, just seeing these failures - https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-noselenium-docker/109400/console [20:10:52] TheresNoTime: Yes I know. [20:11:10] But the patch in TMH didn't get into this week's branch. [20:11:33] Wait. 
But it should have? [20:12:11] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/TimedMediaHandler/+log/refs/heads/wmf/1.41.0-wmf.13 [20:12:24] there's no train this week? [20:12:32] Oh, right, duh. :-) [20:13:15] But how did the new tests change if there's no backport? [20:13:32] the failing test is /workspace/src/tests/phpunit/suites/ParserIntegrationTest.php , that sounds like a core test [20:13:38] Yeah. [20:13:41] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1027.eqiad.wmnet with OS bullseye [20:13:46] But it's fed data for test cases from extensions. [20:13:49] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye [20:13:56] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1026.eqiad.wmnet with OS bullseye [20:14:03] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye [20:14:12] And auto-discovery of `tests/parser/*`. [20:14:12] but the issue is definitely that TMH patch removing a bunch of data- attributes [20:14:39] Why is Growth pulling TMH in the first place? [20:14:57] (03CR) 10Daniel Kinzler: [C: 03+1] "I was about to +2 when I realized this is a config patch 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/925654 (owner: 10Tim Starling) [20:15:04] Oh, right, it's a gate extension. [20:15:51] Is zuul-cloner somehow wrongly pulling the master branch instead of wmf/1.41.0-wmf.13? 
[20:16:05] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:931655|Turn off Zebra test for multiple wikis (T337956)]] (duration: 13m 32s) [20:16:09] kimberly_sarabia: live on prod [20:16:09] T337956: Turn off Zebra A/B Test - https://phabricator.wikimedia.org/T337956 [20:16:10] https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/931326 looks like it could be the cause, but it's not backported [20:16:29] Ah! You're pulling Parsoid master? [20:16:43] 00:01:03.446 INFO:zuul.Cloner.mediawiki/services/parsoid:upstream repo is missing branch wmf/1.41.0-wmf.13 [20:16:44] 00:01:04.092 INFO:zuul.Cloner.mediawiki/services/parsoid:Falling back to branch master [20:16:44] 00:01:04.213 INFO:zuul.Cloner.mediawiki/services/parsoid:Prepared mediawiki/services/parsoid repo with branch master at commit 6c15fb954fd8f579760c159f18d8feb96204508f [20:16:59] Yeah, don't ever ever ever use Parsoid in your CI, it will always break the world, this is why we tell you not to. :-( [20:17:42] Yeah, you are right, Parsoid runs on master sometimes. We ran into this issue once, Subbu wrote an explanation of why, don't remember where though. [20:18:08] You really should use Parsoid "correctly" as a library rather than trying to run its pre-release master branch. [20:18:27] But https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/931326 is the patch that's blocking you, technically. [20:19:04] (03CR) 10Xcollazo: [C: 03+1] Add a datahub_kafka_jumbo connection to the analytics airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/931683 (https://phabricator.wikimedia.org/T333004) (owner: 10Btullis) [20:19:43] reading through the phab task, the TMH patch seems harmless to backport [20:19:56] James_F: any concerns about doing that? [20:20:10] tgr Should be fine but it'll fail tests with Parsoid instead. 
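(Editorial note: the zuul-cloner lines quoted above show a branch fallback. The repo has no wmf/1.41.0-wmf.13 branch, so master is checked out instead. The snippet below is only a rough sketch of that selection rule, not zuul-cloner's actual code; the function name `resolve_branch` and the branch listings are hypothetical and chosen for illustration.)

```python
def resolve_branch(requested, available, fallback="master"):
    """Return the requested branch if the repo has it, otherwise the fallback.

    Hypothetical helper illustrating the behaviour seen in the log;
    it is not part of zuul-cloner.
    """
    if requested in available:
        return requested
    return fallback


# Assumed branch listings, for illustration only.
repos = {
    "mediawiki/extensions/TimedMediaHandler": {"master", "wmf/1.41.0-wmf.13"},
    "mediawiki/services/parsoid": {"master"},
}

for name, branches in repos.items():
    # A repo without the release branch silently ends up on master,
    # which is how pre-release Parsoid changes reached a wmf.13 CI run.
    print(name, "->", resolve_branch("wmf/1.41.0-wmf.13", branches))
```

(Under these assumed listings, the extension resolves to the release branch while Parsoid resolves to master, matching the mismatch discussed in the channel.)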
[20:20:11] (03CR) 10Xcollazo: [C: 03+1] Bump the version of airflow installed on the analytics_test instance [puppet] - 10https://gerrit.wikimedia.org/r/931637 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis) [20:20:23] TheresNoTime: looks good on prod. TY! [20:21:06] tgr_: I'd recommend disabling the load-Parsoid-twice-for-Growth CI config instead, but I think you depend on that? [20:21:23] tgr_: BUT PERHAPS. [20:21:31] Gah, caps-lock, sorry. [20:21:43] (03PS1) 10Jforrester: Remove unused data attribs on a/v sources [extensions/TimedMediaHandler] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931085 (https://phabricator.wikimedia.org/T199129) [20:21:47] I think it's needed for VE to work in tests? or was at a time? my grasp of CI is very vague [20:23:15] I think it's for checking future-not-release-yet changes in Parsoid? [20:24:40] https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/4bb5140cbcc55aa98b5b4c462f109c5b916f45db [20:25:06] "Note that mediawiki/services/parsoid is cloned by CI from the master branch and thus the CI job will break whenever the branch has changes not back compatible with what is currently pinned in mediawiki/core and mediawiki/vendor." [20:25:13] As hashar said. :-( [20:25:22] https://docs.google.com/document/d/1LyXF606svXvp8wu_6NsZQjaiVtg4F69yqNSwmVlsNcA/edit#heading=h.hhc75wrw4tc7 has some notes from a previous time when a similar thing happened [20:25:54] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1027.eqiad.wmnet with reason: host reimage [20:26:01] taavi: or someone, free to take over deployment if/when ^ gets resolved? [20:26:17] hm? [20:26:23] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1026.eqiad.wmnet with reason: host reimage [20:26:46] we should probably drop the GrowthExperiments -> Flow dependency. [20:27:02] TheresNoTime: I can deploy the patches. [20:27:08] tgr_: okay, thanks. 
[20:27:54] tgr_: Yeah, but you'll still have it from Flow itself, as it is gated. [20:28:36] oh duh. Can we just drop Flow from the gate? It's not really supported anyway. [20:29:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1027.eqiad.wmnet with reason: host reimage [20:29:16] tgr_: Yeah, let me write that first. [20:30:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [20:30:11] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... [20:30:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [20:30:23] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [20:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:39] tgr_: https://gerrit.wikimedia.org/r/c/integration/config/+/931687 [20:31:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1026.eqiad.wmnet with reason: host reimage [20:33:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:02] tgr_: For now, I think deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/931085 is the best step forward. [20:36:23] will do in a sec [20:36:27] Ack. [20:36:32] Will be around if needed. [20:40:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931085 (https://phabricator.wikimedia.org/T199129) (owner: 10Jforrester) [20:40:52] thanks for the help. 
[20:41:15] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [20:42:53] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [20:42:57] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:44:19] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:45:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:46:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1027.eqiad.wmnet with OS bullseye [20:46:32] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye completed: - dbproxy1027 (... [20:46:43] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:47:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:48:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1026.eqiad.wmnet with OS bullseye [20:48:08] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye completed: - dbproxy1026 (... [20:49:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:11] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [20:59:54] (03Merged) 10jenkins-bot: Remove unused data attribs on a/v sources [extensions/TimedMediaHandler] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931085 (https://phabricator.wikimedia.org/T199129) (owner: 10Jforrester) [21:00:08] !log tgr@deploy1002 Started scap: Backport for [[gerrit:931085|Remove unused data attribs on a/v sources (T199129)]] [21:00:19] T199129: Consider a slimmer HTML representation for videos - https://phabricator.wikimedia.org/T199129 [21:00:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:11] (03PS2) 10BCornwall: pybal: Fix hostnames not being sent on alert [puppet] - 10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) [21:01:30] !log tgr@deploy1002 jforrester and tgr: Backport for [[gerrit:931085|Remove unused data attribs on a/v sources (T199129)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:01:31] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp3053.esams.wmnet,cp3055.esams.wmnet,cp3057.esams.wmnet,cp3059.esams.wmnet,cp3061.esams.wmnet,cp3063.esams.wmnet,cp3065.esams.wmnet} and A:cp [21:01:37] (03CR) 10BCornwall: "Thanks for the review. 
Upon revisiting this it became apparent that the work was already done but the hostnames were never presented becau" [puppet] - 10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: 10BCornwall) [21:05:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:36] Hey all - I’d like to get one update for T336027 deployed in /private [21:16:56] (03CR) 10Gergő Tisza: "recheck" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931084 (https://phabricator.wikimedia.org/T339900) (owner: 10Kosta Harlan) [21:17:02] (03CR) 10Gergő Tisza: "recheck" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931685 (https://phabricator.wikimedia.org/T339225) (owner: 10Gergő Tisza) [21:17:07] RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:18:53] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:931085|Remove unused data attribs on a/v sources (T199129)]] (duration: 18m 45s) [21:18:59] T199129: Consider a slimmer HTML representation for videos - https://phabricator.wikimedia.org/T199129 [21:19:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931084 (https://phabricator.wikimedia.org/T339900) (owner: 10Kosta Harlan) [21:20:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by tgr@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931685 (https://phabricator.wikimedia.org/T339225) (owner: 10Gergő Tisza) [21:23:54] 
(03PS1) 10Ottomata: wgEventStreams - use eventgate-analytics-external for canary events of page_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931689 (https://phabricator.wikimedia.org/T336817) [21:25:30] (03CR) 10Ottomata: [C: 03+2] wgEventStreams - use eventgate-analytics-external for canary events of page_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931689 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [21:25:53] (03PS1) 10Aqu: Fix datahub connections [puppet] - 10https://gerrit.wikimedia.org/r/931690 (https://phabricator.wikimedia.org/T333004) [21:25:56] !log Deployed updated mitigation for T336027 [21:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:19] (03Merged) 10jenkins-bot: wgEventStreams - use eventgate-analytics-external for canary events of page_content_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931689 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [21:26:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [21:26:42] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... [21:26:53] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [21:27:00] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [21:27:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (Jclark-ctr) [21:28:04] (PS1) Cathal Mooney: Juniper class-of-service config and updated border-in filter for QoS [homer/public] - https://gerrit.wikimedia.org/r/931691 (https://phabricator.wikimedia.org/T339850) [21:33:27] (PS2) Cathal Mooney: Juniper class-of-service config and updated border-in filter for QoS [homer/public] - https://gerrit.wikimedia.org/r/931691 (https://phabricator.wikimedia.org/T339850) [21:35:12] (CR) Cwhite: [C: +2] opensearch: disable security plugin on codfw [puppet] - https://gerrit.wikimedia.org/r/927771 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite) [21:36:53] !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: wgEventStreams - page_content_change should use eventgate-analytics-external for canary events - T336817 (duration: 07m 22s) [21:36:58] T336817: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 [21:40:12] (PS1) JHathaway: sshd: don't add AuthorizedKeysFile when we have no keys [puppet] - https://gerrit.wikimedia.org/r/931693 (https://phabricator.wikimedia.org/T337972) [21:41:01] (PS1) JHathaway: dev env: sshd, allow for user CA based auth [puppet] - https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) [21:46:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:48:25] (PS1) JHathaway: dev env: rsyslog exporter, in container env listen on all interfaces [puppet] - https://gerrit.wikimedia.org/r/931695 (https://phabricator.wikimedia.org/T337972) [21:48:50] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/931693 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway) [21:48:57] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/931694 
(https://phabricator.wikimedia.org/T337972) (owner: JHathaway) [21:49:10] (Merged) jenkins-bot: Section images: Fix ve.scrollIntoView override [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - https://gerrit.wikimedia.org/r/931084 (https://phabricator.wikimedia.org/T339900) (owner: Kosta Harlan) [21:49:13] (Merged) jenkins-bot: Backport translations from master [extensions/GrowthExperiments] (wmf/1.41.0-wmf.13) - https://gerrit.wikimedia.org/r/931685 (https://phabricator.wikimedia.org/T339225) (owner: Gergő Tisza) [21:49:20] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/931695 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway) [21:49:27] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/931695 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway) [21:49:29] !log tgr@deploy1002 Started scap: Backport for [[gerrit:931084|Section images: Fix ve.scrollIntoView override (T339900 T335209)]], [[gerrit:931685|Backport translations from master (T339225)]] [21:49:30] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway) [21:49:33] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/931693 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway) [21:51:44] T339900: Uncaught TypeError: can't access property "then", ve.scrollIntoView(...) 
is undefined - https://phabricator.wikimedia.org/T339900 [21:51:45] T335209: Section-level images: suggestions mode - https://phabricator.wikimedia.org/T335209 [21:51:45] T339225: Section-Level images: Backport latest translations - https://phabricator.wikimedia.org/T339225 [21:59:23] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2021.codfw.wmnet with OS buster [21:59:55] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_esams [22:01:35] !log tgr@deploy1002 tgr and kharlan: Backport for [[gerrit:931084|Section images: Fix ve.scrollIntoView override (T339900 T335209)]], [[gerrit:931685|Backport translations from master (T339225)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [22:01:41] T339900: Uncaught TypeError: can't access property "then", ve.scrollIntoView(...) is undefined - https://phabricator.wikimedia.org/T339900 [22:01:42] T335209: Section-level images: suggestions mode - https://phabricator.wikimedia.org/T335209 [22:01:42] T339225: Section-Level images: Backport latest translations - https://phabricator.wikimedia.org/T339225 [22:11:59] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:931084|Section images: Fix ve.scrollIntoView override (T339900 T335209)]], [[gerrit:931685|Backport translations from master (T339225)]] (duration: 22m 30s) [22:12:06] T339900: Uncaught TypeError: can't access property "then", ve.scrollIntoView(...) 
is undefined - https://phabricator.wikimedia.org/T339900 [22:12:06] T335209: Section-level images: suggestions mode - https://phabricator.wikimedia.org/T335209 [22:12:06] T339225: Section-Level images: Backport latest translations - https://phabricator.wikimedia.org/T339225 [22:13:50] !log UTC late backports done [22:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:45] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [22:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:19:46] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2021.codfw.wmnet with reason: host reimage [22:22:14] (PS1) Gergő Tisza: GrowthExperiments: Deploy section-level images structured task [mediawiki-config] - https://gerrit.wikimedia.org/r/931696 (https://phabricator.wikimedia.org/T339126) [22:23:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [22:23:18] SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... [22:37:05] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bullseye [22:37:13] SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye [22:39:36] (Nonwrite HTTP requests with primary DB connections alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [22:47:50] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2021.codfw.wmnet with OS buster [22:59:36] (Nonwrite HTTP requests with primary DB connections alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [23:00:27] PROBLEM - Check systemd state on wdqs2012 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:03] (PS1) Dzahn: site: add buster people VMs to insetup role for decom [puppet] - https://gerrit.wikimedia.org/r/931699 (https://phabricator.wikimedia.org/T338827) [23:33:23] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1023.eqiad.wmnet with OS bullseye [23:33:29] SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors: - db... 
[23:50:41] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:51:01] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:55:41] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:59:53] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status