[00:00:11] (03CR) 10Jcrespo: "We backup gerrit repositories every hour in order to at most lose 60 minute of changes, please don't do that for /home unless there is a r" [puppet] - 10https://gerrit.wikimedia.org/r/931714 (owner: 10Jcrespo) [00:00:39] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [00:01:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [00:01:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:01:25] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache phab-test1001.eqiad.wmnet on all recursors [00:01:26] (03CR) 10Jcrespo: "See my sugestion." [puppet] - 10https://gerrit.wikimedia.org/r/931714 (owner: 10Jcrespo) [00:01:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) phab-test1001.eqiad.wmnet on all recursors [00:01:54] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [00:02:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [00:05:55] (03PS2) 10Dzahn: gerrit: use default job defaults for home dir backup [puppet] - 10https://gerrit.wikimedia.org/r/931714 (https://phabricator.wikimedia.org/T336427) (owner: 10Jcrespo) [00:06:57] (03CR) 10Dzahn: gerrit: use default job defaults for home dir backup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931714 (https://phabricator.wikimedia.org/T336427) (owner: 10Jcrespo) [00:07:39] !log dzahn@cumin1001 START 
- Cookbook sre.hosts.reimage for host phab-test1001.eqiad.wmnet with OS buster [00:07:42] (03CR) 10Dzahn: [C: 03+2] gerrit: use default job defaults for home dir backup [puppet] - 10https://gerrit.wikimedia.org/r/931714 (https://phabricator.wikimedia.org/T336427) (owner: 10Jcrespo) [00:08:00] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [00:09:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [00:09:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1023.eqiad.wmnet with OS bullseye [00:09:14] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye completed: - dbproxy1023 (... [00:10:14] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1025.eqiad.wmnet with OS bullseye [00:10:21] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye [00:10:34] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 774 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:15:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:16] (03CR) 10Jcrespo: [C: 03+2] gerrit: use default job defaults for home dir backup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931714 (https://phabricator.wikimedia.org/T336427) (owner: 10Jcrespo) [00:22:26] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1025.eqiad.wmnet with reason: host reimage [00:22:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:25:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1025.eqiad.wmnet with reason: host reimage [00:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:00] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host phab-test1001.eqiad.wmnet with OS buster [00:33:00] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host phab-test1001.eqiad.wmnet [00:33:07] (03PS1) 10Dzahn: install_server: add netboot line for phab-test [puppet] - 10https://gerrit.wikimedia.org/r/932035 (https://phabricator.wikimedia.org/T335080) 
[00:33:26] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:33:53] (03CR) 10Dzahn: [C: 03+2] install_server: add netboot line for phab-test [puppet] - 10https://gerrit.wikimedia.org/r/932035 (https://phabricator.wikimedia.org/T335080) (owner: 10Dzahn) [00:35:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:04] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia, AS1299/IPv6: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:39:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/931912 [00:39:41] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/931912 (owner: 10TrainBranchBot) [00:40:25] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host phab-test1001.eqiad.wmnet [00:40:26] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [00:40:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1025.eqiad.wmnet with OS bullseye [00:40:50] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye completed: - dbproxy1025 (... [00:41:27] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. 
- https://phabricator.wikimedia.org/T326346 (10Jclark-ctr) [00:41:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Jclark-ctr) 05Open→03Resolved [00:46:22] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [00:46:26] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host phab-test1001.eqiad.wmnet [01:00:13] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/931912 (owner: 10TrainBranchBot) [01:13:30] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41911/console" [puppet] - 10https://gerrit.wikimedia.org/r/930739 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [01:14:21] (03CR) 10RLazarus: [V: 03+1 C: 03+2] deployment_server: Add opentelemetry-collector kubeconfig [puppet] - 10https://gerrit.wikimedia.org/r/930739 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [01:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:02] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [01:37:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The 
following units failed: monitor_refine_event.service,produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:58:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:58:48] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:58:58] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:46] (03PS2) 10RLazarus: admin_ng: Add namespace for opentelemetry-collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/930740 (https://phabricator.wikimedia.org/T320564) [02:19:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:27:48] (03CR) 10RLazarus: [C: 03+2] admin_ng: Add namespace for opentelemetry-collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/930740 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [02:30:06] (03Merged) 10jenkins-bot: admin_ng: Add namespace for opentelemetry-collector [deployment-charts] - 10https://gerrit.wikimedia.org/r/930740 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [02:32:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:35:30] !log rzl@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [02:37:24] !log rzl@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [02:47:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:51:37] !log rzl@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [02:52:24] !log rzl@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [02:57:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [03:05:28] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [03:07:06] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [03:16:33] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [03:17:10] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[04:04:29] (03PS1) 10KartikMistry: Update cxserver to 2023-06-21-112200-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/932042 (https://phabricator.wikimedia.org/T339896) [04:58:24] (03PS1) 10Marostegui: report_users.sh: Add dbproxy1022 [software] - 10https://gerrit.wikimedia.org/r/932045 (https://phabricator.wikimedia.org/T337812) [05:00:18] (03CR) 10Marostegui: [C: 03+2] report_users.sh: Add dbproxy1022 [software] - 10https://gerrit.wikimedia.org/r/932045 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [05:00:51] (03Merged) 10jenkins-bot: report_users.sh: Add dbproxy1022 [software] - 10https://gerrit.wikimedia.org/r/932045 (https://phabricator.wikimedia.org/T337812) (owner: 10Marostegui) [05:14:50] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [05:16:54] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 for dbproxy1022 - marostegui@cumin1001" [05:17:04] (03PS1) 10ArielGlenn: rmeove bytemark from list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/932046 (https://phabricator.wikimedia.org/T217549) [05:17:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 for dbproxy1022 - marostegui@cumin1001" [05:17:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:18:50] (03PS2) 10ArielGlenn: remove bytemark from list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/932046 (https://phabricator.wikimedia.org/T217549) [05:20:01] (03CR) 10ArielGlenn: [C: 03+2] remove bytemark from list of dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/932046 (https://phabricator.wikimedia.org/T217549) (owner: 10ArielGlenn) [05:32:32] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [05:34:55] 10SRE, 10Dumps-Generation, 10Wikidata, 10observability, and 2 others: various weekly and daily dumps run from systemd timers are broken - https://phabricator.wikimedia.org/T281267 (10ArielGlenn) @fgiunchedi I notice that in some cases phab tasks are autocreated when systemd units fail. Is that true for sys... [05:57:03] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2021.* [05:58:29] (03CR) 10Muehlenhoff: [C: 03+2] Use sprintf() to build the config file [puppet] - 10https://gerrit.wikimedia.org/r/931901 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [05:58:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T0600) [06:00:05] kormat, marostegui, and Amir1: Time to snap out of that daydream and deploy Primary database switchover. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T0600). [06:01:34] marostegui: Let me know if it is OK to deploy cxserver when work on this ^^ window is done. 
[06:01:43] kart_: you can go now, we are not deploying [06:02:07] kart_: unless told otherwise, you can assume those windows have nothing to deploy [06:07:54] (03CR) 10Elukey: [C: 03+1] analytics: Remove analytics106[1-3] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930581 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [06:08:58] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:11:16] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:20:41] (03PS5) 10Abijeet Patro: TranslationNotifications: Run UnsubscribeInactiveUsers periodically [puppet] - 10https://gerrit.wikimedia.org/r/928159 (https://phabricator.wikimedia.org/T323192) [06:20:44] (03CR) 10Abijeet Patro: TranslationNotifications: Run UnsubscribeInactiveUsers periodically (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928159 (https://phabricator.wikimedia.org/T323192) (owner: 10Abijeet Patro) [06:27:09] (03CR) 10Slyngshede: [C: 03+2] P:idp:services Add netbox_oidc [puppet] - 10https://gerrit.wikimedia.org/r/931937 (owner: 10Slyngshede) [06:27:12] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [06:27:50] marostegui: cool. 
[06:28:51] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-06-21-112200-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/932042 (https://phabricator.wikimedia.org/T339896) (owner: 10KartikMistry) [06:29:10] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 for dbproxy1023 - marostegui@cumin1001" [06:29:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 for dbproxy1023 - marostegui@cumin1001" [06:29:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:30:04] (03Merged) 10jenkins-bot: Update cxserver to 2023-06-21-112200-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/932042 (https://phabricator.wikimedia.org/T339896) (owner: 10KartikMistry) [06:30:09] (03PS1) 10TChin: eventstreams use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/932165 [06:31:37] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:32:00] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:32:26] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [06:32:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:33:40] (03CR) 10Muehlenhoff: [C: 03+2] Retire legacy "raid" fact [puppet] - 10https://gerrit.wikimedia.org/r/931500 (https://phabricator.wikimedia.org/T313312) (owner: 10Muehlenhoff) [06:34:28] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 for 
dbproxy102[47] - marostegui@cumin1001" [06:35:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 for dbproxy102[47] - marostegui@cumin1001" [06:35:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:35:17] (03PS2) 10TChin: [WIP] eventstreams use kafka egress and service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/932165 [06:35:31] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:36:05] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:37:09] (03PS1) 10Muehlenhoff: orchestrator: Remove ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/932166 [06:38:02] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:38:39] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:39:01] !log Updated cxserver to 2023-06-21-112200-production (T339896, T338123) [06:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:06] T339896: Enable MinT for all languages supported by IndicTrans2 - https://phabricator.wikimedia.org/T339896 [06:39:07] T338123: Enable MinT, Content and Section Translation for a 4th group of languages previously lacking machine translation - https://phabricator.wikimedia.org/T338123 [06:42:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:50:58] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install 
dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Marostegui) [06:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:54:27] (03PS1) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [06:55:11] !log rsync in ariel screensession on dumpsdata1003 pulling from dumpsdata1004, bwlimit 100000 (=1G) of misc dumps files [06:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:13] (03CR) 10CI reject: [V: 04-1] sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [07:00:05] Amir1, apergos, and jnuche: How many deployers does it take to do UTC morning backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T0700). [07:00:18] (03CR) 10Marostegui: [C: 03+1] orchestrator: Remove ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/932166 (owner: 10Muehlenhoff) [07:00:19] morning! there are no trainees signed up today for deployment training, and no patches scheduled for deployment during this window either, a perfect match. have a nice quiet rest of the week everybody, and see you next time! 
[07:04:23] (03PS4) 10Muehlenhoff: Switch all uses priority of ferm::service to numeric values [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) [07:06:07] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:06:19] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:41] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:11:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:12:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [07:20:09] (03PS2) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [07:22:45] (03CR) 10CI reject: [V: 04-1] sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [07:34:59] (03PS3) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [07:35:24] Is there no train this week? [07:36:16] Jhs: nope [07:36:29] any particular reason? 
[07:36:50] Juneteenth and holiday in WMF in US [07:37:01] ah, makes sense [07:37:28] (03CR) 10CI reject: [V: 04-1] sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [07:37:42] Amir1, do you think we could backport a small change to the Incubator extension then? I've been waiting for it to come with the train to do some stuff on-wiki [07:38:02] sure, what is it? [07:38:33] well, everything up to the current master version really (most importantly including updates from Translatewiki) [07:39:16] translatewiki updates trigger a full rebuild and take around an hour even to be deployed :( [07:39:36] ah, okay. i'll just wait for next week then, no problem [07:42:29] sorry :( [07:43:00] !log installing containerd security updates [07:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:46] Amir1, absolutely no problem, nothing depends on it except my impatience, haha [07:43:55] :D [07:46:50] (03PS4) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [07:48:19] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Aklapper) @neriah: No but anyone who wants to discuss this topic is free to post on the mailing list and explain the topic. 
[07:49:15] (03CR) 10CI reject: [V: 04-1] sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [07:56:46] (03PS5) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [07:59:05] (03CR) 10CI reject: [V: 04-1] sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [08:00:01] (03PS8) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [08:00:03] (03PS8) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [08:01:37] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, and 2 others: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Volans) @Dzahn I've deleted both IPs, nothing to sync as their DNS was managed manually and not via netbox: https://n... 
[08:06:24] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [08:06:59] (03CR) 10Volans: [C: 04-1] "it can be done with the current implementation" [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [08:07:04] (03PS6) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [08:08:53] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:09:25] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:10:59] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:33] (03CR) 10Volans: "clarified comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [08:15:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:17:52] (03PS7) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [08:18:00] (03CR) 10Jbond: [C: 03+1] "lgtm but lets get a +1 from moritz and taavi (as the only user of the code path) if we can" [puppet] - 
10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:19:54] (03PS8) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [08:19:59] (03CR) 10CI reject: [V: 04-1] sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [08:20:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:20:44] (03PS9) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [08:20:59] (03CR) 10Majavah: [C: 04-1] dev env: sshd, allow for user CA based auth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:21:50] (03PS5) 10Jbond: sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) [08:22:26] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test 931926 - jbond@cumin2002" [08:23:08] (03CR) 10CI reject: [V: 04-1] sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [08:23:09] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:23:27] !log 
jbond@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "test 931926 - jbond@cumin2002" [08:23:49] (03CR) 10Jbond: "thanks updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [08:24:03] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [08:25:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:25:24] (03PS1) 10Vgutierrez: hiera: Tighten haproxy port 80 timeouts globally [puppet] - 10https://gerrit.wikimedia.org/r/932173 (https://phabricator.wikimedia.org/T339898) [08:25:26] (03PS1) 10Vgutierrez: haproxy: Set port 80 maxconns to 2000 [puppet] - 10https://gerrit.wikimedia.org/r/932174 (https://phabricator.wikimedia.org/T339898) [08:26:09] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:26:42] (03PS10) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [08:29:22] (03PS1) 10Daniel Kinzler: enwiki: Disable PC writes in parsoid endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) [08:30:22] (03PS9) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) 
[08:30:24] (03PS9) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [08:30:31] (03PS2) 10Daniel Kinzler: enwiki: Disable PC writes in parsoid endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/932175 (https://phabricator.wikimedia.org/T339867) [08:30:50] (03CR) 10Jbond: [C: 04-1] "-1: see inline" [puppet] - 10https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: 10BCornwall) [08:35:48] (03CR) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [08:35:56] (03CR) 10Jbond: "lgtm but lets drop the superfluous requires" [puppet] - 10https://gerrit.wikimedia.org/r/932013 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:37:27] (03CR) 10Jbond: "ditto lgtm but lets drop the require" [puppet] - 10https://gerrit.wikimedia.org/r/932014 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [08:37:47] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41916/console" [puppet] - 10https://gerrit.wikimedia.org/r/932173 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [08:37:55] (03CR) 10Ladsgroup: "This change is ready for review." 
[extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931720 (owner: 10Ladsgroup) [08:43:12] (03PS10) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [08:43:14] (03PS10) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [08:45:22] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932015 (https://phabricator.wikimedia.org/T339913) (owner: 10JHathaway) [08:48:57] (03PS11) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [08:49:10] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Tighten haproxy port 80 timeouts globally [puppet] - 10https://gerrit.wikimedia.org/r/932173 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) [08:49:34] (03CR) 10Jaime Nuche: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/931912 (owner: 10TrainBranchBot) [08:50:34] !log tighten HAProxy timeouts on port 80 globally - T339898 [08:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:38] T339898: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 [08:52:09] (03PS11) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [08:52:45] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC https://puppet-compiler.wmflabs.org/output/931968/41918/" [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [08:52:58] 
(03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932016 (https://phabricator.wikimedia.org/T339913) (owner: 10JHathaway) [08:56:02] (03CR) 10Jbond: [C: 03+1] "not tested but lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [08:56:39] (03PS1) 10Muehlenhoff: ferm::service: Fix sprintf to pad zeros [puppet] - 10https://gerrit.wikimedia.org/r/932180 (https://phabricator.wikimedia.org/T336497) [08:56:43] (03PS1) 10Jelto: miscweb: lower certificate_expiry_days to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/932181 (https://phabricator.wikimedia.org/T339862) [08:58:53] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/931964/41921/" [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [09:00:50] (03CR) 10Jbond: [C: 03+1] "LGTM but see inline" [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [09:03:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932180 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:08:48] (03CR) 10Muehlenhoff: ferm::service: Fix sprintf to pad zeros (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932180 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:09:23] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/931912 (owner: 10TrainBranchBot) [09:09:26] (03PS2) 10EoghanGaffney: releases: Add motd warning about upcoming host change [puppet] - 10https://gerrit.wikimedia.org/r/932026 [09:09:52] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Set port 80 maxconns to 2000 [puppet] - 10https://gerrit.wikimedia.org/r/932174 (https://phabricator.wikimedia.org/T339898) (owner: 10Vgutierrez) 
[09:11:29] (03CR) 10CI reject: [V: 04-1] releases: Add motd warning about upcoming host change [puppet] - 10https://gerrit.wikimedia.org/r/932026 (owner: 10EoghanGaffney) [09:12:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:12:38] !log increasing maxconns to 2000 in haproxy for port 80 - T339898 [09:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:42] T339898: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 [09:13:00] (03PS3) 10EoghanGaffney: releases: Add motd warning about upcoming host change [puppet] - 10https://gerrit.wikimedia.org/r/932026 [09:14:39] (03PS12) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) [09:14:41] (03PS12) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [09:15:07] (03CR) 10Arturo Borrero Gonzalez: openldap: main-acls.erb: support keystone hosts without AAAA (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [09:17:57] 10SRE, 10Traffic: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 (10Vgutierrez) 05Open→03Resolved Mitigated by tightening port 80 timeouts (https://gerrit.wikimedia.org/r/c/operations/puppet/+/932173/1/hieradata/common/profile/cache/haproxy.yam... 
[09:20:49] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/931968/41923/" [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [09:22:40] (03PS9) 10Jbond: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [09:23:49] (03PS1) 10Slyngshede: LDAP Attributes: Move actions and tooltip to templatetag [software/bitu] - 10https://gerrit.wikimedia.org/r/932183 [09:25:12] (03CR) 10Ladsgroup: [C: 03+2] Fix adding a domain when the page doesn't exist [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931720 (owner: 10Ladsgroup) [09:25:22] (03CR) 10CI reject: [V: 04-1] Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [09:26:05] (03PS13) 10Arturo Borrero Gonzalez: codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) [09:26:26] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [09:26:50] (03CR) 10Jbond: Add cookbook to handle restarts of Wikimedia DNS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [09:27:24] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [09:27:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:28:27] (03CR) 10Jbond: Add cookbook 
to handle restarts of Wikimedia DNS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [09:29:17] (03PS7) 10Ladsgroup: mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) [09:29:27] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [09:29:46] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [09:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:02] (03CR) 10Jbond: [C: 03+1] "thanks for the explanation" [puppet] - 10https://gerrit.wikimedia.org/r/932180 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:32:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931720 (owner: 10Ladsgroup) [09:33:03] (03CR) 10Gmodena: [C: 03+1] evenstreams - publicly expose mediawiki.page_change.v1 stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/931646 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [09:33:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [09:33:15] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [09:33:28] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openldap: main-acls.erb: support keystone hosts without AAAA [puppet] - 10https://gerrit.wikimedia.org/r/931968 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [09:33:38] !log root@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [09:34:20] (03PS11) 10Muehlenhoff: sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) [09:34:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: services: override cloudcontrol FQDN [puppet] - 10https://gerrit.wikimedia.org/r/931964 (https://phabricator.wikimedia.org/T340047) (owner: 10Arturo Borrero Gonzalez) [09:37:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:31] (03CR) 10Muehlenhoff: "Pushed one fixup after successfully testing with the incredibly useful test-cookbook script." [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [09:38:49] 10SRE, 10Traffic: Webrequests live data shows traffic without TLS on varnish for upload.w.o - https://phabricator.wikimedia.org/T340097 (10Vgutierrez) This seems like a varnish (VCL) bug. Varnish is getting requests with X-Connection-Properties header set but it's failing to issue the expected X-Analytics-TLS... 
[09:40:30] !log root@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [09:40:31] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [09:42:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:42:48] (03Merged) 10jenkins-bot: Fix adding a domain when the page doesn't exist [extensions/AbuseFilter] (wmf/1.41.0-wmf.13) - 10https://gerrit.wikimedia.org/r/931720 (owner: 10Ladsgroup) [09:43:12] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:931720|Fix adding a domain when the page doesn't exist]] [09:44:30] (03CR) 10Muehlenhoff: ferm::service: Fix sprintf to pad zeros (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932180 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:44:37] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:931720|Fix adding a domain when the page doesn't exist]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [09:44:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932180 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:45:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:48] 10SRE-tools, 10Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (10jbond) Before we implement this it would be useful to understand further why this needs to be adjusted at run time, this feels
inherently wrong to me... [09:47:11] (03Abandoned) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [09:47:30] (03Restored) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [09:48:05] (03CR) 10Jbond: "actually I changed my mind, we may be able to rebase this on master once ferm::services has been refactored, I'll leave it for a bit" [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [09:48:10] 10SRE, 10Traffic: Webrequests live data shows traffic without TLS on varnish for upload.w.o - https://phabricator.wikimedia.org/T340097 (10Vgutierrez) Full log of a request showing the misbehaviour: `counterexample * << Request >> 713485145... [09:49:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:46] (03PS8) 10Ladsgroup: mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) [09:51:17] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:931720|Fix adding a domain when the page doesn't exist]] (duration: 08m 05s) [09:52:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:52:48] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.drain-node: Add the option to reboot the drained node [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner:
10Muehlenhoff) [09:55:48] 10SRE, 10Traffic: Webrequests live data shows traffic without TLS on varnish for upload.w.o - https://phabricator.wikimedia.org/T340097 (10Vgutierrez) as shown on the full request example, this is happening on request restarts: `Begin req 713485144 restart` and our VCL logic excludes from TLS data be... [10:00:05] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T1000). [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T1000) [10:00:21] (03PS1) 10Fabfur: hiera: Added new bullseye instance for cache-upload in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/932189 (https://phabricator.wikimedia.org/T327742) [10:00:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:34] 10SRE, 10Traffic: Webrequests live data shows traffic without TLS on varnish for upload.w.o - https://phabricator.wikimedia.org/T340097 (10Vgutierrez) Regarding healthcheck.wikimedia.org those are actually plain text requests being issued by the UDS healthcheck: ` * << Request >> 621520726 - Begin... 
[10:01:55] (03CR) 10Muehlenhoff: firewall: migrate ferm::service to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [10:02:26] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930602 (owner: 10PipelineBot) [10:03:03] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930602 (owner: 10PipelineBot) [10:05:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:33] !log installing Apache security updates on Bullseye [10:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:14:49] (03PS1) 10Mvolz: Remove WorldCat references [deployment-charts] - 10https://gerrit.wikimedia.org/r/932194 (https://phabricator.wikimedia.org/T336297) [10:17:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:21:26] (03PS1) 10Ilias Sarantopoulos: ml-services: upgrade transformers [deployment-charts] - 10https://gerrit.wikimedia.org/r/932196 (https://phabricator.wikimedia.org/T334583) [10:21:54] (03PS2) 10Ilias Sarantopoulos: ml-services: upgrade transformers [deployment-charts] - 10https://gerrit.wikimedia.org/r/932196 
(https://phabricator.wikimedia.org/T334583) [10:22:25] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [10:22:53] (03PS3) 10Ilias Sarantopoulos: ml-services: upgrade llm image with new version of transformers [deployment-charts] - 10https://gerrit.wikimedia.org/r/932196 (https://phabricator.wikimedia.org/T334583) [10:22:59] (03PS1) 10MVernon: hiera: set ms-be2068 to be an object expirer [puppet] - 10https://gerrit.wikimedia.org/r/932197 (https://phabricator.wikimedia.org/T229584) [10:23:05] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:23:51] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [10:24:21] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [10:25:00] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [10:25:29] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [10:27:14] (03CR) 10Elukey: ml-services: upgrade llm image with new version of transformers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/932196 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [10:27:31] (03CR) 10Mvolz: [C: 03+2] Remove WorldCat references [deployment-charts] - 10https://gerrit.wikimedia.org/r/932194 (https://phabricator.wikimedia.org/T336297) (owner: 10Mvolz) [10:28:30] (03Merged) 10jenkins-bot: Remove WorldCat references [deployment-charts] - 10https://gerrit.wikimedia.org/r/932194 (https://phabricator.wikimedia.org/T336297) (owner: 10Mvolz) [10:29:32] (03PS9) 10Ladsgroup: mysql: Introduce sre.mysql.clone [cookbooks] - 10https://gerrit.wikimedia.org/r/931961 (https://phabricator.wikimedia.org/T340048) [10:29:32] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [10:29:35] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:29:50] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing 
list - https://phabricator.wikimedia.org/T340008 (10Lucy_Patterson_WMDE) Still thinking... is openglam-de@lists.wikimedia.org available? [10:29:59] (03PS4) 10Ilias Sarantopoulos: ml-services: upgrade llm image with new version of transformers [deployment-charts] - 10https://gerrit.wikimedia.org/r/932196 (https://phabricator.wikimedia.org/T334583) [10:30:23] (03CR) 10Ilias Sarantopoulos: "You're right! I missed that one" [deployment-charts] - 10https://gerrit.wikimedia.org/r/932196 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [10:31:12] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [10:31:31] (03CR) 10Elukey: [C: 03+1] ml-services: upgrade llm image with new version of transformers [deployment-charts] - 10https://gerrit.wikimedia.org/r/932196 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [10:32:03] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:32:35] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:32:50] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [10:33:15] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [10:33:33] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [10:33:56] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [10:34:46] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: upgrade llm image with new version of transformers [deployment-charts] - 10https://gerrit.wikimedia.org/r/932196 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [10:35:39] (03Merged) 10jenkins-bot: ml-services: upgrade llm image with new version of transformers 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/932196 (https://phabricator.wikimedia.org/T334583) (owner: 10Ilias Sarantopoulos) [10:42:12] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:45:16] (03CR) 10Muehlenhoff: [C: 03+2] ferm::service: Fix sprintf to pad zeros [puppet] - 10https://gerrit.wikimedia.org/r/932180 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:45:40] (03PS5) 10Muehlenhoff: Switch all uses priority of ferm::service to numeric values [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) [10:50:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:52:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931902 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:52:19] (03CR) 10Vgutierrez: [C: 03+1] hiera: Added new bullseye instance for cache-upload in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/932189 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur) [10:53:13] (03CR) 10Fabfur: [C: 03+2] hiera: Added new bullseye instance for cache-upload in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/932189 (https://phabricator.wikimedia.org/T327742) (owner: 10Fabfur) [10:55:24] (03CR) 10Volans: [C: 03+1] "post-merge +1" [cookbooks] - 10https://gerrit.wikimedia.org/r/932167 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [10:55:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:00:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:42] 10SRE, 10Wikimedia-Mailing-lists: Request GLAM-de mailing list - https://phabricator.wikimedia.org/T340008 (10Ladsgroup) `openglam-de` is doable. `g-l-a-m-de` is not really readable, it could be full names instead of abbreviations though if you don't want `openglam-de` [11:05:08] (03PS1) 10Volans: sre.dns.wipe-cache: fix doc link to wikitech [cookbooks] - 10https://gerrit.wikimedia.org/r/932204 [11:05:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:16] (03CR) 10Volans: [C: 03+2] sre.dns.wipe-cache: fix doc link to wikitech [cookbooks] - 10https://gerrit.wikimedia.org/r/932204 (owner: 10Volans) [11:12:55] (03Merged) 10jenkins-bot: sre.dns.wipe-cache: fix doc link to wikitech [cookbooks] - 10https://gerrit.wikimedia.org/r/932204 (owner: 10Volans) [11:23:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:23:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of testvm2002.codfw.wmnet to plain [11:24:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk 
type of testvm2002.codfw.wmnet to plain [11:28:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:29:02] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/931913 [11:32:21] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001'] [11:32:24] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['sessionstore2001'] [11:32:30] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [11:32:59] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [11:33:10] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [11:33:25] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [11:33:28] (03PS1) 10Elukey: Move eqiad varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932217 (https://phabricator.wikimedia.org/T337825) [11:33:30] (03PS1) 10Elukey: Move esams varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932218 (https://phabricator.wikimedia.org/T337825) [11:35:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [11:35:30] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS 
(CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41924/console" [puppet] - 10https://gerrit.wikimedia.org/r/932217 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [11:36:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet [11:37:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [11:37:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [11:38:16] (03CR) 10Btullis: [C: 03+2] Bump the version of airflow installed on the analytics_test instance [puppet] - 10https://gerrit.wikimedia.org/r/931637 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis) [11:38:37] (03PS2) 10Slyngshede: C:idm:deployment link to runbook. [puppet] - 10https://gerrit.wikimedia.org/r/931879 [11:39:25] (03PS2) 10Elukey: Move esams varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932218 (https://phabricator.wikimedia.org/T337825) [11:39:27] (03PS1) 10Elukey: Move drmrs Varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932219 (https://phabricator.wikimedia.org/T337825) [11:41:23] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41925/console" [puppet] - 10https://gerrit.wikimedia.org/r/932219 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [11:41:28] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [11:41:31] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [11:43:09] (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/932217 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [11:44:41] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [11:44:58] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [11:45:08] (03CR) 10Btullis: [C: 03+1] Move drmrs Varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932219 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [11:45:40] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [11:45:56] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [11:46:35] (03CR) 10Btullis: [C: 03+1] "Great! Nice cleanup." [puppet] - 10https://gerrit.wikimedia.org/r/932218 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [11:46:40] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (https://phabricator.wikimedia.org/T308002) (owner: 10Slyngshede) [11:49:20] (03PS6) 10Samtar: IS: Enable Phonos on 'small' projects, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930008 (https://phabricator.wikimedia.org/T336763) [11:57:40] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [11:57:47] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [11:58:26] (03PS1) 10Slyngshede: netbox:standalone Fix minor error in OIDC config. 
[puppet] - 10https://gerrit.wikimedia.org/r/932223 [12:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:10] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:03:17] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [12:03:22] (03CR) 10Slyngshede: [C: 03+2] netbox:standalone Fix minor error in OIDC config. [puppet] - 10https://gerrit.wikimedia.org/r/932223 (owner: 10Slyngshede) [12:04:09] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:04:14] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [12:04:57] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:05:00] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [12:05:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:07] (03CR) 10FNegri: cumin: Increase connect_timeout for slow servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [12:06:10] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:06:13] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [12:06:34] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node 
ganeti-test2001.codfw.wmnet [12:06:44] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet [12:07:28] (03CR) 10Btullis: [C: 03+1] "This looks good to me. Is ml-cache our only cassandra 4.x based cluster right now?" [puppet] - 10https://gerrit.wikimedia.org/r/931276 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [12:09:50] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932026 (owner: 10EoghanGaffney) [12:09:52] (03CR) 10Btullis: [C: 03+1] analytics: Remove analytics106[1-3] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930581 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [12:17:07] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:25:06] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [12:25:10] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:25:24] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet [12:26:26] (03PS2) 10EoghanGaffney: releases: Add new releases hosts to docker_registry_ha allowlist [puppet] - 10https://gerrit.wikimedia.org/r/932027 [12:26:28] (03PS1) 10EoghanGaffney: releases: Move the primary releases host from 1002 to 1003 [puppet] - 10https://gerrit.wikimedia.org/r/932228 [12:26:37] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:26:49] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [12:27:14] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:28:03] !log root@cumin2002 END 
(FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2001.codfw.wmnet [12:28:08] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:28:13] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet [12:29:54] (03PS1) 10EoghanGaffney: releases: Switch releases.d.w to releases1003 [dns] - 10https://gerrit.wikimedia.org/r/932230 [12:30:38] (03PS1) 10Slyngshede: R:idp remove NDA requirement from netbox_oidc. [puppet] - 10https://gerrit.wikimedia.org/r/932231 [12:31:03] (03CR) 10Slyngshede: [C: 03+2] R:idp remove NDA requirement from netbox_oidc. [puppet] - 10https://gerrit.wikimedia.org/r/932231 (owner: 10Slyngshede) [12:32:15] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [12:32:23] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet [12:42:14] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Dzahn) [12:42:17] (03PS1) 10Filippo Giunchedi: admin: update filippo's key [puppet] - 10https://gerrit.wikimedia.org/r/932233 (https://phabricator.wikimedia.org/T336769) [12:42:38] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10serviceops-collab: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (10Dzahn) Thank you, @Volans . 
removing netbox tag again [12:43:15] (03PS1) 10Filippo Giunchedi: Update filippo's key [homer/public] - 10https://gerrit.wikimedia.org/r/932234 (https://phabricator.wikimedia.org/T336769) [12:43:27] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932027 (owner: 10EoghanGaffney) [12:44:24] (03CR) 10Jelto: "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/932230 (owner: 10EoghanGaffney) [12:45:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:50] (03CR) 10Jelto: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/932228 (owner: 10EoghanGaffney) [12:49:06] logstash pretty noisy with `Expectation (maxAffected <= 1000) by MediaWiki\\Maintenance\\MaintenanceRunner::run not met` warnings on `mwmaint1002`.. [12:49:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:53] (03PS1) 10Ayounsi: knams: decom Datahop [homer/public] - 10https://gerrit.wikimedia.org/r/932236 (https://phabricator.wikimedia.org/T340049) [12:53:03] (03PS1) 10Muehlenhoff: Fix migration when "plain" instances are involved [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) [12:55:36] (03CR) 10CI reject: [V: 04-1] Fix migration when "plain" instances are involved [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [12:58:00] (03PS1) 10Btullis: Bump the version of eventgate-wikimedia deployed to eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/932238 (https://phabricator.wikimedia.org/T267648) [12:58:18] (03CR) 10Muehlenhoff: [C: 03+2] Switch all uses priority of ferm::service to numeric values [puppet] - 10https://gerrit.wikimedia.org/r/931902 
(https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T1300) [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T1300). [13:00:06] TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] * TheresNoTime will deploy [13:00:18] go ahead :) [13:00:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:41] (03PS2) 10Elukey: Move eqiad varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932217 (https://phabricator.wikimedia.org/T337825) [13:00:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930008 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [13:01:22] I can only deploy later [13:01:23] go ahead :) [13:01:55] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog-Deprecated, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) >>! In T340036#8952843, @akosiaris wrote: >>>! In T340036#8952836, @MSantos wrote: >> Sounds great... 
[13:02:10] (03Merged) 10jenkins-bot: IS: Enable Phonos on 'small' projects, set PhonosInlineAudioPlayerMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930008 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar) [13:02:18] (03CR) 10Vgutierrez: [C: 03+1] Move eqiad varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932217 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [13:02:20] (03PS2) 10Muehlenhoff: Add missing types to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) [13:02:28] !log samtar@deploy1002 Started scap: Backport for [[gerrit:930008|IS: Enable Phonos on 'small' projects, set PhonosInlineAudioPlayerMode (T336763)]] [13:02:33] T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763 [13:02:41] (03CR) 10Ayounsi: "Thanks for the clarifications. LGTM but please run it through John and/or Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/931263 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [13:03:01] (03CR) 10Muehlenhoff: Add missing types to ferm::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:03:08] (03PS3) 10Muehlenhoff: Add missing types to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) [13:03:55] !log samtar@deploy1002 samtar: Backport for [[gerrit:930008|IS: Enable Phonos on 'small' projects, set PhonosInlineAudioPlayerMode (T336763)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:03:56] 10SRE, 10AbuseFilter, 10serviceops, 10PHP 7.4 support: Regular expression "х[ÿý]и" match "х и" in Abusefilter - https://phabricator.wikimedia.org/T340068 (10Urbanecm) Adding some tags. 
Per the following conversation in -tech, this issue is caused by something within our php7.4-cli package: `lang=irc PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:39] (03CR) 10Filippo Giunchedi: "See CI failures (missing runbook) and nits inline, but other than that LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/931286 (https://phabricator.wikimedia.org/T339370) (owner: 10Jelto) [13:05:41] (03CR) 10CI reject: [V: 04-1] Add missing types to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:06:12] * TheresNoTime sync [13:06:31] (03PS1) 10Slyngshede: P:netbox:standalone use idp-test for authentication [puppet] - 10https://gerrit.wikimedia.org/r/932239 [13:08:03] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/932239 (owner: 10Slyngshede) [13:08:25] (03CR) 10Ottomata: [C: 03+1] Bump the version of eventgate-wikimedia deployed to eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/932238 (https://phabricator.wikimedia.org/T267648) (owner: 10Btullis) [13:08:38] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: actually delete chunks from loki [puppet] - 10https://gerrit.wikimedia.org/r/929749 (https://phabricator.wikimedia.org/T335610) (owner: 10Cwhite) [13:08:42] (03CR) 10Btullis: [C: 03+2] Bump the version of eventgate-wikimedia deployed to eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/932238 (https://phabricator.wikimedia.org/T267648) (owner: 10Btullis) [13:09:03] (03CR) 10Filippo Giunchedi: [C: 03+2] webperf: Enable `logToSyslogCee` option for Excimer UI [puppet] - 10https://gerrit.wikimedia.org/r/930217 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [13:09:35] (03Merged) 10jenkins-bot: Bump the version of eventgate-wikimedia deployed 
to eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/932238 (https://phabricator.wikimedia.org/T267648) (owner: 10Btullis) [13:11:42] (03CR) 10Slyngshede: [C: 03+2] P:netbox:standalone use idp-test for authentication [puppet] - 10https://gerrit.wikimedia.org/r/932239 (owner: 10Slyngshede) [13:11:55] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:930008|IS: Enable Phonos on 'small' projects, set PhonosInlineAudioPlayerMode (T336763)]] (duration: 09m 26s) [13:12:00] T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763 [13:12:37] (03CR) 10Elukey: [C: 03+2] Move eqiad varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/932217 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [13:12:40] (03PS1) 10Btullis: Remove the specific airflow version override for an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/932241 (https://phabricator.wikimedia.org/T336286) [13:12:44] (03CR) 10Filippo Giunchedi: "LGTM, see nit inline" [software/librenms] - 10https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) (owner: 10Andrea Denisse) [13:13:22] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [13:14:24] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [13:14:34] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [13:15:16] (03CR) 10Filippo Giunchedi: [C: 03+1] pyrra: add pyrra::(api|filesystem) modules [puppet] - 10https://gerrit.wikimedia.org/r/929719 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [13:15:30] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [13:15:39] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [13:16:14] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply 
[13:16:47] (done deploying) [13:17:00] !log move varnishkafka instances in eqiad to PKI [13:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:13] (03CR) 10Ssingh: Add cookbook to handle restarts of Wikimedia DNS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [13:19:39] (03CR) 10Filippo Giunchedi: profile::pyrra::filesystem: add profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [13:20:14] (03CR) 10Volans: cumin: Increase connect_timeout for slow servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [13:21:00] (03Abandoned) 10Ssingh: O:dnsbox: clean-up service binding for pdns-rec/gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/931945 (owner: 10Ssingh) [13:21:06] (03PS1) 10Jbond: idpL: make netbox-next entry specific [puppet] - 10https://gerrit.wikimedia.org/r/932242 [13:22:17] (03PS1) 10Muehlenhoff: Fix multiple specs to run on our default OSes, not an OS we no longer use [puppet] - 10https://gerrit.wikimedia.org/r/932243 [13:30:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:44] (03CR) 10JHathaway: [C: 03+1] "look good" [puppet] - 10https://gerrit.wikimedia.org/r/932243 (owner: 10Muehlenhoff) [13:33:47] (03CR) 10Ayounsi: "A few comments but the overall approach lgtm." 
[homer/public] - 10https://gerrit.wikimedia.org/r/931691 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [13:38:06] (03PS2) 10Muehlenhoff: Fix migration when "plain" instances are involved [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) [13:40:15] (03CR) 10Muehlenhoff: [C: 03+2] Fix multiple specs to run on our default OSes, not an OS we no longer use [puppet] - 10https://gerrit.wikimedia.org/r/932243 (owner: 10Muehlenhoff) [13:40:35] (03CR) 10CI reject: [V: 04-1] Fix migration when "plain" instances are involved [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [13:40:53] TheresNoTime: we might try to sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/931696 now [13:42:32] hello, is there room for another change? [13:42:43] * Lucas_WMDE is around now [13:43:38] (03CR) 10Stevemunene: [C: 03+2] analytics: Remove analytics106[1-3] from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/930581 (https://phabricator.wikimedia.org/T317861) (owner: 10Stevemunene) [13:43:48] sergi0: are you talking about kostajh’s change or is this two separate requests? ^^ [13:44:10] it's the same one [13:44:11] (I could deploy something now if TheresNoTime is done) [13:44:17] Lucas_WMDE: go ahead :) [13:44:18] Lucas_WMDE: same change. 
Sorry, just connected :) [13:44:21] ok :) [13:45:40] (03PS2) 10Lucas Werkmeister (WMDE): GrowthExperiments: Deploy section-level images structured task [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931696 (https://phabricator.wikimedia.org/T339126) (owner: 10Gergő Tisza) [13:45:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931696 (https://phabricator.wikimedia.org/T339126) (owner: 10Gergő Tisza) [13:47:02] (03Merged) 10jenkins-bot: GrowthExperiments: Deploy section-level images structured task [mediawiki-config] - 10https://gerrit.wikimedia.org/r/931696 (https://phabricator.wikimedia.org/T339126) (owner: 10Gergő Tisza) [13:47:17] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:931696|GrowthExperiments: Deploy section-level images structured task (T339126)]] [13:47:21] T339126: Deploy section-level images structured task - https://phabricator.wikimedia.org/T339126 [13:48:40] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and tgr: Backport for [[gerrit:931696|GrowthExperiments: Deploy section-level images structured task (T339126)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:49:22] kostajh, sergi0: please test :) [13:49:32] testing now [13:50:19] (03CR) 10JHathaway: [C: 03+2] puppetserver::ca: add trailing newline [puppet] - 10https://gerrit.wikimedia.org/r/932016 (https://phabricator.wikimedia.org/T339913) (owner: 10JHathaway) [13:50:40] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41926/console" [puppet] - 10https://gerrit.wikimedia.org/r/932026 (owner: 10EoghanGaffney) [13:52:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:53:22] ok [13:53:34] (03PS2) 10JHathaway: dev env, ssh::client: create /etc/ssh dir [puppet] - 10https://gerrit.wikimedia.org/r/932013 (https://phabricator.wikimedia.org/T337972) [13:53:36] (03PS1) 10Jbond: idp: use groups for the groups attribute when doing OIDC [puppet] - 10https://gerrit.wikimedia.org/r/932247 (https://phabricator.wikimedia.org/T308002) [13:54:05] (03CR) 10CI reject: [V: 04-1] idp: use groups for the groups attribute when doing OIDC [puppet] - 10https://gerrit.wikimedia.org/r/932247 (https://phabricator.wikimedia.org/T308002) (owner: 10Jbond) [13:54:07] Lucas_WMDE: things seem ok from my tests [13:54:07] (03CR) 10JHathaway: [C: 03+2] dev env, ssh::client: create /etc/ssh dir (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/932013 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [13:54:11] !log volans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye [13:54:11] (03CR) 10JHathaway: [V: 03+2 C: 03+2] dev env, ssh::client: create /etc/ssh dir [puppet] - 10https://gerrit.wikimedia.org/r/932013 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [13:54:24] ok, syncing [13:54:27] (03PS1) 10Ssingh: O:dnsbox: clean-up dnsbox role and dns::recursor [puppet] - 10https://gerrit.wikimedia.org/r/932248 [13:55:04] (03CR) 10Btullis: [C: 03+2] Remove the specific airflow version override for an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/932241 (https://phabricator.wikimedia.org/T336286) (owner: 10Btullis) [13:56:41] (03PS2) 10JHathaway: dev env, ssh::server: create /run/ssh dir [puppet] - 10https://gerrit.wikimedia.org/r/932014 (https://phabricator.wikimedia.org/T337972) [13:57:12] (03CR) 10JHathaway: [C: 03+2] dev env, ssh::server: create /run/ssh dir (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/932014 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [13:57:14] (03CR) 10JHathaway: [V: 03+2 C: 03+2] dev env, ssh::server: create /run/ssh dir [puppet] - 10https://gerrit.wikimedia.org/r/932014 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [13:57:27] (03PS2) 10Stevemunene: analytics: Decommission analytics106[4-6] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/930582 (https://phabricator.wikimedia.org/T317861) [13:57:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:57:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41929/console" [puppet] - 10https://gerrit.wikimedia.org/r/932247 (https://phabricator.wikimedia.org/T308002) (owner: 10Jbond) [13:57:47] (03PS3) 10Muehlenhoff: Fix migration when "plain" instances are involved [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) [13:59:47] (03PS2) 10Jbond: idp: use groups for the groups attribute when doing OIDC [puppet] - 10https://gerrit.wikimedia.org/r/932247 (https://phabricator.wikimedia.org/T308002) [14:00:01] (03CR) 10Btullis: [C: 03+2] Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:00:06] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:931696|GrowthExperiments: Deploy section-level images structured task (T339126)]] (duration: 12m 49s) [14:00:09] !log UTC afternoon backport+config window done [14:00:17] T339126: Deploy section-level images structured task - https://phabricator.wikimedia.org/T339126 [14:00:19] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:20] (03CR) 10CI reject: [V: 04-1] Fix migration when "plain" instances are involved [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) (owner: 10Muehlenhoff) [14:00:26] !log volans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2001.codfw.wmnet with OS bullseye [14:00:57] (03Merged) 10jenkins-bot: Add support for upgrading datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/930825 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:01:02] !log volans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye [14:01:05] (03CR) 10AOkoth: [C: 03+2] vrts: post decom cleanup [puppet] - 10https://gerrit.wikimedia.org/r/931610 (https://phabricator.wikimedia.org/T339253) (owner: 10AOkoth) [14:01:12] (03CR) 10Slyngshede: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/932247 (https://phabricator.wikimedia.org/T308002) (owner: 10Jbond) [14:02:08] (03CR) 10Jbond: [C: 03+2] idp: use groups for the groups attribute when doing OIDC [puppet] - 10https://gerrit.wikimedia.org/r/932247 (https://phabricator.wikimedia.org/T308002) (owner: 10Jbond) [14:02:50] 10SRE, 10Traffic: Webrequests live data shows traffic without TLS on varnish for upload.w.o - https://phabricator.wikimedia.org/T340097 (10Vgutierrez) @Volans even if this is the expected behavior right now we need to clarify the dashboards a little bit. The first scenario (upload.wm.o and req.restarts >= 1) s... [14:03:10] !log stevemunene@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[14:03:11] (03PS4) 10Muehlenhoff: Fix migration when "plain" instances are involved [cookbooks] - 10https://gerrit.wikimedia.org/r/932237 (https://phabricator.wikimedia.org/T203964) [14:03:35] (03PS4) 10Muehlenhoff: Add missing types to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) [14:05:51] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:06:01] (03CR) 10CI reject: [V: 04-1] Add missing types to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:07:22] !log root@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [14:07:31] !log root@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet [14:07:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:43] 10SRE, 10AbuseFilter, 10serviceops, 10PHP 7.4 support: Regular expression "х[ÿý]и" match "х и" in Abusefilter - https://phabricator.wikimedia.org/T340068 (10Daimona) Additional notes from IRC: `lang=irc Daimona: "Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not ex... 
[14:10:59] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cloudservices2004-dev.wikimedia.org [14:10:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cloudservices2004-dev.wikimedia.org [14:11:20] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [14:12:40] !log volans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2001.codfw.wmnet with OS bullseye [14:13:31] (03CR) 10Ottomata: [C: 03+1] mw-page-content-change-enrich: HA in main [deployment-charts] - 10https://gerrit.wikimedia.org/r/931307 (https://phabricator.wikimedia.org/T338233) (owner: 10Gmodena) [14:15:53] (03PS2) 10Ssingh: O:dnsbox: clean-up dnsbox role and dns::recursor [puppet] - 10https://gerrit.wikimedia.org/r/932248 [14:16:55] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41930/console" [puppet] - 10https://gerrit.wikimedia.org/r/932248 (owner: 10Ssingh) [14:17:35] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:23] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:20:30] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/923386 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [14:22:52] 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet iDRAC issues - https://phabricator.wikimedia.org/T340055 (10Eevans) From IRC: `lang=irc [ ... 
] 9:11 AM there is no mac address learned on that port 9:12 AM also the interface is down 9:12 AM been down for 22h 9:12 AM 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet unable to PXE boot - https://phabricator.wikimedia.org/T340055 (10Eevans) [14:26:17] (03CR) 10Muehlenhoff: [C: 03+1] "Confirmed via out-of-band channel (Slack)" [puppet] - 10https://gerrit.wikimedia.org/r/932233 (https://phabricator.wikimedia.org/T336769) (owner: 10Filippo Giunchedi) [14:27:25] (03CR) 10Ssingh: [C: 03+1] "Thanks for the patch 😊" [puppet] - 10https://gerrit.wikimedia.org/r/932233 (https://phabricator.wikimedia.org/T336769) (owner: 10Filippo Giunchedi) [14:28:13] 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet unable to PXE boot - https://phabricator.wikimedia.org/T340055 (10Jhancock.wm) @Eevans I checked the server out physically this morning. The port shows an active connection on the server. However, on the switch there is no active light. This might b... [14:29:44] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: update filippo's key [puppet] - 10https://gerrit.wikimedia.org/r/932233 (https://phabricator.wikimedia.org/T336769) (owner: 10Filippo Giunchedi) [14:30:31] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) [14:31:26] 10SRE, 10Traffic: Webrequests live data shows traffic without TLS on varnish for upload.w.o - https://phabricator.wikimedia.org/T340097 (10Volans) @Vgutierrez I don't see that field in druid so I think we have to check if that's available when benthos parses the stream and set a field for it. This for the live... [14:32:14] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[14:32:36] (03CR) 10Muehlenhoff: [C: 03+1] "Confirmed via out-of-bounds channel (Slack)" [homer/public] - 10https://gerrit.wikimedia.org/r/932234 (https://phabricator.wikimedia.org/T336769) (owner: 10Filippo Giunchedi) [14:36:28] (03PS3) 10JHathaway: dev env: sshd, allow for user CA based auth [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) [14:37:58] !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye [14:38:05] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye [14:43:40] (03CR) 10Filippo Giunchedi: "Thank you for the verification! I'm not sure if I'm supposed to merge or wait for the next batch of deployments of keys?" [homer/public] - 10https://gerrit.wikimedia.org/r/932234 (https://phabricator.wikimedia.org/T336769) (owner: 10Filippo Giunchedi) [14:47:01] !log eevans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2001.codfw.wmnet with OS bullseye [14:47:06] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye executed with errors: - sessi... 
[14:48:03] (03PS1) 10Btullis: Fix some naming issues with the datahub-upgrade jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/932256 (https://phabricator.wikimedia.org/T329514) [14:50:43] !log upgrade dns3001 to gdnsd 3.99.0~alpha2 [14:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:14] (03CR) 10Btullis: [C: 03+2] Fix some naming issues with the datahub-upgrade jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/932256 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:52:30] (03Merged) 10jenkins-bot: Fix some naming issues with the datahub-upgrade jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/932256 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [14:53:53] !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye [14:53:59] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye [14:59:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jgreen) Confirmed traffic to frpig1001 has stopped. Taking it out of monitoring so we can shut it down and start the decom. [14:59:23] (03PS1) 10Jgreen: Remove frpig1001.frack.eqiad.wmnet from nsca_frack.cfg.erb in prep for decommission. [puppet] - 10https://gerrit.wikimedia.org/r/932257 (https://phabricator.wikimedia.org/T319460) [15:00:00] (03CR) 10CI reject: [V: 04-1] Remove frpig1001.frack.eqiad.wmnet from nsca_frack.cfg.erb in prep for decommission. 
[puppet] - 10https://gerrit.wikimedia.org/r/932257 (https://phabricator.wikimedia.org/T319460) (owner: 10Jgreen) [15:01:59] !log eevans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2001.codfw.wmnet with OS bullseye [15:02:06] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye executed with errors: - sessi... [15:08:18] (03CR) 10Ottomata: "small thing, but otherwise LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/932165 (owner: 10TChin) [15:09:19] (03PS1) 10AikoChou: ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/932261 (https://phabricator.wikimedia.org/T328899) [15:11:44] (03CR) 10AikoChou: [C: 03+2] ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/932261 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [15:12:45] (03Merged) 10jenkins-bot: ml-services: update outlink transformer docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/932261 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [15:14:16] (03CR) 10JHathaway: [C: 03+2] puppetserver: fix config perms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932015 (https://phabricator.wikimedia.org/T339913) (owner: 10JHathaway) [15:16:34] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:18:06] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:19:17] (03PS1) 10Elukey: benthos: add X-Analytics' https field to the stream [puppet] - 10https://gerrit.wikimedia.org/r/932262 (https://phabricator.wikimedia.org/T340097) [15:21:58] (03CR) 10Ayounsi: Update filippo's key (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/932234 (https://phabricator.wikimedia.org/T336769) (owner: 10Filippo Giunchedi) [15:22:36] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:23:35] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10akosiaris) Rules created, but NOT enabled, The corresponding VCL is ` // FILTER T340036 // Give wikiwand and kiwix an extensi... [15:23:59] (03PS2) 10Jgreen: Remove frpig1001 from nsca_frack.cfg.erb in prep for decom. [puppet] - 10https://gerrit.wikimedia.org/r/932257 (https://phabricator.wikimedia.org/T319460) [15:25:37] (03CR) 10Klausman: [C: 03+1] role::ml_cache::storage: use pki truststore [puppet] - 10https://gerrit.wikimedia.org/r/931903 (https://phabricator.wikimedia.org/T288470) (owner: 10Elukey) [15:27:21] sukhe, cdanis any thoughts about repooling codfw' [15:27:23] :? [15:28:00] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/932262 (https://phabricator.wikimedia.org/T340097) (owner: 10Elukey) [15:29:05] !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye [15:29:13] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye [15:29:28] (03CR) 10Filippo Giunchedi: [C: 03+1] benthos: add X-Analytics' https field to the stream [puppet] - 10https://gerrit.wikimedia.org/r/932262 (https://phabricator.wikimedia.org/T340097) (owner: 10Elukey) [15:29:43] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove frpig1001 from nsca_frack.cfg.erb in prep for decom. [puppet] - 10https://gerrit.wikimedia.org/r/932257 (https://phabricator.wikimedia.org/T319460) (owner: 10Jgreen) [15:29:45] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:32:59] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:34:12] !log eevans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2001.codfw.wmnet with OS bullseye [15:34:19] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye executed with errors: - sessi... 
[15:35:36] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] releases: Add motd warning about upcoming host change [puppet] - 10https://gerrit.wikimedia.org/r/932026 (owner: 10EoghanGaffney) [15:35:59] (03CR) 10Volans: [C: 03+1] "Thanks a lot! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/932262 (https://phabricator.wikimedia.org/T340097) (owner: 10Elukey) [15:36:17] (03CR) 10Elukey: [C: 03+2] benthos: add X-Analytics' https field to the stream [puppet] - 10https://gerrit.wikimedia.org/r/932262 (https://phabricator.wikimedia.org/T340097) (owner: 10Elukey) [15:36:32] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] releases: Add motd warning about upcoming host change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932026 (owner: 10EoghanGaffney) [15:36:35] (03PS1) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [15:36:59] (03CR) 10CI reject: [V: 04-1] Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:37:06] eoghan: o/ ok to merge? [15:37:15] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:37:23] elukey: Damn your quick typing. Yes please! [15:37:51] ahahha done! 
[15:37:56] (03CR) 10EoghanGaffney: [C: 03+2] releases: Add new releases hosts to docker_registry_ha allowlist [puppet] - 10https://gerrit.wikimedia.org/r/932027 (owner: 10EoghanGaffney) [15:38:00] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [15:41:29] (03PS2) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [15:41:52] (03CR) 10CI reject: [V: 04-1] Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:44:34] (03PS3) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [15:44:57] (03CR) 10CI reject: [V: 04-1] Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:45:53] (03PS4) 10JHathaway: dev env: sshd, allow for user CA based auth [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) [15:46:40] vgutierrez: back [15:46:43] and +1 on repooling [15:46:44] doing it [15:46:45] !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye [15:46:47] (03PS4) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [15:46:49] (03CR) 10JHathaway: dev env: sshd, allow for user CA based auth (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [15:46:52] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye [15:47:10] (03CR) 10CI reject: [V: 04-1] Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:47:33] (03PS1) 10Ssingh: Revert "depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/932268 [15:48:14] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:48:47] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [15:49:13] (03CR) 10Ssingh: [C: 03+2] Revert "depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/932268 (owner: 10Ssingh) [15:49:18] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Revert "depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/932268 (owner: 10Ssingh) [15:49:25] (03PS2) 10Ssingh: Revert "depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/932268 [15:49:50] (03PS1) 10Btullis: Ensure that the datahub secrets are available when upgrading releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/932288 (https://phabricator.wikimedia.org/T329514) [15:50:40] !log running authdns-update to repool codfw [15:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:47] (03PS10) 10BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) [15:51:44] vgutierrez: done and thanks for bringing it up [15:51:57] sukhe: <3 [15:52:02] !log eevans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2001.codfw.wmnet with OS bullseye [15:52:08] !log eevans@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [15:52:08] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore 
to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye executed with errors: - sessi... [15:54:45] (03PS5) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [15:55:15] (03PS2) 10Btullis: Ensure that the datahub secrets are available when upgrading releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/932288 (https://phabricator.wikimedia.org/T329514) [15:55:28] (03PS5) 10Vgutierrez: haproxy: Add support for filter bwlim-(in|out) [puppet] - 10https://gerrit.wikimedia.org/r/928541 (https://phabricator.wikimedia.org/T317799) [15:55:30] (03PS8) 10Vgutierrez: hiera: Test HAProxy bw limits per URL on cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) [15:55:59] (03PS6) 10Muehlenhoff: Allow passing sets to an srange or drange (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/932286 (https://phabricator.wikimedia.org/T336497) [15:56:40] (03CR) 10Dwisehaupt: [C: 03+2] "Confirmed it is ready for decom." [puppet] - 10https://gerrit.wikimedia.org/r/932257 (https://phabricator.wikimedia.org/T319460) (owner: 10Jgreen) [15:56:53] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) @akosiaris the deadline we defined for the deprecation is July 1st 2023, we can flip the switch then. Does the rule a... 
[15:57:03] (03CR) 10Jbond: [C: 03+1] puppetserver: fix config perms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932015 (https://phabricator.wikimedia.org/T339913) (owner: 10JHathaway) [15:57:41] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [15:57:52] (03CR) 10Vgutierrez: hiera: Test HAProxy bw limits per URL on cp4052 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [15:58:59] !log eevans@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sessionstore2001.codfw.wmnet'] [15:59:29] vgutierrez: sukhe: sorry was afk, sounds good and thanks <3 [15:59:51] no problem :) [16:00:06] cwhite: (Dis)respected human, time to deploy Logstash datacenter switchover to codfw (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T1600). Please do the needful. [16:00:06] jbond and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T1600) [16:00:06] No Gerrit patches in the queue for this window AFAICS. [16:00:26] cdanis: <3 [16:00:32] !log eevans@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS bullseye [16:00:38] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye [16:03:54] (03CR) 10Arturo Borrero Gonzalez: "thanks! some comments inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [16:07:01] !log eevans@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2001.codfw.wmnet with OS bullseye [16:07:12] 10SRE, 10Cassandra, 10Infrastructure-Foundations, 10Epic: Upgrade sessionstore to bullseye - https://phabricator.wikimedia.org/T340043 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin2002 for host sessionstore2001.codfw.wmnet with OS bullseye executed with errors: - sessi... [16:08:10] (03PS11) 10BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) [16:08:34] (03CR) 10Btullis: [C: 03+2] Ensure that the datahub secrets are available when upgrading releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/932288 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [16:09:19] (03Merged) 10jenkins-bot: Ensure that the datahub secrets are available when upgrading releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/932288 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis) [16:10:05] (03CR) 10Ssingh: "Thanks for assuming the systemd ordering will be handled! Merging that patch is on me so apologies for the delay." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [16:10:17] (03CR) 10Cwhite: [C: 03+2] hiera: map logstash.wm.o to kibana7.codfw [puppet] - 10https://gerrit.wikimedia.org/r/931911 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [16:17:03] !log eevans@cumin2002 START - Cookbook sre.puppet.renew-cert for sessionstore2001.codfw.wmnet: Renew puppet certificate - eevans@cumin2002 [16:17:05] !log eevans@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for sessionstore2001.codfw.wmnet: Renew puppet certificate - eevans@cumin2002 [16:20:08] (03PS5) 10FNegri: cumin: Increase connect_timeout for slow servers [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) [16:20:34] (03PS1) 10Jgreen: Switch frdev.wm.o to new frdev1002 server. [dns] - 10https://gerrit.wikimedia.org/r/932293 (https://phabricator.wikimedia.org/T333485) [16:20:44] (03CR) 10BCornwall: [V: 03+1] "Running:" [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [16:21:37] !log eevans@cumin2002 START - Cookbook sre.puppet.renew-cert for sessionstore2001.codfw.wmnet: Renew puppet certificate - eevans@cumin2002 [16:21:38] !log eevans@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for sessionstore2001.codfw.wmnet: Renew puppet certificate - eevans@cumin2002 [16:22:53] !log eevans@cumin2002 START - Cookbook sre.puppet.renew-cert for sessionstore2001.codfw.wmnet: Renew puppet certificate - eevans@cumin2002 [16:23:11] !log eevans@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for sessionstore2001.codfw.wmnet: Renew puppet certificate - eevans@cumin2002 [16:24:02] !log eevans@cumin2002 START - Cookbook sre.puppet.renew-cert for sessionstore2001.codfw.wmnet: Renew puppet certificate - eevans@cumin2002 [16:24:08] !log eevans@cumin2002 END (FAIL) - Cookbook 
sre.puppet.renew-cert (exit_code=99) for sessionstore2001.codfw.wmnet: Renew puppet certificate - eevans@cumin2002 [16:25:40] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [16:26:17] !log eevans@cumin2002 START - Cookbook sre.puppet.renew-cert for sessionstore2001.codfw.wmnet: Renew puppet certificate - eevans@cumin2002 [16:27:27] !log eevans@cumin2002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for sessionstore2001.codfw.wmnet: Renew puppet certificate - eevans@cumin2002 [16:31:41] (03CR) 10Jbond: Add missing types to ferm::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/931890 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:34:23] (03CR) 10Dwisehaupt: [C: 03+2] "shipit" [dns] - 10https://gerrit.wikimedia.org/r/932293 (https://phabricator.wikimedia.org/T333485) (owner: 10Jgreen) [16:35:47] (03PS1) 10Btullis: Re-enable the bullseye based hadoop worker in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/932294 (https://phabricator.wikimedia.org/T329363) [16:35:51] (03CR) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [16:36:28] (03CR) 10Btullis: [C: 03+2] Re-enable the bullseye based hadoop worker in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/932294 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [16:36:54] (03CR) 10Ssingh: Add cookbook to handle restarts of Wikimedia DNS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [16:40:37] (03PS12) 10BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) [16:40:54] (03CR) 10BCornwall: [V: 03+1] Add cookbook 
to handle restarts of Wikimedia DNS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [16:42:14] (03CR) 10Ssingh: [C: 03+1] "Really nice work, thanks for working on this cookbook!" [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [16:42:17] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/932248 (owner: 10Ssingh) [16:42:24] (03PS1) 10Vivian Rook: remove git bits as they haven't been doing anything [puppet] - 10https://gerrit.wikimedia.org/r/932295 (https://phabricator.wikimedia.org/T340114) [16:42:42] (03PS13) 10BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) [16:44:02] 10SRE, 10ops-codfw, 10DC-Ops: sessionstore2001.codfw.wmnet unable to PXE boot - https://phabricator.wikimedia.org/T340055 (10Eevans) In the course of today's troubleshooting, we tried: - Replacing the SFP-T - Replacing the cable - Swapping on-board NICs (from port 1 to port 2) - Swapping switch ports - Upg... 
[16:45:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [16:48:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:52:07] (03PS1) 10Btullis: Enable the PRESTO_EXPAND_DATA feature flag in Superset [puppet] - 10https://gerrit.wikimedia.org/r/932298 (https://phabricator.wikimedia.org/T340144) [16:53:15] (03CR) 10Majavah: [C: 04-1] dev env: sshd, allow for user CA based auth (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:54:21] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [16:58:45] (03CR) 10Btullis: [C: 03+2] Enable the PRESTO_EXPAND_DATA feature flag in Superset [puppet] - 10https://gerrit.wikimedia.org/r/932298 (https://phabricator.wikimedia.org/T340144) (owner: 10Btullis) [16:58:59] (03PS1) 10Majavah: P:quarry::base: fix file permissions [puppet] - 10https://gerrit.wikimedia.org/r/932299 (https://phabricator.wikimedia.org/T340114) [17:00:04] bd808: Dear deployers, time to do the Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T1700). [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T1700) [17:01:08] I don't have anything to push out today. There are a tiny number of updated developer-portal translations since last week, but not enough to bother deploying yet. 
[17:03:04] !log brett@cumin2002 START - Cookbook sre.dns.roll-restart-wikimedia-dns rolling restart_daemons on P{doh6001*} and A:wikidough [17:03:31] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-wikimedia-dns (exit_code=0) rolling restart_daemons on P{doh6001*} and A:wikidough [17:04:42] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [17:05:58] (03CR) 10BCornwall: [C: 03+2] Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [17:32:19] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp1082*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:cp-drmrs or A:cp-text_ [17:32:19] drmrs or A:cp-upload_drmrs) [17:32:23] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=97) Rolling upgrade/restart of Apache Traffic Server on P{cp1082*} and (A:cp-eqiad or A:cp-text_eqiad or A:cp-upload_eqiad or A:cp-codfw or A:cp-text_codfw or A:cp-upload_codfw or A:cp-esams or A:cp-text_esams or A:cp-upload_esams or A:cp-ulsfo or A:cp-text_ulsfo or A:cp-upload_ulsfo or A:cp-eqsin or A:cp-text_eqsin or A:cp-upload_eqsin or A:c [17:32:24] p-drmrs or A:cp-text_drmrs or A:cp-upload_drmrs) [17:36:12] (03CR) 10Andrew Bogott: [C: 03+2] P:quarry::base: fix file permissions [puppet] - 10https://gerrit.wikimedia.org/r/932299 (https://phabricator.wikimedia.org/T340114) (owner: 10Majavah) [17:37:23] (03CR) 10Ssingh: [V: 03+1] "Merging this on Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/932248 (owner: 10Ssingh) [17:48:54] (03Abandoned) 10Vivian Rook: remove git bits as they haven't been doing anything [puppet] - 10https://gerrit.wikimedia.org/r/932295 (https://phabricator.wikimedia.org/T340114) (owner: 10Vivian Rook) [17:58:13] (03PS9) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) [17:59:56] (03PS10) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) [18:00:26] (03PS1) 10BryanDavis: wmcs: Configure sudo not to prompt for passwords [puppet] - 10https://gerrit.wikimedia.org/r/932309 (https://phabricator.wikimedia.org/T205463) [18:02:13] (03CR) 10Ayounsi: "Not sure what I did but at least now tests pass." [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [18:07:41] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: Configure sudo not to prompt for passwords [puppet] - 10https://gerrit.wikimedia.org/r/932309 (https://phabricator.wikimedia.org/T205463) (owner: 10BryanDavis) [18:15:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:20:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:25:03] (03PS21) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [18:34:12] (03CR) 10Jcrespo: "As mentioned on the ticket, I am not working on this anymore. But happy to review any amends." [puppet] - 10https://gerrit.wikimedia.org/r/931880 (https://phabricator.wikimedia.org/T339894) (owner: 10Arturo Borrero Gonzalez) [18:43:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [18:43:45] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931280 (https://phabricator.wikimedia.org/T323484) (owner: 10FNegri) [18:48:19] (03PS1) 10Dzahn: vrts: rename otrs_aliases to vrts_aliases [puppet] - 10https://gerrit.wikimedia.org/r/932316 [18:49:30] (03PS2) 10Dzahn: vrts: rename otrs_aliases to vrts_aliases [puppet] - 10https://gerrit.wikimedia.org/r/932316 (https://phabricator.wikimedia.org/T280392) [18:51:02] (03CR) 10Dzahn: vrts: rename otrs_aliases to vrts_aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932316 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [18:53:00] (03PS1) 10Dzahn: vrts: rename exim config snippet [puppet] - 10https://gerrit.wikimedia.org/r/932317 [18:54:23] (03PS1) 10Dzahn: vrts::web: replace OTRS with VRTS in comments [puppet] - 10https://gerrit.wikimedia.org/r/932319 (https://phabricator.wikimedia.org/T280392) [18:54:47] (03CR) 10Dzahn: "comments only" [puppet] - 10https://gerrit.wikimedia.org/r/932319 
(https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [18:58:05] (03PS1) 10Dzahn: vrts: replace OTRS in Wikitech monitoring notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/932320 (https://phabricator.wikimedia.org/T280392) [19:03:40] (03PS1) 10Dzahn: vrts: replace OTRS string in exim4 config tempate [puppet] - 10https://gerrit.wikimedia.org/r/932322 (https://phabricator.wikimedia.org/T280392) [19:06:16] (03PS2) 10Dzahn: vrts: rename exim config snippet [puppet] - 10https://gerrit.wikimedia.org/r/932317 (https://phabricator.wikimedia.org/T280392) [19:07:17] (03PS3) 10Dzahn: vrts: rename exim config snippet [puppet] - 10https://gerrit.wikimedia.org/r/932317 (https://phabricator.wikimedia.org/T280392) [19:09:09] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [19:11:11] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [19:11:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [19:11:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:11:50] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache phab-test1001.eqiad.wmnet on all recursors [19:11:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) phab-test1001.eqiad.wmnet on all recursors [19:12:19] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [19:13:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM phab-test1001.eqiad.wmnet - dzahn@cumin1001" [19:14:45] 
!log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host phab-test1001.eqiad.wmnet with OS buster
[19:18:54] (PS1) Ryan Kemper: sre.wdqs.data-transfer: fix broken logic [cookbooks] - https://gerrit.wikimedia.org/r/932324 (https://phabricator.wikimedia.org/T321605)
[19:21:53] (PS6) Andrea Denisse: librenms: Change librenms path references for Debian package deployment [puppet] - https://gerrit.wikimedia.org/r/928890 (https://phabricator.wikimedia.org/T278309)
[19:23:44] SRE, ops-codfw, DC-Ops: sessionstore2001.codfw.wmnet unable to PXE boot - https://phabricator.wikimedia.org/T340055 (Dzahn) Could it be that the "1G RJ45/SFP converter" is the broken component?
[19:25:11] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[19:26:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on phab-test1001.eqiad.wmnet with reason: host reimage
[19:29:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab-test1001.eqiad.wmnet with reason: host reimage
[19:30:20] SRE, SRE-Access-Requests, Abstract Wikipedia team, Patch-For-Review: Please add Abstract Wiki team members to `deployment` and `deploy-service` prod SRE groups - https://phabricator.wikimedia.org/T339936 (thcipriani) >>! In T339936#8950370, @ssingh wrote: > 2. Further, @thcipriani this requires y...
[19:32:24] (CR) Andrew Bogott: [C: +2] wmcs: Configure sudo not to prompt for passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932309 (https://phabricator.wikimedia.org/T205463) (owner: BryanDavis)
[19:33:10] (PS1) Andrew Bogott: Revert "wmcs: Configure sudo not to prompt for passwords" [puppet] - https://gerrit.wikimedia.org/r/932269
[19:33:39] (PS2) Ryan Kemper: sre.wdqs.data-transfer: fix broken logic [cookbooks] - https://gerrit.wikimedia.org/r/932324 (https://phabricator.wikimedia.org/T321605)
[19:35:10] (CR) Andrew Bogott: [C: +2] Revert "wmcs: Configure sudo not to prompt for passwords" [puppet] - https://gerrit.wikimedia.org/r/932269 (owner: Andrew Bogott)
[19:41:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host phab-test1001.eqiad.wmnet with OS buster
[19:41:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host phab-test1001.eqiad.wmnet
[20:00:06] brennen and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230622T2000).
[20:00:43] Nothing in the queue ^^
[20:00:51] beauty
[20:01:02] {{done}} :)
[20:13:32] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:16:54] SRE, SRE-Access-Requests, Abstract Wikipedia team, Patch-For-Review: Please add Abstract Wiki team members to `deployment` and `deploy-service` prod SRE groups - https://phabricator.wikimedia.org/T339936 (Jdforrester-WMF) >>! In T339936#8951911, @taavi wrote: > `deployment` includes `deploy-servi...
[20:20:12] SRE, SRE-Access-Requests, Abstract Wikipedia team, Patch-For-Review: Please add Abstract Wiki team members to `deployment` and `deploy-service` prod SRE groups - https://phabricator.wikimedia.org/T339936 (taavi) >>!
In T339936#8957331, @Jdforrester-WMF wrote: >>>! In T339936#8951911, @taavi wrote...
[20:20:54] SRE, SRE-Access-Requests, Abstract Wikipedia team, Patch-For-Review: Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (Jdforrester-WMF)
[20:21:56] (PS1) Kosta Harlan: gitlab runner: Allow mariadb:* images [puppet] - https://gerrit.wikimedia.org/r/932328 (https://phabricator.wikimedia.org/T339352)
[20:26:24] (PS3) Dzahn: miscweb: add statictendril release to miscweb staging [deployment-charts] - https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171)
[20:26:35] (CR) Dzahn: miscweb: add statictendril release to miscweb staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/930886 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[20:36:39] (CR) BryanDavis: wmcs: Configure sudo not to prompt for passwords (1 comment) [puppet] - https://gerrit.wikimedia.org/r/932309 (https://phabricator.wikimedia.org/T205463) (owner: BryanDavis)
[20:36:54] (PS1) BryanDavis: wmcs: Configure sudo not to prompt for passwords (take 2) [puppet] - https://gerrit.wikimedia.org/r/932330 (https://phabricator.wikimedia.org/T205463)
[20:40:06] SRE, SRE-Access-Requests: Drop the `deploy-service` right, move two included users to `deployment` (or drop) - https://phabricator.wikimedia.org/T340165 (Jdforrester-WMF) p:Triage→Low
[20:40:25] SRE, SRE-Access-Requests: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)?
- https://phabricator.wikimedia.org/T340165 (Jdforrester-WMF)
[20:51:53] (PS5) Dzahn: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[20:52:18] (CR) CI reject: [V: -1] [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[20:55:48] (PS1) BCornwall: fixup! pybal: Fix hostnames not being sent on alert [puppet] - https://gerrit.wikimedia.org/r/932333
[20:56:22] (PS6) Dzahn: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[20:56:46] (CR) Dzahn: "I tried to do the manual rebase to bring this back despite changes that happened in between." [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[20:56:48] (CR) CI reject: [V: -1] [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[20:57:04] (Abandoned) BCornwall: fixup!
pybal: Fix hostnames not being sent on alert [puppet] - https://gerrit.wikimedia.org/r/932333 (owner: BCornwall)
[20:57:54] (PS7) Dzahn: [WIP] deployment: Use rsync::quickdatacopy, enable encryption [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[21:00:07] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[21:06:16] (PS3) Reedy: Remove references to auth-api.php [mediawiki-config] - https://gerrit.wikimedia.org/r/932270 (https://phabricator.wikimedia.org/T204193)
[21:07:19] (PS4) BCornwall: pybal: Fix hostnames not being sent on alert [puppet] - https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377)
[21:07:40] (CR) BCornwall: pybal: Fix hostnames not being sent on alert (2 comments) [puppet] - https://gerrit.wikimedia.org/r/913004 (https://phabricator.wikimedia.org/T322377) (owner: BCornwall)
[21:08:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:08:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:08:22] (CR) Dzahn: "just trying to do my part because I once commented on it in the past and it still seems valid to switch to standardized quickdatacopy, so " [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[21:09:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:09:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50134 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:11:56] (PS4) Dzahn: aphlict: add second envoy TLS
terminator for admin port [puppet] - https://gerrit.wikimedia.org/r/616917 (https://phabricator.wikimedia.org/T238593)
[21:12:42] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:14:48] (PS5) Dzahn: aphlict: add second envoy TLS terminator for admin port [puppet] - https://gerrit.wikimedia.org/r/616917 (https://phabricator.wikimedia.org/T238593)
[21:15:33] (CR) Dzahn: "I made new ticket https://phabricator.wikimedia.org/T340169 for just this to have something to link to. Now I will abandon this. But it's" [puppet] - https://gerrit.wikimedia.org/r/616917 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[21:16:00] (Abandoned) Dzahn: aphlict: add second envoy TLS terminator for admin port [puppet] - https://gerrit.wikimedia.org/r/616917 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[21:16:54] (Abandoned) Dzahn: admin: remove contint-roots from releases hosts [puppet] - https://gerrit.wikimedia.org/r/928108 (owner: Dzahn)
[21:18:40] (Abandoned) Dzahn: phabricator: add test for /r/ redirect to gerrit [puppet] - https://gerrit.wikimedia.org/r/879137 (https://phabricator.wikimedia.org/T324311) (owner: Dzahn)
[21:18:46] (PS5) JHathaway: dev env: sshd, allow for user CA based auth [puppet] - https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972)
[21:21:41] (CR) JHathaway: dev env: sshd, allow for user CA based auth (2 comments) [puppet] - https://gerrit.wikimedia.org/r/931694 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway)
[21:21:56] (CR) Thcipriani: [C: +1] gitlab runner: Allow mariadb:* images [puppet] - https://gerrit.wikimedia.org/r/932328 (https://phabricator.wikimedia.org/T339352) (owner: Kosta Harlan)
[21:27:15] (PS1) Dzahn: miscweb: remove static_tendril classes and files [puppet] - https://gerrit.wikimedia.org/r/932337
(https://phabricator.wikimedia.org/T300171)
[21:30:08] (PS1) Dzahn: miscweb: move tests for static_tendril to k8s tests file [puppet] - https://gerrit.wikimedia.org/r/932338 (https://phabricator.wikimedia.org/T300171)
[22:02:38] (CR) Andrew Bogott: [C: +2] wmcs: Configure sudo not to prompt for passwords (take 2) [puppet] - https://gerrit.wikimedia.org/r/932330 (https://phabricator.wikimedia.org/T205463) (owner: BryanDavis)
[22:07:28] (PS1) Andrew Bogott: Revert "wmcs: Configure sudo not to prompt for passwords (take 2)" [puppet] - https://gerrit.wikimedia.org/r/932271
[22:10:16] (CR) Andrew Bogott: [C: +2] Revert "wmcs: Configure sudo not to prompt for passwords (take 2)" [puppet] - https://gerrit.wikimedia.org/r/932271 (owner: Andrew Bogott)
[22:17:35] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:27:03] (CR) Dzahn: "3 years later.. (cough:) I wanted to amend to this and do what you suggested (option a, stick with single-parameter format, don't split) " [puppet] - https://gerrit.wikimedia.org/r/648385 (owner: Dzahn)
[22:29:25] SRE, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (akosiaris) >>! In T340036#8956346, @MSantos wrote: > @akosiaris the deadline we defined for the deprecation is July 1st 2023 (o...
[22:30:41] (PS9) Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385
[22:31:04] (CR) CI reject: [V: -1] httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385 (owner: Dzahn)
[22:32:16] (PS10) Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385
[22:32:43] (CR) CI reject: [V: -1] httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385 (owner: Dzahn)
[22:35:42] (PS1) Ahmon Dancy: Run LDAP group sync periodically on active gitlab server [puppet] - https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211)
[22:36:02] (PS11) Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385
[22:36:06] (CR) CI reject: [V: -1] Run LDAP group sync periodically on active gitlab server [puppet] - https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: Ahmon Dancy)
[22:38:20] (PS2) Ahmon Dancy: Run LDAP group sync periodically on active gitlab server [puppet] - https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211)
[22:38:26] (PS1) BryanDavis: wmcs: Configure sudo not to prompt for passwords (take 3) [puppet] - https://gerrit.wikimedia.org/r/932344 (https://phabricator.wikimedia.org/T205463)
[22:38:49] (PS2) BryanDavis: wmcs: Configure sudo not to prompt for passwords (take 3) [puppet] - https://gerrit.wikimedia.org/r/932344 (https://phabricator.wikimedia.org/T205463)
[22:43:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[22:48:10] (CR) BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/output/932344/41932/" [puppet] - https://gerrit.wikimedia.org/r/932344 (https://phabricator.wikimedia.org/T205463) (owner: BryanDavis)
[22:49:12] (PS12) Dzahn: httpbb: redesign how test suite files and dirs are created [puppet] - https://gerrit.wikimedia.org/r/648385
[22:50:10] (CR) Andrew Bogott: [C: +2] wmcs: Configure sudo not to prompt for passwords (take 3) [puppet] - https://gerrit.wikimedia.org/r/932344 (https://phabricator.wikimedia.org/T205463) (owner: BryanDavis)
[23:20:47] (CR) Tim Starling: [C: +1] Remove references to auth-api.php [mediawiki-config] - https://gerrit.wikimedia.org/r/932270 (https://phabricator.wikimedia.org/T204193) (owner: Reedy)