[00:07:04] (03PS1) 10Dzahn: phabricator: remove enable_vcs parameter set in eqiad-only [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) [00:10:53] (03PS1) 10Dzahn: phabricator: remove vcs support, pt1 [puppet] - 10https://gerrit.wikimedia.org/r/983957 [00:13:23] (03PS1) 10Dzahn: phabricator: remove vcs support, pt2 [puppet] - 10https://gerrit.wikimedia.org/r/983958 [00:15:10] (03PS1) 10Dzahn: phabricator: remove vcs support, pt3 [puppet] - 10https://gerrit.wikimedia.org/r/983959 [00:33:12] (03CR) 10Brennen Bearnes: [C: 04-1] "We're not, to my knowledge, planning to remove Diffusion - or at least not yet and not without further discussion." [puppet] - 10https://gerrit.wikimedia.org/r/983957 (owner: 10Dzahn) [00:37:58] (03CR) 10Dzahn: "it's based on the comment below "enable_vcs" -> "# This exists to offer git services at git-ssh.wikimedia.org."" [puppet] - 10https://gerrit.wikimedia.org/r/983957 (owner: 10Dzahn) [00:38:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/983242 [00:38:26] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/983242 (owner: 10TrainBranchBot) [00:58:36] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/983242 (owner: 10TrainBranchBot) [01:03:49] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T353681 (10phaultfinder) [01:04:54] (03PS1) 10DDesouza: Undeploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983962 (https://phabricator.wikimedia.org/T351353) [01:44:40] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:44:52] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:45:10] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:01:12] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:01:24] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:01:40] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:36:54] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T0300) [03:06:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.10 [core] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/983243 (https://phabricator.wikimedia.org/T350086) [03:07:03] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.10 [core] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/983243 (https://phabricator.wikimedia.org/T350086) (owner: 10TrainBranchBot) [03:08:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:10:00] PROBLEM - cassandra-c service on restbase2030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:10:22] PROBLEM - Check systemd state on restbase2030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:10:50] PROBLEM - cassandra-c CQL 10.192.16.245:9042 on restbase2030 is CRITICAL: connect to address 10.192.16.245 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [03:11:00] PROBLEM - cassandra-c SSL 10.192.16.245:7000 on restbase2030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [03:25:51] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.10 [core] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/983243 (https://phabricator.wikimedia.org/T350086) (owner: 10TrainBranchBot) [03:27:12] (03PS1) 10RLazarus: admin_ng: Split the sidecar-job-controller role into two [deployment-charts] - 10https://gerrit.wikimedia.org/r/983963 (https://phabricator.wikimedia.org/T348284) [03:29:40] RECOVERY - cassandra-c service on restbase2030 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:30:02] RECOVERY - Check systemd state on restbase2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:24] RECOVERY - cassandra-c CQL 10.192.16.245:9042 on restbase2030 is OK: TCP OK - 0.030 second response time on 10.192.16.245 port 9042 https://phabricator.wikimedia.org/T93886 [03:30:34] RECOVERY - cassandra-c SSL 10.192.16.245:7000 on restbase2030 is OK: SSL OK - Certificate restbase2030-c 
valid until 2025-12-06 17:50:18 +0000 (expires in 718 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [03:38:32] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:59:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T0400) [04:01:50] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983965 (https://phabricator.wikimedia.org/T350086) [04:01:52] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983965 (https://phabricator.wikimedia.org/T350086) (owner: 10TrainBranchBot) [04:02:34] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983965 (https://phabricator.wikimedia.org/T350086) (owner: 10TrainBranchBot) [04:03:00] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.10 refs T350086 [04:03:08] T350086: 1.42.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T350086 [04:04:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:21:57] (03PS4) 10KartikMistry: Update MinT to 2023-12-12-065316-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/982645 [04:22:06] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [04:28:11] * kart_ deploying MinT.. [04:28:36] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-12-12-065316-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/982645 (owner: 10KartikMistry) [04:29:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:29:30] (03Merged) 10jenkins-bot: Update MinT to 2023-12-12-065316-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/982645 (owner: 10KartikMistry) [04:32:36] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1002 is OK: Files ownership is ok. 
https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [04:36:45] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [04:40:01] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [04:43:39] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [04:49:14] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [04:49:48] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [04:54:03] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.10 refs T350086 (duration: 51m 03s) [04:54:08] T350086: 1.42.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T350086 [04:56:08] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [05:02:06] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:06:26] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:10:38] !log Updated MinT to 2023-12-12-065316-production [05:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:46] (03PS2) 10Peter Fischer: Search update pipeline: enable commonwiki and wikidatawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/983759 [05:59:28] (03PS1) 10Marostegui: pc1016: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/983972 [06:00:36] (03CR) 10Marostegui: [C: 03+2] pc1016: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/983972 (owner: 10Marostegui) [06:03:54] (03CR) 10Peter Fischer: [C: 03+2] Search update pipeline: enable commonwiki and 
wikidatawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/983759 (owner: 10Peter Fischer) [06:05:09] (03Merged) 10jenkins-bot: Search update pipeline: enable commonwiki and wikidatawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/983759 (owner: 10Peter Fischer) [06:07:46] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:07:46] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [06:13:17] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:29:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:34:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T0700) [07:00:05] kormat, marostegui, and Amir1: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T0700). 
[07:38:33] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:49:30] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:56:14] (03PS1) 10Kevin Bazira: ml-services: bump CPUs to compare with Research team benchmarks [deployment-charts] - 10https://gerrit.wikimedia.org/r/983244 (https://phabricator.wikimedia.org/T353127) [08:00:04] Amir1 and Urbanecm: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T0800). [08:00:04] No Gerrit patches in the queue for this window AFAICS. [08:06:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. Note this could much simpler if you simply use systemd::override in a future patch :-)" [puppet] - 10https://gerrit.wikimedia.org/r/983746 (owner: 10FNegri) [08:10:08] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:16:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10akosiaris) For the record, in SEL this host had logged ` ------------------------------------------------------------------------------- Record: 2 Date/Time... 
[08:17:22] !log jmm@cumin1002 START - Cookbook sre.hosts.decommission for hosts lists1003.wikimedia.org [08:17:54] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:14] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:56] !log jmm@cumin1002 START - Cookbook sre.dns.netbox [08:26:27] !log jmm@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lists1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin1002" [08:27:33] !log jmm@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lists1003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin1002" [08:27:33] !log jmm@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:27:34] !log jmm@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lists1003.wikimedia.org [08:27:38] 10SRE: Decommission lists1003 - https://phabricator.wikimedia.org/T353647 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1002 for hosts: `lists1003.wikimedia.org` - lists1003.wikimedia.org (**PASS**) - Downtimed host on Icinga/Alertmanager - Found Ganeti VM - VM shutdown - S... 
[08:30:05] (03PS1) 10Muehlenhoff: Remove lists1003 from acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/984099 (https://phabricator.wikimedia.org/T353647) [08:30:44] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:31:41] 10SRE, 10ops-codfw, 10serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (10akosiaris) Thanks @taavi for setting this host to inactive. The CPU 1 machine check error was also logged one more time, ` ----------------------------------------------------------------------------... [08:32:17] (03PS1) 10Muehlenhoff: Remove lists1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/984101 (https://phabricator.wikimedia.org/T343647) [08:34:30] (03CR) 10Muehlenhoff: [C: 03+2] Remove lists1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/984101 (https://phabricator.wikimedia.org/T343647) (owner: 10Muehlenhoff) [08:35:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/983355 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [08:37:10] (03CR) 10Alexandros Kosiaris: services_proxy: Listen on :: and not ::1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [08:37:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] services_proxy: Listen on :: and not ::1 [puppet] - 10https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [08:40:38] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:42:28] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:56] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:00] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:48:38] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/983355 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [08:49:56] (03CR) 10Muehlenhoff: [C: 03+2] On cumin1001 print a MOTD to use a different host [puppet] - 10https://gerrit.wikimedia.org/r/983355 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [08:49:58] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:52:04] 10SRE, 10Maps, 10Traffic, 10serviceops: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (10MSantos) @Nicolas_Raoul thanks for reaching out. I am one of the main maintainers of Maps and maybe the person that can help the approval process, however I wil... 
[08:52:24] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:30] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:53:14] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:02:18] (03CR) 10Vgutierrez: "change looks good, but lists1003 is still on site.pp under role(lists)" [puppet] - 10https://gerrit.wikimedia.org/r/984099 (https://phabricator.wikimedia.org/T353647) (owner: 10Muehlenhoff) [09:02:56] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:03:05] (03CR) 10Vgutierrez: [C: 03+1] "LGTM, ignore my last comment... 
L8 issues 😊" [puppet] - 10https://gerrit.wikimedia.org/r/984099 (https://phabricator.wikimedia.org/T353647) (owner: 10Muehlenhoff) [09:03:20] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:15] (03CR) 10Elukey: [C: 03+1] ml-services: bump CPUs to compare with Research team benchmarks [deployment-charts] - 10https://gerrit.wikimedia.org/r/983244 (https://phabricator.wikimedia.org/T353127) (owner: 10Kevin Bazira) [09:07:26] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:07:49] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/957/con" [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [09:09:45] (03CR) 10Filippo Giunchedi: [C: 04-1] verlib2: initial packaging (032 comments) [debs/python-verlib2] - 10https://gerrit.wikimedia.org/r/983468 (owner: 10Herron) [09:13:46] (03PS5) 10Filippo Giunchedi: thanos: add bucket query tools [puppet] - 10https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) [09:14:40] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:21:32] (03Abandoned) 10Brouberol: spark-history: enable definition of spark env vars in spark-env.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/983748 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [09:22:03] (03PS1) 10Alexandros Kosiaris: Fix for services_proxy listen [puppet] - 10https://gerrit.wikimedia.org/r/984104 (https://phabricator.wikimedia.org/T255568) [09:22:05] (03PS1) 10Alexandros Kosiaris: services_proxy: Switch listen_ipv6 to true by default [puppet] - 
10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) [09:23:49] !log reload thanos-rule on titan2001 [09:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:38] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:02] (03PS1) 10Brouberol: spark-history-analytics-hadoop: fix redirect and static links [deployment-charts] - 10https://gerrit.wikimedia.org/r/984127 (https://phabricator.wikimedia.org/T352863) [09:28:45] (03CR) 10Btullis: [C: 03+1] "Looks good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/983749 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [09:29:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] Fix for services_proxy listen [puppet] - 10https://gerrit.wikimedia.org/r/984104 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [09:29:31] (03CR) 10Filippo Giunchedi: "LGTM! Only a comment re: parameter name" [puppet] - 10https://gerrit.wikimedia.org/r/983948 (https://phabricator.wikimedia.org/T353672) (owner: 10Bking) [09:29:38] (03PS1) 10Brouberol: httpd-yarn: proxy reqs with a /spark-history prefix to the spark-history svc [puppet] - 10https://gerrit.wikimedia.org/r/984128 (https://phabricator.wikimedia.org/T352863) [09:30:58] (03PS4) 10Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. 
[puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) [09:32:16] (03Abandoned) 10Brouberol: spark-history: set public DNS to yarn.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/983749 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [09:33:08] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [09:35:24] (03PS2) 10Brouberol: httpd-yarn: proxy reqs with a /spark-history prefix to the spark-history svc [puppet] - 10https://gerrit.wikimedia.org/r/984128 (https://phabricator.wikimedia.org/T352863) [09:37:04] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:37:21] (03CR) 10Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:39:45] (03CR) 10Btullis: [C: 03+1] "I see, thanks. Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/984128 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [09:42:42] (03PS5) 10Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) [09:42:47] (03CR) 10Btullis: [C: 03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/984127 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [09:43:14] (03CR) 10Brouberol: [C: 03+2] spark-history-analytics-hadoop: fix redirect and static links [deployment-charts] - 10https://gerrit.wikimedia.org/r/984127 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [09:43:51] (03CR) 10Brouberol: [C: 03+2] httpd-yarn: proxy reqs with a /spark-history prefix to the spark-history svc [puppet] - 10https://gerrit.wikimedia.org/r/984128 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [09:45:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [09:45:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [09:46:04] (03CR) 10Btullis: "Nice. +1 from me, once Filippo's naming request has been completed." [puppet] - 10https://gerrit.wikimedia.org/r/983948 (https://phabricator.wikimedia.org/T353672) (owner: 10Bking) [09:46:39] (03CR) 10Btullis: [C: 03+2] Bump refine_sanitize refinery version to pickup fix for T349121 [puppet] - 10https://gerrit.wikimedia.org/r/983946 (https://phabricator.wikimedia.org/T349121) (owner: 10Xcollazo) [09:47:28] (03CR) 10Btullis: [C: 03+2] "Whoops. Apologies for the omission in the previous lpatch." [puppet] - 10https://gerrit.wikimedia.org/r/983946 (https://phabricator.wikimedia.org/T349121) (owner: 10Xcollazo) [09:48:10] (03CR) 10Volans: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [09:53:36] (03PS3) 10Btullis: Add kubeadm files for superset namespaces [puppet] - 10https://gerrit.wikimedia.org/r/983720 (https://phabricator.wikimedia.org/T347710) [09:55:10] (03CR) 10Btullis: Add kubeadm files for superset namespaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983720 (https://phabricator.wikimedia.org/T347710) (owner: 10Btullis) [09:57:00] (03CR) 10Stevemunene: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/983720 (https://phabricator.wikimedia.org/T347710) (owner: 10Btullis) [10:05:02] (03Abandoned) 10Brouberol: yarn: proxy the spark job history requests to the spark history service [puppet] - 10https://gerrit.wikimedia.org/r/983193 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [10:06:41] (03PS1) 10Btullis: Retrict access to the spark-history k8s API tokens [puppet] - 10https://gerrit.wikimedia.org/r/984130 (https://phabricator.wikimedia.org/T330176) [10:07:52] (03Abandoned) 10Brouberol: Configure the Spark History server host for the analytics yarn [puppet] - 10https://gerrit.wikimedia.org/r/981950 (https://phabricator.wikimedia.org/T352863) (owner: 10Brouberol) [10:08:14] (03CR) 10Muehlenhoff: [C: 03+2] Remove lists1003 from acmechief config [puppet] - 10https://gerrit.wikimedia.org/r/984099 (https://phabricator.wikimedia.org/T353647) (owner: 10Muehlenhoff) [10:09:13] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/958/con" [puppet] - 10https://gerrit.wikimedia.org/r/984130 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [10:09:35] (03PS1) 10Elukey: services: update rec-api's staging Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/984131 (https://phabricator.wikimedia.org/T205870) [10:10:10] (03CR) 10Muehlenhoff: [C: 
03+2] rsync::server: Add support for creating nftables-compatible firewall services [puppet] - 10https://gerrit.wikimedia.org/r/983437 (owner: 10Muehlenhoff) [10:11:27] (03CR) 10Elukey: [C: 03+2] services: update rec-api's staging Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/984131 (https://phabricator.wikimedia.org/T205870) (owner: 10Elukey) [10:14:03] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: sync [10:14:09] (03CR) 10JMeybohm: [C: 03+1] admin_ng: Split the sidecar-job-controller role into two (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/983963 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [10:14:18] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync [10:15:02] (03CR) 10Kevin Bazira: [C: 03+2] "Thanks for the review :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/983244 (https://phabricator.wikimedia.org/T353127) (owner: 10Kevin Bazira) [10:16:04] (03Merged) 10jenkins-bot: ml-services: bump CPUs to compare with Research team benchmarks [deployment-charts] - 10https://gerrit.wikimedia.org/r/983244 (https://phabricator.wikimedia.org/T353127) (owner: 10Kevin Bazira) [10:19:08] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:24:44] Hi WMDE-Mell [10:30:41] 10SRE, 10Observability-Metrics, 10Goal, 10Patch-For-Review: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10elukey) >>! In T205870#9413962, @colewhite wrote: >>>! In T205870#9413501, @elukey wrote: >> Tried to deploy rec-api without the statsd exporter, all good but the me... [10:31:33] (03PS6) 10Slyngshede: C:puppetmaster::monitoring Prometheus stats for Puppetmerge. 
[puppet] - 10https://gerrit.wikimedia.org/r/983376 (https://phabricator.wikimedia.org/T350694) [10:36:40] 10SRE: Decommission lists1003 - https://phabricator.wikimedia.org/T353647 (10MoritzMuehlenhoff) 05Open→03Resolved This is completed [10:36:47] (03PS2) 10Alexandros Kosiaris: services_proxy: Switch listen_ipv6 to true by default [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) [10:36:49] (03PS1) 10Alexandros Kosiaris: envoy-build-config: call extend() instead of append() if passed a list [puppet] - 10https://gerrit.wikimedia.org/r/984133 (https://phabricator.wikimedia.org/T255568) [10:36:51] (03PS1) 10Alexandros Kosiaris: Don't override the IPv6 stanza in services_proxy/envoy_service_listener.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984134 (https://phabricator.wikimedia.org/T255568) [10:37:50] (03PS2) 10Btullis: Retrict access to the spark-history k8s API tokens [puppet] - 10https://gerrit.wikimedia.org/r/984130 (https://phabricator.wikimedia.org/T330176) [10:37:53] (03PS2) 10Alexandros Kosiaris: envoy-build-config: call extend() instead of append() if passed a list [puppet] - 10https://gerrit.wikimedia.org/r/984133 (https://phabricator.wikimedia.org/T255568) [10:37:55] (03PS2) 10Alexandros Kosiaris: Don't override the IPv6 stanza in services_proxy/envoy_service_listener.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984134 (https://phabricator.wikimedia.org/T255568) [10:37:57] (03PS3) 10Alexandros Kosiaris: services_proxy: Switch listen_ipv6 to true by default [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) [10:38:44] (03Abandoned) 10Btullis: Add presto keytabs to the cluster coordinator replica role [puppet] - 10https://gerrit.wikimedia.org/r/709737 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [10:40:09] (03CR) 10Brouberol: [C: 03+1] "Oh, good thinking! Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/984130 (https://phabricator.wikimedia.org/T330176) (owner: 10Btullis) [10:42:04] (03Abandoned) 10Hnowlan: jobqueue: increase concurrency for thumbnailrender job [deployment-charts] - 10https://gerrit.wikimedia.org/r/971467 (owner: 10Hnowlan) [10:42:45] (03PS1) 10AikoChou: ml-services: change EventGate stream value for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/984135 (https://phabricator.wikimedia.org/T349919) [10:43:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] envoy-build-config: call extend() instead of append() if passed a list [puppet] - 10https://gerrit.wikimedia.org/r/984133 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [10:43:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] Don't override the IPv6 stanza in services_proxy/envoy_service_listener.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984134 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [10:43:36] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [10:46:29] !log installing perl security updates on bookworm [10:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:32] (03CR) 10Elukey: [C: 03+1] ml-services: change EventGate stream value for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/984135 (https://phabricator.wikimedia.org/T349919) (owner: 10AikoChou) [10:49:40] I am going to try something out on mwdebug1001; shout at me if you’re doing anything that could interfere (especially on the deployment server) [10:50:16] * MichaelG_WMDE is here as well and in the same call as Lucas_WMDE [10:52:12] (03PS3) 10Ilias Sarantopoulos: testwiki: enable revertrisk model in ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983747 (https://phabricator.wikimedia.org/T348298) [10:52:19] I’m going to pull 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/984136 directly to the deployment server and from there to mwdebug1001 (assuming Phan doesn’t complain) [10:52:23] without merging it or deploying it anywhere else [10:52:24] (03PS4) 10Ilias Sarantopoulos: testwiki: enable revertrisk model in ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983747 (https://phabricator.wikimedia.org/T348298) [10:52:29] so nobody else scap from the deployment server during that time please :) [10:52:43] (it wouldn’t be a huge disaster, hopefully, but I don’t need nor want this code outside the debug servers ^^) [10:52:49] I’ll git reset it on the deployment host afterwards [10:52:56] (and hopefully before the next deployment window begins) [10:56:04] switching wmf.9 extensions/Wikibase to PS2 of that change now [10:56:24] (previously on e1622f036ce8d17fafcc2896f75a2dec76f3ed0d) [10:56:58] pulled to mwdebug1001… [10:58:38] (03PS4) 10Alexandros Kosiaris: services_proxy: Switch listen_ipv6 to true by default [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) [10:58:40] (03PS1) 10Alexandros Kosiaris: Fix indentation in envoy_service_listener_yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984137 [10:58:53] reset the deployment checkout… [10:58:58] and scap pulled on mwdebug1001 again [10:59:03] * Lucas_WMDE all done for now [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1100) [11:01:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Enough clusters are showing diff, owned by many different teams to make this harder than I originally expected https://puppet-compiler.wmf" [puppet] - 10https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568) (owner: 10Alexandros Kosiaris) [11:01:29] (WidespreadPuppetFailure) firing: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:02:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] Fix indentation in envoy_service_listener_yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984137 (owner: 10Alexandros Kosiaris) [11:04:36] (03CR) 10Kosta Harlan: [C: 03+1] testwiki: enable revertrisk model in ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983747 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos) [11:07:30] (03CR) 10AikoChou: [C: 03+2] ml-services: change EventGate stream value for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/984135 (https://phabricator.wikimedia.org/T349919) (owner: 10AikoChou) [11:08:35] (03Merged) 10jenkins-bot: ml-services: change EventGate stream value for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/984135 (https://phabricator.wikimedia.org/T349919) (owner: 10AikoChou) [11:11:29] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:18:44] (03PS3) 10WMDE-Fisch: Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) [11:19:14] (03PS4) 10WMDE-Fisch: Remove wgPopupsReferencePreviews now that it defaults to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978035 (https://phabricator.wikimedia.org/T351708) (owner: 10Awight) [11:20:16] (03CR) 10WMDE-Fisch: "PS3: Manual rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: 10WMDE-Fisch) [11:20:25] (03CR) 10Mareike Heuer: [C: 03+1] Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: 10WMDE-Fisch) [11:20:30] (03CR) 10Mareike Heuer: [C: 03+1] Remove wgPopupsReferencePreviews now that it defaults to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978035 (https://phabricator.wikimedia.org/T351708) (owner: 10Awight) [11:24:25] (03PS1) 10Filippo Giunchedi: oauth2-proxy: use the same configuration as jaeger chart [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/984139 (https://phabricator.wikimedia.org/T320555) [11:27:28] (03CR) 10Clément Goubert: [C: 03+1] k8s topology labels: add row to rack transition [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [11:30:18] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: 10WMDE-Fisch) [11:31:16] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[11:38:14] (03PS1) 10Hnowlan: changeprop-jobqueue: move AssembleUploadChunks back to metal temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/984140 (https://phabricator.wikimedia.org/T352515) [11:39:13] (03PS1) 10Elukey: Set ipv6dualstack for ml-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/984141 (https://phabricator.wikimedia.org/T353622) [11:41:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/959/con" [puppet] - 10https://gerrit.wikimedia.org/r/984141 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [11:46:27] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, 10IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10Clement_Goubert) >>! In T271142#9413333, @akosiaris wrote: >>>! In T271142#9382040, @Volans wrote: >> Another... [11:46:29] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:47:35] (03CR) 10Clément Goubert: [C: 03+1] changeprop-jobqueue: move AssembleUploadChunks back to metal temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/984140 (https://phabricator.wikimedia.org/T352515) (owner: 10Hnowlan) [11:47:56] (03CR) 10JMeybohm: [C: 03+1] "Please keep in mind that this might not have any effect on already created service and/or pod objects" [puppet] - 10https://gerrit.wikimedia.org/r/984141 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [11:49:59] (03PS3) 10JMeybohm: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) [11:50:45] (03PS1) 10Alexandros 
Kosiaris: Indent correctly envoy_service_listener_af_common.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984145 [11:51:29] (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:51:43] (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:51:44] (03PS2) 10Alexandros Kosiaris: Indent correctly envoy_service_listener_af_common.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984145 [11:52:25] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984145 (owner: 10Alexandros Kosiaris) [11:55:26] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: move AssembleUploadChunks back to metal temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/984140 (https://phabricator.wikimedia.org/T352515) (owner: 10Hnowlan) [11:55:31] (03CR) 10Clément Goubert: [C: 03+1] Indent correctly envoy_service_listener_af_common.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984145 (owner: 10Alexandros Kosiaris) [11:56:32] (03Merged) 10jenkins-bot: changeprop-jobqueue: move AssembleUploadChunks back to metal temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/984140 (https://phabricator.wikimedia.org/T352515) (owner: 10Hnowlan) [11:57:40] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:52] (03PS1) 10Filippo Giunchedi: oauth2_proxy: skip provider button [puppet] - 10https://gerrit.wikimedia.org/r/984146 (https://phabricator.wikimedia.org/T331512) [12:00:16] (03PS12) 
10Slyngshede: Move Debmonitor client code to separate repository. [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 [12:00:34] (03PS1) 10Muehlenhoff: Failover idp.w.o to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/984147 [12:00:47] (03PS3) 10Alexandros Kosiaris: Indent correctly envoy_service_listener_af_common.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984145 [12:01:00] (03PS3) 10JMeybohm: Add more calico alerts [alerts] - 10https://gerrit.wikimedia.org/r/983446 (https://phabricator.wikimedia.org/T353463) [12:01:18] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984145 (owner: 10Alexandros Kosiaris) [12:02:00] (03CR) 10JMeybohm: Add more calico alerts (037 comments) [alerts] - 10https://gerrit.wikimedia.org/r/983446 (https://phabricator.wikimedia.org/T353463) (owner: 10JMeybohm) [12:02:10] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:12] (03CR) 10Clément Goubert: [C: 03+1] testreduce: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/982419 (owner: 10Muehlenhoff) [12:02:28] (03CR) 10Slyngshede: Move Debmonitor client code to separate repository. (032 comments) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [12:02:39] (03CR) 10Slyngshede: [C: 03+2] Move Debmonitor client code to separate repository. [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [12:02:55] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Move Debmonitor client code to separate repository. [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [12:04:26] (03Merged) 10jenkins-bot: Move Debmonitor client code to separate repository. 
[software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/981463 (owner: 10Slyngshede) [12:06:10] (03PS1) 10Kosta Harlan: Send PhotoDNA the mime type of the thumbnail and not original file [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984166 (https://phabricator.wikimedia.org/T351401) [12:07:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] Indent correctly envoy_service_listener_af_common.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984145 (owner: 10Alexandros Kosiaris) [12:07:21] (03PS4) 10Alexandros Kosiaris: Indent correctly envoy_service_listener_af_common.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984145 [12:07:24] (03CR) 10Alexandros Kosiaris: [V: 03+2] Indent correctly envoy_service_listener_af_common.yaml.erb [puppet] - 10https://gerrit.wikimedia.org/r/984145 (owner: 10Alexandros Kosiaris) [12:10:55] 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10MoritzMuehlenhoff) I created a PoC forward port of OpenSSL 1.1.1w which is co-installable with the OpenSSL packages from Bookworm. The following binary packages are built: > dpkg-deb: building package... 
[12:12:11] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [12:13:07] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp.w.o to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/984147 (owner: 10Muehlenhoff) [12:21:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on ldap-rw[1001,2001].wikimedia.org with reason: WIP [12:21:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on ldap-rw[1001,2001].wikimedia.org with reason: WIP [12:22:05] ACKNOWLEDGEMENT - LDAP -writable server- on ldap-rw1001 is CRITICAL: Could not search/find objectclasses in dc=wikimedia,dc=org Muehlenhoff In setup https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [12:22:05] ACKNOWLEDGEMENT - LDAP -writable server- on ldap-rw2001 is CRITICAL: Could not search/find objectclasses in dc=wikimedia,dc=org Muehlenhoff In setup https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [12:24:29] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [12:24:38] (03CR) 10Muehlenhoff: [C: 03+2] testreduce: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/982419 (owner: 10Muehlenhoff) [12:24:46] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [12:26:04] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:12] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:37:49] (03PS1) 10Muehlenhoff: Switch testreduce to nftables [puppet] - 10https://gerrit.wikimedia.org/r/984159 [12:42:56] (03CR) 10Slyngshede: "recheck" [software/debmonitor-client] (debian) - 
10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [12:43:29] (03PS1) 10Muehlenhoff: peopleweb: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/984160 [12:43:31] (03PS1) 10Muehlenhoff: Switch peopleweb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/984161 [12:43:44] (03PS1) 10Jgiannelos: proton: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/984162 [12:44:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984160 (owner: 10Muehlenhoff) [12:45:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984159 (owner: 10Muehlenhoff) [12:45:48] (03CR) 10Jgiannelos: [C: 03+2] proton: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/984162 (owner: 10Jgiannelos) [12:46:17] (03PS3) 10Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 [12:46:42] (03Merged) 10jenkins-bot: proton: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/984162 (owner: 10Jgiannelos) [12:46:51] (03CR) 10CI reject: [V: 04-1] Debian packaging configuration [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [12:52:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984161 (owner: 10Muehlenhoff) [12:54:41] (03PS1) 10Muehlenhoff: piwik: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/984163 [12:56:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984163 (owner: 10Muehlenhoff) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1300) [13:00:35] (03PS1) 10Muehlenhoff: Switch piwik to nftables [puppet] - 10https://gerrit.wikimedia.org/r/984164 [13:02:06] !log jgiannelos@deploy2002 helmfile 
[staging] START helmfile.d/services/proton: apply [13:03:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984164 (owner: 10Muehlenhoff) [13:05:08] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:05:16] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:05:19] (03PS1) 10Kosta Harlan: statsd: Log check attempt failures [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984168 (https://phabricator.wikimedia.org/T353441) [13:08:08] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:08:09] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:10:54] I am trying to deploy a chromium-render version bump, but the staging deployment is stuck with status pending. Can somebody help me with that? [13:10:59] cc hnowlan ^ [13:12:09] It looks like there is a problem scheduling the pod: ` 0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.` [13:12:22] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [13:12:26] which is a first for me when deploying changes [13:12:49] Also, the deployment just failed [13:15:34] (03CR) 10Filippo Giunchedi: [C: 03+1] Add more calico alerts [alerts] - 10https://gerrit.wikimedia.org/r/983446 (https://phabricator.wikimedia.org/T353463) (owner: 10JMeybohm) [13:17:52] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@40c15b1]: (no justification provided) [13:22:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: 10WMDE-Fisch) [13:22:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 
03+1] Remove wgPopupsReferencePreviews now that it defaults to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978035 (https://phabricator.wikimedia.org/T351708) (owner: 10Awight) [13:23:41] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] testwiki: enable revertrisk model in ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983747 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos) [13:27:45] (03PS1) 10Ayounsi: Release v0.6.5 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/984191 [13:28:30] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/984191 (owner: 10Ayounsi) [13:29:18] (03CR) 10Elukey: [V: 03+1 C: 03+2] Set ipv6dualstack for ml-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/984141 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [13:29:57] (03CR) 10Ayounsi: [C: 03+2] "No breaking change in Paramiko's version bump: https://www.paramiko.org/changelog.html" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/984191 (owner: 10Ayounsi) [13:31:30] (03PS1) 10Majavah: dynamicproxy: increase max client body size [puppet] - 10https://gerrit.wikimedia.org/r/984192 (https://phabricator.wikimedia.org/T353698) [13:32:49] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin1001.eqiad.wmnet with reason: Release v0.6.5 - ayounsi@cumin1001 [13:33:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin1001.eqiad.wmnet with reason: Release v0.6.5 - ayounsi@cumin1001 [13:35:57] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: Release v0.6.5 - ayounsi@cumin1001 [13:36:40] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-apiserver-safe-restart.service,kube-apiserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:42] 
PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:36:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: Release v0.6.5 - ayounsi@cumin1001 [13:36:52] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:36:56] (ProbeDown) firing: (4) Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:09] this is me --^ [13:37:16] ipv6 dual stack didn't work as expected [13:38:04] (03CR) 10FNegri: [C: 03+1] dynamicproxy: increase max client body size [puppet] - 10https://gerrit.wikimedia.org/r/984192 (https://phabricator.wikimedia.org/T353698) (owner: 10Majavah) [13:38:12] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-apiserver-safe-restart.service,kube-apiserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:16] (03CR) 10Majavah: [C: 03+2] dynamicproxy: increase max client body size [puppet] - 10https://gerrit.wikimedia.org/r/984192 (https://phabricator.wikimedia.org/T353698) (owner: 10Majavah) [13:38:50] (KubernetesAPINotScrapable) firing: k8s-mlstaging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [13:41:18] (03PS2) 10Kosta Harlan: Add maintenance script to scan files in the mediamoderation_scan table [extensions/MediaModeration] 
(wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984169 (https://phabricator.wikimedia.org/T351399) [13:41:41] (03Abandoned) 10Kosta Harlan: statsd: Log check attempt failures [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984168 (https://phabricator.wikimedia.org/T353441) (owner: 10Kosta Harlan) [13:42:03] (03CR) 10Kosta Harlan: "This cherry-pick also includes I6c60e471c46dfcf9403c494587e6cae4a344d03a" [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984169 (https://phabricator.wikimedia.org/T351399) (owner: 10Kosta Harlan) [13:45:18] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@40c15b1]: (no justification provided) (duration: 27m 26s) [13:47:27] (03PS1) 10Elukey: Revert "Set ipv6dualstack for ml-staging-codfw" [puppet] - 10https://gerrit.wikimedia.org/r/984170 [13:48:28] (03CR) 10Elukey: [C: 03+2] Revert "Set ipv6dualstack for ml-staging-codfw" [puppet] - 10https://gerrit.wikimedia.org/r/984170 (owner: 10Elukey) [13:50:37] (03PS1) 10Slyngshede: Review access change [software/debmonitor-client] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/984171 [13:51:03] the issue was https://phabricator.wikimedia.org/T335285 [13:51:10] I am reverting, should recover in a bit [13:51:36] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:40] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:18] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:53:33] (ProbeDown) firing: (4) Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:50] (KubernetesAPINotScrapable) resolved: k8s-mlstaging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [13:56:08] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:56:56] (ProbeDown) resolved: (4) Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:58:00] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.006e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [13:59:14] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Review access change [software/debmonitor-client] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/984171 (owner: 10Slyngshede) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1400). [14:00:05] isaranto and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] o/ [14:00:46] o_O ReferencePreviews beta stuff not being removed after all? 
[14:01:04] here [14:02:00] (03PS3) 10Bking: prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - 10https://gerrit.wikimedia.org/r/983948 (https://phabricator.wikimedia.org/T353672) [14:02:29] (03PS1) 10Kosta Harlan: WIP: Add MediaModeration module and support for running updateMetrics [puppet] - 10https://gerrit.wikimedia.org/r/984196 [14:02:50] alright, I can deploy [14:02:57] isaranto: around? [14:03:07] hey I'm here! [14:03:16] ok! [14:03:26] (03PS4) 10Slyngshede: Debian packaging configuration [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 [14:03:40] (03CR) 10Slyngshede: "recheck" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/982391 (owner: 10Slyngshede) [14:03:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983747 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos) [14:04:28] (03Merged) 10jenkins-bot: testwiki: enable revertrisk model in ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983747 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos) [14:04:31] (03CR) 10Urbanecm: [C: 04-1] Temporary users: set notifyBeforeExpirationDays same as expireAfterDays (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983755 (https://phabricator.wikimedia.org/T344694) (owner: 10Sergio Gimeno) [14:04:47] (03PS2) 10Sergio Gimeno: Temporary users: set notifyBeforeExpirationDays same as expireAfterDays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983755 (https://phabricator.wikimedia.org/T344694) [14:04:59] kostajh: shouldn’t one of the backports be based on the other? 
[14:05:10] I don’t think we usually do merge commits on the wmf branches [14:05:22] (03CR) 10CI reject: [V: 04-1] WIP: Add MediaModeration module and support for running updateMetrics [puppet] - 10https://gerrit.wikimedia.org/r/984196 (owner: 10Kosta Harlan) [14:05:24] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10Patch-For-Review: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428 (10Volans) Thanks for the patch, it would take a bit to do a full pass given the size. I agree... [14:05:28] (03CR) 10CI reject: [V: 04-1] Temporary users: set notifyBeforeExpirationDays same as expireAfterDays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983755 (https://phabricator.wikimedia.org/T344694) (owner: 10Sergio Gimeno) [14:05:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "starting gate-and-submit ahead of deployment" [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984166 (https://phabricator.wikimedia.org/T351401) (owner: 10Kosta Harlan) [14:05:34] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:983747|testwiki: enable revertrisk model in ores extension (T348298)]] [14:05:40] T348298: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 [14:05:46] Lucas_WMDE: I think they are both independent of each other? 
[14:06:01] you should be able to `scap backport` both change IDs together, I think [14:06:29] I did have to cherry-pick another proposed backport in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaModeration/+/984169/2#message-c484416faf05c6056214245119dd621fddaa4607 [14:06:54] (03CR) 10Bking: prometheus-blackbox: Expose ability to add arbitrary headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983948 (https://phabricator.wikimedia.org/T353672) (owner: 10Bking) [14:07:11] (03PS3) 10Lucas Werkmeister (WMDE): Add maintenance script to scan files in the mediamoderation_scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984169 (https://phabricator.wikimedia.org/T351399) (owner: 10Kosta Harlan) [14:07:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "starting gate-and-submit ahead of deployment" [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984169 (https://phabricator.wikimedia.org/T351399) (owner: 10Kosta Harlan) [14:07:45] I rebased the second one now [14:08:02] !log lucaswerkmeister-wmde@deploy2002 isaranto and lucaswerkmeister-wmde: Backport for [[gerrit:983747|testwiki: enable revertrisk model in ores extension (T348298)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:08:11] (03Merged) 10jenkins-bot: Send PhotoDNA the mime type of the thumbnail and not original file [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - 10https://gerrit.wikimedia.org/r/984166 (https://phabricator.wikimedia.org/T351401) (owner: 10Kosta Harlan) [14:08:21] testing! [14:08:22] ok! 
[14:08:35] (PS1) Kamila Součková: Lower a few random CPU requests to unbreak staging [deployment-charts] - https://gerrit.wikimedia.org/r/984198
[14:10:19] (Merged) jenkins-bot: Add maintenance script to scan files in the mediamoderation_scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/984169 (https://phabricator.wikimedia.org/T351399) (owner: Kosta Harlan)
[14:10:21] everything seems fine on my end. I manually triggered a job on mwdebug
[14:10:25] ok, thanks!
[14:10:27] !log lucaswerkmeister-wmde@deploy2002 isaranto and lucaswerkmeister-wmde: Continuing with sync
[14:10:34] thank you!
[14:11:37] (CR) Volans: "couple of questions inline" [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391 (owner: Slyngshede)
[14:15:56] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:983747|testwiki: enable revertrisk model in ores extension (T348298)]] (duration: 10m 22s)
[14:16:07] T348298: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298
[14:16:23] alright, let’s do MediaModeration then
[14:16:24] (CR) Brouberol: [C: +1] wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[14:16:48] (CR) Brouberol: [C: +1] wdqs: Enable ipv6 for envoy tls_terminator [puppet] - https://gerrit.wikimedia.org/r/983893 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[14:17:05] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:984166|Send PhotoDNA the mime type of the thumbnail and not original file (T351401)]], [[gerrit:984169|Add maintenance script to scan files in the mediamoderation_scan table (T351399)]]
[14:17:11] T351401: Create service(s) to send an image to PhotoDNA for a scan - https://phabricator.wikimedia.org/T351401
[14:17:11] T351399: Create a maintenance script to automatically scan files listed in mediamoderation_scan - https://phabricator.wikimedia.org/T351399
[14:17:21] (CR) Alexandros Kosiaris: [C: +1] Lower a few random CPU requests to unbreak staging [deployment-charts] - https://gerrit.wikimedia.org/r/984198 (owner: Kamila Součková)
[14:17:59] (CR) Kamila Součková: [C: +2] Lower a few random CPU requests to unbreak staging [deployment-charts] - https://gerrit.wikimedia.org/r/984198 (owner: Kamila Součková)
[14:18:22] (CR) Bking: [C: +2] wdqs ldf: remove useragent param [puppet] - https://gerrit.wikimedia.org/r/983947 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[14:18:37] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and kharlan: Backport for [[gerrit:984166|Send PhotoDNA the mime type of the thumbnail and not original file (T351401)]], [[gerrit:984169|Add maintenance script to scan files in the mediamoderation_scan table (T351399)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:18:47] kostajh: can you test the change?
[14:18:55] (on testwiki, I guess, since not much else is on wmf.10 yet ^^)
[14:19:05] (Merged) jenkins-bot: Lower a few random CPU requests to unbreak staging [deployment-charts] - https://gerrit.wikimedia.org/r/984198 (owner: Kamila Součková)
[14:19:13] Lucas_WMDE: we have a deployment plan that involves a couple of steps, so I'd prefer to just sync it.
[14:19:20] ok
[14:19:22] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and kharlan: Continuing with sync
[14:19:35] jouncebot: next
[14:19:35] In 1 hour(s) and 40 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1600)
[14:19:58] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:20:01] (I have some backports I’d like to do afterwards, but due to meetings they might have to happen after the window)
[14:21:08] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:21:29] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply
[14:22:16] (CR) JMeybohm: [C: +2] Add more calico alerts [alerts] - https://gerrit.wikimedia.org/r/983446 (https://phabricator.wikimedia.org/T353463) (owner: JMeybohm)
[14:22:37] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply
[14:23:20] (CR) Alexandros Kosiaris: [C: +1] oauth2-proxy: use the same configuration as jaeger chart [docker-images/production-images] - https://gerrit.wikimedia.org/r/984139 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[14:23:31] (Merged) jenkins-bot: Add more calico alerts [alerts] - https://gerrit.wikimedia.org/r/983446 (https://phabricator.wikimedia.org/T353463) (owner: JMeybohm)
[14:24:13] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:24:28] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:24:43] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:24:54] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:24:59] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:984166|Send PhotoDNA the mime type of the thumbnail and not original file (T351401)]], [[gerrit:984169|Add maintenance script to scan files in the mediamoderation_scan table (T351399)]] (duration: 07m 53s)
[14:24:59] (CR) Alexandros Kosiaris: [C: +2] cli.py: Improve help text for docker-pkg help 'select' argument [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/983763 (owner: Ahmon Dancy)
[14:25:06] T351401: Create service(s) to send an image to PhotoDNA for a scan - https://phabricator.wikimedia.org/T351401
[14:25:07] T351399: Create a maintenance script to automatically scan files listed in mediamoderation_scan - https://phabricator.wikimedia.org/T351399
[14:25:59] !log UTC afternoon backport+config window done
[14:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:13] (I might still do backports before the next real window starts but we can consider this window done for now)
[14:26:14] (CR) Bking: [C: +2] wdqs: Enable ipv6 for envoy tls_terminator [puppet] - https://gerrit.wikimedia.org/r/983893 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[14:27:10] (Merged) jenkins-bot: cli.py: Improve help text for docker-pkg help 'select' argument [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/983763 (owner: Ahmon Dancy)
[14:29:29] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply
[14:29:32] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply
[14:30:16] Lucas_WMDE: thanks for running the backports
[14:30:24] np :)
[14:30:49] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply
[14:31:30] no meeting after all \o/
[14:31:34] * Lucas_WMDE sets up backports
[14:31:47] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[14:32:02] (CR) Herron: [C: +1] thanos: add bucket query tools [puppet] - https://gerrit.wikimedia.org/r/983752 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi)
[14:32:22] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply
[14:32:22] (PS1) Lucas Werkmeister (WMDE): Make SearchEntitiesIntegrationTest an ApiTestCase [extensions/WikibaseCirrusSearch] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984172 (https://phabricator.wikimedia.org/T353334)
[14:32:29] (PS1) Lucas Werkmeister (WMDE): Use link batch in search APIs [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984173 (https://phabricator.wikimedia.org/T353334)
[14:32:46] (PS1) Aqu: Airflow metrics configuration adjustement [puppet] - https://gerrit.wikimedia.org/r/984200 (https://phabricator.wikimedia.org/T349532)
[14:32:53] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseCirrusSearch] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984172 (https://phabricator.wikimedia.org/T353334) (owner: Lucas Werkmeister (WMDE))
[14:32:59] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984173 (https://phabricator.wikimedia.org/T353334) (owner: Lucas Werkmeister (WMDE))
[14:33:16] (sadly our CI is somewhat slower than MediaModeration’s ;_;)
[14:33:43] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[14:34:16] (PS3) Herron: pyrra: onboard varnish-requests as pilot SLO [puppet] - https://gerrit.wikimedia.org/r/967950 (https://phabricator.wikimedia.org/T302995)
[14:35:51] Lucas_WMDE: I might try to backport one fix to MediaModeration
[14:35:59] as we found an issue while running it on testwiki
[14:36:06] ah, ok
[14:36:54] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:51] (PS3) Sergio Gimeno: Temporary users: set notifyBeforeExpirationDays same as expireAfterDays [mediawiki-config] - https://gerrit.wikimedia.org/r/983755 (https://phabricator.wikimedia.org/T344694)
[14:38:27] (PS1) AikoChou: ml-services: update outlink topic model image on ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/984203 (https://phabricator.wikimedia.org/T352834)
[14:38:33] (CR) Herron: [C: +2] pyrra: onboard varnish-requests as pilot SLO [puppet] - https://gerrit.wikimedia.org/r/967950 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[14:38:39] ugh, one of my builds already failed randomly
[14:38:46] (PS5) Alexandros Kosiaris: services_proxy: Switch listen_ipv6 to true by default [puppet] - https://gerrit.wikimedia.org/r/984105 (https://phabricator.wikimedia.org/T255568)
[14:38:48] (PS1) Alexandros Kosiaris: Fix insetup role for restbase203[3-5] [puppet] - https://gerrit.wikimedia.org/r/984204 (https://phabricator.wikimedia.org/T352468)
[14:39:55] jouncebot: nowandnext
[14:39:55] For the next 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1400)
[14:39:55] In 1 hour(s) and 20 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1600)
[14:40:19] Lucas_WMDE: if you have some free time, we can squeeze this noop patch? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/983758
[14:41:17] I’m currently waiting for gate-and-submit but I’ll see if I can fit it in later
[14:41:25] Amir1: I assume the name isn’t used anywhere else yet?
[14:41:32] yup
[14:41:35] ok
[14:41:46] still waiting for review https://gerrit.wikimedia.org/r/c/mediawiki/core/+/976765/
[14:41:46] (PS1) Kosta Harlan: Use main replica DB in importExistingFilesToScanTable.php [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/984174
[14:41:53] Lucas_WMDE: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaModeration/+/984174 is my backport
[14:41:59] I'll add to the calendar
[14:42:57] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[14:43:19] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:43:35] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[14:43:43] (PS2) Alexandros Kosiaris: Fix insetup role for restbase203[3-5] [puppet] - https://gerrit.wikimedia.org/r/984204 (https://phabricator.wikimedia.org/T352468)
[14:44:00] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:44:12] (CR) Aqu: [V: +1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - https://gerrit.wikimedia.org/r/984200 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[14:44:36] (PS2) Kosta Harlan: WIP: Add MediaModeration module and support for running updateMetrics [puppet] - https://gerrit.wikimedia.org/r/984196
[14:44:37] (updated the calendar)
[14:44:53] (PS1) FNegri: Revert "[toolsdb] Lower innodb_buffer_pool_size" [puppet] - https://gerrit.wikimedia.org/r/984207
[14:45:37] (CR) Filippo Giunchedi: [C: +1] prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983948 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[14:45:57] (CR) Bking: [C: +2] prometheus-blackbox: Expose ability to add arbitrary headers [puppet] - https://gerrit.wikimedia.org/r/983948 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[14:46:32] (CR) Alexandros Kosiaris: [C: +2] Fix insetup role for restbase203[3-5] [puppet] - https://gerrit.wikimedia.org/r/984204 (https://phabricator.wikimedia.org/T352468) (owner: Alexandros Kosiaris)
[14:46:38] kostajh: ok
[14:46:58] (PS2) FNegri: Revert "[toolsdb] Lower innodb_buffer_pool_size" [puppet] - https://gerrit.wikimedia.org/r/984207 (https://phabricator.wikimedia.org/T353093)
[14:47:31] (CR) CI reject: [V: -1] WIP: Add MediaModeration module and support for running updateMetrics [puppet] - https://gerrit.wikimedia.org/r/984196 (owner: Kosta Harlan)
[14:48:43] (CR) Filippo Giunchedi: [C: +2] oauth2-proxy: use the same configuration as jaeger chart [docker-images/production-images] - https://gerrit.wikimedia.org/r/984139 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[14:48:45] (CR) Filippo Giunchedi: [V: +2 C: +2] oauth2-proxy: use the same configuration as jaeger chart [docker-images/production-images] - https://gerrit.wikimedia.org/r/984139 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[14:48:47] (CR) FNegri: "There are now >40G of free memory in the instance, so we can set this back to its original value." [puppet] - https://gerrit.wikimedia.org/r/984207 (https://phabricator.wikimedia.org/T353093) (owner: FNegri)
[14:49:20] (CR) Majavah: [C: +1] Revert "[toolsdb] Lower innodb_buffer_pool_size" [puppet] - https://gerrit.wikimedia.org/r/984207 (https://phabricator.wikimedia.org/T353093) (owner: FNegri)
[14:50:47] (PS1) Elukey: kubernetes: update IPv6 service IP ranges for ML clusters [puppet] - https://gerrit.wikimedia.org/r/984209 (https://phabricator.wikimedia.org/T353705)
[14:51:09] kostajh: is that backport urgent btw? ^^
[14:51:31] (since mine is having issues, but I’d like to finish it first and not leave the deployment branches inconsistent)
[14:51:38] (CR) CI reject: [V: -1] Make SearchEntitiesIntegrationTest an ApiTestCase [extensions/WikibaseCirrusSearch] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984172 (https://phabricator.wikimedia.org/T353334) (owner: Lucas Werkmeister (WMDE))
[14:51:53] Lucas_WMDE: it could wait until later, but would it be possible to run after yours?
[14:51:58] (Merged) jenkins-bot: Make SearchEntitiesIntegrationTest an ApiTestCase [extensions/WikibaseCirrusSearch] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984172 (https://phabricator.wikimedia.org/T353334) (owner: Lucas Werkmeister (WMDE))
[14:52:36] kostajh: yup, I was just wondering if it should happen in between
[14:52:39] I can come back to it in the later evening backport window, if need be. But would prefer not to, if we can sync it after yours
[14:53:04] (CR) CI reject: [V: -1] Use link batch in search APIs [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984173 (https://phabricator.wikimedia.org/T353334) (owner: Lucas Werkmeister (WMDE))
[14:53:30] (CR) Lucas Werkmeister (WMDE): "flaky, try that again" [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984173 (https://phabricator.wikimedia.org/T353334) (owner: Lucas Werkmeister (WMDE))
[14:53:35] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984173 (https://phabricator.wikimedia.org/T353334) (owner: Lucas Werkmeister (WMDE))
[14:53:47] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984173 (https://phabricator.wikimedia.org/T353334) (owner: Lucas Werkmeister (WMDE))
[14:56:54] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:57:17] (CR) Ilias Sarantopoulos: [C: +1] ml-services: update outlink topic model image on ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/984203 (https://phabricator.wikimedia.org/T352834) (owner: AikoChou)
[14:57:46] Lucas_WMDE: I'm stepping away, no need to verify the patch, it can just sync out whenever is convenient.
[14:57:53] ok
[14:58:31] If you want any clarification etc., I am around too (I wrote the patch that is being backported).
[14:58:43] (PS1) Bking: wdqs: Add Accept: header to LDF endpoint check [puppet] - https://gerrit.wikimedia.org/r/984212 (https://phabricator.wikimedia.org/T353672)
[14:59:25] (CR) Lucas Werkmeister (WMDE): [C: +1] Use main replica DB in importExistingFilesToScanTable.php [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/984174 (owner: Kosta Harlan)
[14:59:33] Dreamy_Jazz: ack, thanks!
[15:00:28] (PS2) Bking: wdqs: Add Accept: header to LDF endpoint check [puppet] - https://gerrit.wikimedia.org/r/984212 (https://phabricator.wikimedia.org/T353672)
[15:00:54] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984212 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[15:01:19] (PS1) Aklapper: AVA: Remove unused variable; take age into account [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/984213 (https://phabricator.wikimedia.org/T338611)
[15:01:27] (PS1) Elukey: admin_ng: set new Istio Service Entry for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/984214 (https://phabricator.wikimedia.org/T353622)
[15:01:29] (PS1) Elukey: ml-services: force HTTP in revert-risk agnostic staging [deployment-charts] - https://gerrit.wikimedia.org/r/984215 (https://phabricator.wikimedia.org/T353622)
[15:01:42] (CR) FNegri: [C: +2] Revert "[toolsdb] Lower innodb_buffer_pool_size" [puppet] - https://gerrit.wikimedia.org/r/984207 (https://phabricator.wikimedia.org/T353093) (owner: FNegri)
[15:02:25] (PS1) Ayounsi: Netbox report, reduce alerting spam [puppet] - https://gerrit.wikimedia.org/r/984217 (https://phabricator.wikimedia.org/T321704)
[15:05:35] (CR) Muehlenhoff: Debian packaging configuration (2 comments) [software/debmonitor-client] (debian) - https://gerrit.wikimedia.org/r/982391 (owner: Slyngshede)
[15:06:09] (PS1) FNegri: [toolsdb] remove leftover template [puppet] - https://gerrit.wikimedia.org/r/984218
[15:07:43] (CR) FNegri: [C: +2] mariadb::service chmod override file [puppet] - https://gerrit.wikimedia.org/r/983746 (owner: FNegri)
[15:08:03] (PS1) JMeybohm: Alert on containers being OOM killed frequently [alerts] - https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256)
[15:08:36] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/984220
[15:10:02] !log installing nagios-plugins-contrib bugfix updates from Bookworm point release
[15:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:40] (CR) Filippo Giunchedi: [C: +1] wdqs: Add Accept: header to LDF endpoint check [puppet] - https://gerrit.wikimedia.org/r/984212 (https://phabricator.wikimedia.org/T353672) (owner: Bking)
[15:11:56] (CR) Filippo Giunchedi: [C: +1] Netbox report, reduce alerting spam [puppet] - https://gerrit.wikimedia.org/r/984217 (https://phabricator.wikimedia.org/T321704) (owner: Ayounsi)
[15:12:15] (PS2) Elukey: ml-services: force HTTP in revert-risk agnostic staging [deployment-charts] - https://gerrit.wikimedia.org/r/984215 (https://phabricator.wikimedia.org/T353622)
[15:13:15] (Merged) jenkins-bot: Use link batch in search APIs [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984173 (https://phabricator.wikimedia.org/T353334) (owner: Lucas Werkmeister (WMDE))
[15:13:25] yay
[15:13:41] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:984172|Make SearchEntitiesIntegrationTest an ApiTestCase (T353334)]], [[gerrit:984173|Use link batch in search APIs (T353334)]]
[15:13:48] T353334: Batch page ID lookups in Wikibase entity search APIs - https://phabricator.wikimedia.org/T353334
[15:15:22] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:984172|Make SearchEntitiesIntegrationTest an ApiTestCase (T353334)]], [[gerrit:984173|Use link batch in search APIs (T353334)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:15:25] SRE, Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (MoritzMuehlenhoff)
[15:15:50] testing
[15:15:53] !log installing exim4 bugfix updates from Bookworm point release
[15:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:55] seems to have improved things
[15:16:56] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
[15:20:26] (PS2) Herron: pyrra: reload pyrra-filesystem and thanos-rule on cfg change [puppet] - https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691)
[15:21:52] (CR) Herron: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/962/con" [puppet] - https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: Herron)
[15:22:31] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:984172|Make SearchEntitiesIntegrationTest an ApiTestCase (T353334)]], [[gerrit:984173|Use link batch in search APIs (T353334)]] (duration: 08m 49s)
[15:22:35] T353334: Batch page ID lookups in Wikibase entity search APIs - https://phabricator.wikimedia.org/T353334
[15:22:52] alright, kostajh / Dreamy_Jazz next
[15:22:59] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/984174 (owner: Kosta Harlan)
[15:23:16] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:23:18] Thanks
[15:23:33] (ProbeDown) firing: (3) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:23:35] !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: host is down, downtiming in icinga too
[15:23:51] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: host is down, downtiming in icinga too
[15:24:00] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0cee941c-9871-4463-b392-d45794163f4d) set by taavi@cumin1001 for 30 days, 0:00:00 on 1 hos...
[15:24:42] PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:25:30] (Merged) jenkins-bot: Use main replica DB in importExistingFilesToScanTable.php [extensions/MediaModeration] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/984174 (owner: Kosta Harlan)
[15:25:56] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:984174|Use main replica DB in importExistingFilesToScanTable.php]]
[15:26:29] SRE-tools, Data-Persistence, Infrastructure-Foundations, Patch-For-Review: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (Marostegui)
[15:26:54] (JobUnavailable) firing: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:27:11] (PS2) JMeybohm: Alert on containers being OOM killed frequently [alerts] - https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256)
[15:27:44] !log lucaswerkmeister-wmde@deploy2002 kharlan and lucaswerkmeister-wmde: Backport for [[gerrit:984174|Use main replica DB in importExistingFilesToScanTable.php]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:27:52] nothing to test, or so I’ve heard
[15:27:54] !log lucaswerkmeister-wmde@deploy2002 kharlan and lucaswerkmeister-wmde: Continuing with sync
[15:28:02] jouncebot: next
[15:28:02] In 0 hour(s) and 31 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1600)
[15:28:05] k
[15:28:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:28:33] (JobUnavailable) firing: (2) Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:29:11] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/963/console" [puppet] - https://gerrit.wikimedia.org/r/984209 (https://phabricator.wikimedia.org/T353705) (owner: Elukey)
[15:30:14] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 89, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:31:54] (JobUnavailable) firing: (3) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:32:33] (CR) Volans: [C: +1] "Makes sense to me for now, at least the holidays" [puppet] - https://gerrit.wikimedia.org/r/984217 (https://phabricator.wikimedia.org/T321704) (owner: Ayounsi)
[15:33:19] (PS3) JMeybohm: Alert for containers with memory issues [alerts] - https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256)
[15:33:43] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:984174|Use main replica DB in importExistingFilesToScanTable.php]] (duration: 07m 47s)
[15:34:01] (PS2) Aqu: Airflow metrics configuration adjustement [puppet] - https://gerrit.wikimedia.org/r/984200 (https://phabricator.wikimedia.org/T349532)
[15:34:23] Dreamy_Jazz: should be done, if you want to test now
[15:34:30] (PS2) Lucas Werkmeister (WMDE): Change virtual domain of botpassword to plural [mediawiki-config] - https://gerrit.wikimedia.org/r/983758 (https://phabricator.wikimedia.org/T351559) (owner: Ladsgroup)
[15:34:34] (CR) CI reject: [V: -1] Alert for containers with memory issues [alerts] - https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) (owner: JMeybohm)
[15:34:38] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/983758 (https://phabricator.wikimedia.org/T351559) (owner: Ladsgroup)
[15:35:04] (PS2) Ayounsi: Netbox report, reduce alerting spam [puppet] - https://gerrit.wikimedia.org/r/984217 (https://phabricator.wikimedia.org/T321704)
[15:35:15] (CR) JMeybohm: [C: +1] kubernetes: update IPv6 service IP ranges for ML clusters [puppet] - https://gerrit.wikimedia.org/r/984209 (https://phabricator.wikimedia.org/T353705) (owner: Elukey)
[15:35:25] (Merged) jenkins-bot: Change virtual domain of botpassword to plural [mediawiki-config] - https://gerrit.wikimedia.org/r/983758 (https://phabricator.wikimedia.org/T351559) (owner: Ladsgroup)
[15:35:50] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:983758|Change virtual domain of botpassword to plural (T351559)]]
[15:35:54] T351559: Migrate bot passwords to use a virtual database domain - https://phabricator.wikimedia.org/T351559
[15:36:32] (CR) Ayounsi: [C: +2] Netbox report, reduce alerting spam [puppet] - https://gerrit.wikimedia.org/r/984217 (https://phabricator.wikimedia.org/T321704) (owner: Ayounsi)
[15:37:18] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and ladsgroup: Backport for [[gerrit:983758|Change virtual domain of botpassword to plural (T351559)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:37:20] hm, apparently I caused a brief spike in errors
[15:37:25] (~15 minutes ago)
[15:37:29] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and ladsgroup: Continuing with sync
[15:37:39] I guess the deployment wasn’t as atomic as I thought it would be 😔
[15:37:51] I’ll try to keep that in mind… nothing else to be done about it now afaict
[15:38:17] SRE, Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (MoritzMuehlenhoff)
[15:38:21] (none of the errors seem to be from k8s hosts, which is nice)
[15:38:50] !log installing gnutls28 security updates on bookworm
[15:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:54] PROBLEM - cassandra-c service on restbase2028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:38:56] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704 (ayounsi)
[15:39:04] PROBLEM - cassandra-c CQL 10.192.16.239:9042 on restbase2028 is CRITICAL: connect to address 10.192.16.239 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[15:39:14] PROBLEM - cassandra-c SSL 10.192.16.239:7000 on restbase2028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[15:40:04] PROBLEM - Check systemd state on restbase2028 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:40:24] (PS4) JMeybohm: Alert for containers with memory issues [alerts] - https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256)
[15:41:37] (CR) CI reject: [V: -1] Alert for containers with memory issues [alerts] - https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) (owner: JMeybohm)
[15:41:54] (CR) Elukey: [V: +1 C: +2] kubernetes: update IPv6 service IP ranges for ML clusters [puppet] - https://gerrit.wikimedia.org/r/984209 (https://phabricator.wikimedia.org/T353705) (owner: Elukey)
[15:42:51] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:983758|Change virtual domain of botpassword to plural (T351559)]] (duration: 07m 01s)
[15:42:55] T351559: Migrate bot passwords to use a virtual database domain - https://phabricator.wikimedia.org/T351559
[15:43:15] (CR) JMeybohm: Alert for containers with memory issues (1 comment) [alerts] - https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) (owner: JMeybohm)
[15:43:27] Amir1: done
[15:43:32] * Lucas_WMDE all done
[15:43:40] Thanks for the deploy.
[15:45:16] (03PS1) 10Elukey: Revert "Revert "Set ipv6dualstack for ml-staging-codfw"" [puppet] - 10https://gerrit.wikimedia.org/r/984178 [15:46:02] RECOVERY - Check systemd state on restbase2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:04] (03CR) 10Elukey: [C: 03+2] Revert "Revert "Set ipv6dualstack for ml-staging-codfw"" [puppet] - 10https://gerrit.wikimedia.org/r/984178 (owner: 10Elukey) [15:46:22] RECOVERY - cassandra-c service on restbase2028 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:46:40] RECOVERY - cassandra-c SSL 10.192.16.239:7000 on restbase2028 is OK: SSL OK - Certificate restbase2028-c valid until 2025-12-03 21:33:03 +0000 (expires in 715 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:47:58] RECOVERY - cassandra-c CQL 10.192.16.239:9042 on restbase2028 is OK: TCP OK - 0.031 second response time on 10.192.16.239 port 9042 https://phabricator.wikimedia.org/T93886 [15:48:28] Lucas_WMDE: Thanks! 
[15:48:34] (JobUnavailable) firing: (3) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:48:36] RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:51:54] (JobUnavailable) resolved: (3) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:51:56] (ProbeDown) resolved: (2) Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:17] 10SRE, 10ops-codfw, 10serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (10Volans) Should the icinga alert for host down and related service alerts in icinga and alertmanager be silenced given it's known and there is a task? 
[15:55:35] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 5398 [15:55:38] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw2448.codfw.wmnet with reason: hw failure [15:55:56] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw2448.codfw.wmnet with reason: hw failure [15:55:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 5398 [15:56:03] 10SRE, 10ops-codfw, 10serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c526ca54-768b-461b-9bc7-1666a80b4153) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: hw fail... [15:58:37] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15133 [16:00:04] eoghan, jelto, and arnoldokoth: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1600). 
[16:00:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15133 [16:00:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 139901 [16:02:22] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 139901 [16:04:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 327700 [16:04:39] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 327700 [16:07:42] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:08:58] (03CR) 10AikoChou: [C: 03+2] ml-services: update outlink topic model image on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/984203 (https://phabricator.wikimedia.org/T352834) (owner: 10AikoChou) [16:09:55] (03Merged) 10jenkins-bot: ml-services: update outlink topic model image on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/984203 (https://phabricator.wikimedia.org/T352834) (owner: 10AikoChou) [16:12:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on moss-be[2001-2003].codfw.wmnet with reason: not in service, being used to test a destructive cookbook [16:12:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on moss-be[2001-2003].codfw.wmnet with reason: not in service, being used to test a destructive cookbook [16:15:48] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[16:15:50] (03PS5) 10JMeybohm: Alert for containers with memory issues [alerts] - 10https://gerrit.wikimedia.org/r/984219 (https://phabricator.wikimedia.org/T256256) [16:18:31] (03CR) 10Dzahn: [V: 03+1 C: 03+2] query_service: force TLS for monitoring for search-platform team [puppet] - 10https://gerrit.wikimedia.org/r/983491 (https://phabricator.wikimedia.org/T333510) (owner: 10Dzahn) [16:18:46] 10SRE, 10ops-eqiad, 10observability: InterfaceSpeedError - https://phabricator.wikimedia.org/T351862 (10Volans) [16:19:59] (03CR) 10Filippo Giunchedi: "I'm fairly sure this will result in a race between pyrra-filesystem and thanos-rule as you pointed out in https://phabricator.wikimedia.or" [puppet] - 10https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: 10Herron) [16:23:24] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [16:24:51] (03CR) 10Dzahn: "@Aklapper Everything that talks about "sshd" config seems to imply it only affects the git-ssh.wikimedia.org part but if it affects Diffus" [puppet] - 10https://gerrit.wikimedia.org/r/983958 (owner: 10Dzahn) [16:24:53] (03PS1) 10Muehlenhoff: graphite::production: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984248 [16:25:43] (03CR) 10Dzahn: "another one, probably the "ssh user" part can go but the vcs DB needs to stay" [puppet] - 10https://gerrit.wikimedia.org/r/983959 (owner: 10Dzahn) [16:27:30] (03CR) 10Dzahn: "soo.. the ssh part of vcs is baked into it in several places, like the ssh user, the extra IP, the sshd config options.. BUT as Brennen sa" [puppet] - 10https://gerrit.wikimedia.org/r/983957 (owner: 10Dzahn) [16:28:30] (03CR) 10Dzahn: "Is it right that we enable it ONLY in eqiad and not on role level? That seems like a thing we have to remember if we ever switch. 
Was ther" [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [16:28:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984248 (owner: 10Muehlenhoff) [16:29:13] (03PS2) 10Elukey: admin_ng: set new Istio Service Entry for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/984214 (https://phabricator.wikimedia.org/T353622) [16:29:15] (03PS3) 10Elukey: ml-services: force HTTP in revert-risk agnostic staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/984215 (https://phabricator.wikimedia.org/T353622) [16:29:17] (03PS1) 10Elukey: admin_ng: force coredns to resolve to A records in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/984250 (https://phabricator.wikimedia.org/T353622) [16:30:45] (03PS1) 10Muehlenhoff: aptrepo::staging: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984251 [16:30:53] (03PS2) 10Muehlenhoff: aptrepo::staging: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984251 [16:31:18] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [16:34:36] (03CR) 10Brennen Bearnes: [C: 04-1] phabricator: remove vcs support, pt1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/983957 (owner: 10Dzahn) [16:34:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984251 (owner: 10Muehlenhoff) [16:36:03] (03CR) 10Dzahn: [C: 03+2] peopleweb: Configure tlsproxy::envoy::firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/984160 (owner: 10Muehlenhoff) [16:37:31] (03PS1) 10Muehlenhoff: 
an-web: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984252 [16:38:15] (03PS2) 10Muehlenhoff: an-web: Switch rsync service to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/984252 [16:39:56] (03CR) 10Dzahn: "alright, yea, so then it seems to me I can't remove the "enable vcs" parameter entirely but the part where it adds LVS and the extra IPs c" [puppet] - 10https://gerrit.wikimedia.org/r/983957 (owner: 10Dzahn) [16:41:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984252 (owner: 10Muehlenhoff) [16:44:15] (03CR) 10Ilias Sarantopoulos: [C: 03+1] admin_ng: set new Istio Service Entry for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/984214 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [16:44:44] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/984161/966/people1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/984161 (owner: 10Muehlenhoff) [16:45:24] (03CR) 10Dzahn: [V: 03+1 C: 03+1] Switch peopleweb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/984161 (owner: 10Muehlenhoff) [16:47:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:49:18] 10SRE, 10ops-eqiad, 10DC-Ops: Audit of WMCS Servers Using Single & Dual Switchports - https://phabricator.wikimedia.org/T349756 (10VRiley-WMF) This has been completed, and I pulled all the cables that were not in use. Also, renamed cloudswift1001 and cloudswift1002 to cloudlb1001 and cloudlb1002 respectively. 
[16:49:31] (03PS3) 10Hnowlan: changeprop-jobqueue: move PublishStashedFile back to non-k8s jobrunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/983216 (https://phabricator.wikimedia.org/T349796) [16:49:38] 10SRE, 10ops-eqiad, 10DC-Ops: Audit of WMCS Servers Using Single & Dual Switchports - https://phabricator.wikimedia.org/T349756 (10VRiley-WMF) 05Open→03Resolved [16:52:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:53:10] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:56:12] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (10Dreamy_Jazz) [16:57:24] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:00:04] jhathaway and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1700). 
nyaa~ [17:00:04] dwisehaupt: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:29] i am here. [17:02:40] dwisehaupt: oops, we both have meetings right now, really sorry about that -- can I come back to you in 30 minutes? [17:03:17] sure. no problem. no rush on this. [17:04:57] (03CR) 10Stevemunene: [C: 03+2] Airflow metrics configuration adjustement [puppet] - 10https://gerrit.wikimedia.org/r/984200 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [17:26:46] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (10Dreamy_Jazz) [17:29:38] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for WBrown (WMF) - https://phabricator.wikimedia.org/T353735 (10Dreamy_Jazz) I have updated the request to be for the `deployment` group and added extra rationale in the task description. [17:30:12] (03CR) 10Dzahn: [V: 04-1 C: 04-1] "This isn't right, it would disable the entire sysuser vcs, in addition to ssh-related stuff." [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [17:31:18] dwisehaupt: hi! do you need the public and private commits merged in a specific order? [17:31:36] i'm assuming we need the private first since the public relies on it. [17:31:56] 👍 [17:31:58] although if they are done close to each other the chance of the puppet run failing is small. [17:32:17] that was my read too but I wanted to make sure there weren't other moving parts somewhere [17:32:35] cool thanks. no other restrictions i'm aware of at this point. [17:33:07] (03CR) 10Dzahn: [V: 04-1 C: 04-1] "The reason this was done on DC level is that _as long as we set the extra IP for sshd_ we had to avoid setting that same IP on 2 hosts. 
Te" [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [17:35:32] (03CR) 10Bking: [C: 03+2] wdqs: Add Accept: header to LDF endpoint check [puppet] - 10https://gerrit.wikimedia.org/r/984212 (https://phabricator.wikimedia.org/T353672) (owner: 10Bking) [17:36:30] (03PS2) 10Dzahn: phabricator: remove enable_vcs parameter set in eqiad-only [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) [17:36:59] okay, secrets are merged in private hieradata, moving on to the public patch [17:37:38] (03CR) 10RLazarus: [C: 03+2] Install community_civicrm on crm role [puppet] - 10https://gerrit.wikimedia.org/r/982914 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [17:38:10] dwisehaupt: puppet-merge is running -- when it's done, do you need me to run puppet anywhere, or are you good for that? [17:38:27] i'm good to do the puppet run. *crosses fingers* [17:38:40] rad, stand by [17:38:44] and done, go ahead [17:39:05] let me know if you need followups or reverts or anything, I'm still around [17:39:13] thanks for your patience earlier :) [17:39:19] ok. starting the run. [17:40:41] (03PS3) 10Dzahn: phabricator: move enable_vcs parameter from eqiad-only to role [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) [17:42:02] (03CR) 10Brennen Bearnes: [C: 03+1] phabricator: move enable_vcs parameter from eqiad-only to role [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [17:43:51] ok. initial puppet runs are clean. looking good so far. [17:44:11] (03CR) 10Dzahn: "this enables VCS in codfw: https://puppet-compiler.wmflabs.org/output/983955/968/phab2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/983955 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [17:45:02] i will have a follow on question, probably for thursday's window. 
after i do some testing with ssh port forwarding, we will want to put this behind the cdn. [17:45:41] i know that will require a dns update and an update to hieradata/role/common/cache/text.yaml [17:46:09] just wondering if you know (or know where to point me) for how to set up the necessary connections. [17:47:14] I don't offhand, but folks in #wikimedia-traffic should be able to get you going -- ping me if you have any trouble getting in touch with someone [17:49:01] cool thanks. all the runs and deploy bits look good. i'll restore the db from testing and start working on the next steps [17:49:06] thanks for your help! [17:49:08] PROBLEM - Check systemd state on crm2001 is CRITICAL: CRITICAL - degraded: The following units failed: community_civicrm-cv-job-run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:09] great! glhf [17:49:14] ha. [17:49:21] of course that alerts as i say it. [17:49:32] i'll ack that until the setup is complete. :) [17:50:01] i'll also look at updating the crm alerts to go to fr-tech-ops [17:50:38] RECOVERY - Check systemd state on crm2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:10] (03PS2) 10Dzahn: phabricator: remove support for separate git-ssh IP behind LVS [puppet] - 10https://gerrit.wikimedia.org/r/983957 (https://phabricator.wikimedia.org/T296022) [17:51:33] (03PS3) 10Kosta Harlan: WIP: Add MediaModeration module and support for running updateMetrics [puppet] - 10https://gerrit.wikimedia.org/r/984196 [17:51:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bullseye [17:53:20] (03PS3) 10Dzahn: phabricator: remove support for separate git-ssh IP behind LVS [puppet] - 10https://gerrit.wikimedia.org/r/983957 (https://phabricator.wikimedia.org/T296022) [17:55:36] dwisehaupt: take a look at hieradata/common/profile/trafficserver/backend.yaml that's where those are 
configured. there are a bunch of "target" and "replacement" lines. here is an example: [17:55:49] - type: map [17:55:50] target: http://people.wikimedia.org [17:55:50] replacement: https://peopleweb.discovery.wmnet [17:56:49] mutante: thanks! [17:56:52] so you would first get a foo.discovery.wmnet record, point that to your machine and then add one of those ^. [17:58:38] I saw you already set that "cache: pass" option in the other place. so CDN won't do caching and that should be it on the traffic side [17:59:11] (03CR) 10Bking: [C: 03+2] wdqs: remove unused CNAME [dns] - 10https://gerrit.wikimedia.org/r/983725 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [17:59:18] (03PS2) 10Bking: wdqs: remove unused CNAME [dns] - 10https://gerrit.wikimedia.org/r/983725 (https://phabricator.wikimedia.org/T352111) [17:59:25] (03CR) 10Bking: [V: 03+2] wdqs: remove unused CNAME [dns] - 10https://gerrit.wikimedia.org/r/983725 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [17:59:44] ok. i'll follow up on this after testing things out. thanks again for the pointers through this. [18:00:08] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1800) [18:02:29] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "no diff in compiler: https://puppet-compiler.wmflabs.org/output/983957/969/" [puppet] - 10https://gerrit.wikimedia.org/r/983957 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [18:04:20] (03PS1) 10BryanDavis: dynamicproxy: Send interest-based advertising opt-out header [puppet] - 10https://gerrit.wikimedia.org/r/984255 (https://phabricator.wikimedia.org/T353589) [18:06:30] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host testhost2001.codfw.wmnet with OS bullseye [18:07:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host testhost2001.codfw.wmnet with OS bullseye [18:17:32] (03CR) 10Majavah: [C: 03+2] "Ugh. 
I don't like how Google gets to come up with new and innovative ways to spy on people and we have to explicitely opt-out from them, b" [puppet] - 10https://gerrit.wikimedia.org/r/984255 (https://phabricator.wikimedia.org/T353589) (owner: 10BryanDavis) [18:22:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testhost2001.codfw.wmnet with reason: host reimage [18:25:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testhost2001.codfw.wmnet with reason: host reimage [18:27:49] (03PS1) 10Majavah: dynamicproxy: tweak security header handling [puppet] - 10https://gerrit.wikimedia.org/r/984258 (https://phabricator.wikimedia.org/T353589) [18:28:41] (03PS2) 10Dzahn: phabricator: remove sshd config for git-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/983958 (https://phabricator.wikimedia.org/T296022) [18:28:58] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@d275e4f]: Deploy latest DAG changes to Analytics Airflow instance [18:29:29] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@d275e4f]: Deploy latest DAG changes to Analytics Airflow instance (duration: 00m 31s) [18:29:44] !log mforns@deploy2002 Started deploy [analytics/refinery@28dccef]: Regular analytics weekly train [analytics/refinery@28dccefe] [18:30:29] 10SRE, 10ops-codfw: Degraded RAID on testhost2001 - https://phabricator.wikimedia.org/T353743 (10ops-monitoring-bot) [18:32:09] (03PS3) 10Dzahn: phabricator: remove sshd config for git-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/983958 (https://phabricator.wikimedia.org/T296022) [18:32:46] (03Abandoned) 10Dzahn: phabricator: remove vcs support, pt3 [puppet] - 10https://gerrit.wikimedia.org/r/983959 (owner: 10Dzahn) [18:35:17] (03PS4) 10Dzahn: phabricator: remove sshd config for git-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/983958 (https://phabricator.wikimedia.org/T296022) [18:38:05] (03PS5) 10Dzahn: phabricator: remove sshd 
config for git-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/983958 (https://phabricator.wikimedia.org/T296022) [18:38:52] (03PS6) 10Dzahn: phabricator: remove sshd config for git-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/983958 (https://phabricator.wikimedia.org/T296022) [18:39:02] !log mforns@deploy2002 Finished deploy [analytics/refinery@28dccef]: Regular analytics weekly train [analytics/refinery@28dccefe] (duration: 09m 18s) [18:39:23] !log mforns@deploy2002 Started deploy [analytics/refinery@28dccef] (thin): Regular analytics weekly train THIN [analytics/refinery@28dccefe] [18:39:29] !log mforns@deploy2002 Finished deploy [analytics/refinery@28dccef] (thin): Regular analytics weekly train THIN [analytics/refinery@28dccefe] (duration: 00m 06s) [18:39:44] !log mforns@deploy2002 Started deploy [analytics/refinery@28dccef] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@28dccefe] [18:40:17] (03PS7) 10Dzahn: phabricator: remove sshd config for git-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/983958 (https://phabricator.wikimedia.org/T296022) [18:40:54] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "noop in compiler: https://puppet-compiler.wmflabs.org/output/983958/970/" [puppet] - 10https://gerrit.wikimedia.org/r/983958 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [18:43:01] !log mforns@deploy2002 Finished deploy [analytics/refinery@28dccef] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@28dccefe] (duration: 03m 16s) [18:43:51] (03CR) 10Brennen Bearnes: [C: 03+1] phabricator: remove support for separate git-ssh IP behind LVS [puppet] - 10https://gerrit.wikimedia.org/r/983957 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [18:43:59] (03CR) 10Brennen Bearnes: [C: 03+1] phabricator: remove sshd config for git-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/983958 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [18:44:52] !log pt1979@cumin2002 
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:48:55] !log mforns@deploy2002 Started deploy [analytics/refinery@28dccef] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@28dccefe] [18:49:01] !log mforns@deploy2002 Finished deploy [analytics/refinery@28dccef] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@28dccefe] (duration: 00m 05s) [18:56:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:56:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testhost2001.codfw.wmnet with OS bullseye [18:56:31] (03PS1) 10Cwhite: udp2log: tag logrotated mwlogs with yesterdays date [puppet] - 10https://gerrit.wikimedia.org/r/984228 (https://phabricator.wikimedia.org/T353221) [18:56:35] (03PS1) 10Ahmon Dancy: logspam: Consolidate Actor name can not be empty for 0 and... [puppet] - 10https://gerrit.wikimedia.org/r/984259 (https://phabricator.wikimedia.org/T307738) [18:59:33] 10SRE, 10ops-codfw: Degraded RAID on testhost2001 - https://phabricator.wikimedia.org/T353743 (10Papaul) 05Open→03Resolved a:03Papaul This was a false alert it is a new server that was half way installed. I just finished the install now so resolving this task for now. [18:59:47] (03CR) 10Brennen Bearnes: [C: 03+1] logspam: Consolidate Actor name can not be empty for 0 and... [puppet] - 10https://gerrit.wikimedia.org/r/984259 (https://phabricator.wikimedia.org/T307738) (owner: 10Ahmon Dancy) [19:00:05] dancy and brennen: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1900). Please do the needful. 
[19:00:09] o/ [19:00:12] o/ [19:02:18] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984261 (https://phabricator.wikimedia.org/T350086) [19:02:20] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984261 (https://phabricator.wikimedia.org/T350086) (owner: 10TrainBranchBot) [19:03:26] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984261 (https://phabricator.wikimedia.org/T350086) (owner: 10TrainBranchBot) [19:10:28] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.10 refs T350086 [19:10:46] T350086: 1.42.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T350086 [19:45:00] 10ops-codfw, 10DC-Ops, 10cloud-services-team: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746 (10Andrew) [19:52:28] (03PS1) 10Kosta Harlan: MediaModeration: Add dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984269 (https://phabricator.wikimedia.org/T353703) [19:52:51] (03PS4) 10Kosta Harlan: WIP: Add MediaModeration module and support for running updateMetrics [puppet] - 10https://gerrit.wikimedia.org/r/984196 (https://phabricator.wikimedia.org/T353703) [19:53:36] (03PS5) 10Kosta Harlan: Add MediaModeration module and support for running updateMetrics [puppet] - 10https://gerrit.wikimedia.org/r/984196 (https://phabricator.wikimedia.org/T353703) [19:59:04] Who might be able to review / deploy a patch to operations/puppet https://gerrit.wikimedia.org/r/c/operations/puppet/+/984196? 
[20:01:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [20:08:35] kostajh: I'll have a look at yours in a moment, and for the future https://wikitech.wikimedia.org/wiki/Puppet_request_window is a good way to find reviewers for small patches like that [20:08:48] aha, forgot about that window. Thanks. [20:11:45] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:11:46] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [20:12:06] (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: remove support for separate git-ssh IP behind LVS [puppet] - 10https://gerrit.wikimedia.org/r/983957 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [20:13:35] (03PS3) 10Herron: pyrra: reload pyrra-filesystem and thanos-rule on cfg change [puppet] - 10https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) [20:15:13] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/983957 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [20:15:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:16:07] (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator: remove sshd config for git-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/983958 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [20:16:20] PROBLEM - mailman list info 
on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:16:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:16:46] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [20:16:52] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:17:10] (03PS4) 10Herron: pyrra: reload pyrra-filesystem and thanos-rule on cfg change [puppet] - 10https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) [20:17:40] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.121 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:19:18] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/972/con" [puppet] - 10https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: 10Herron) [20:19:44] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/983958 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [20:23:29] (03CR) 10Herron: [V: 03+1] pyrra: reload pyrra-filesystem and thanos-rule on cfg change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/984220 (https://phabricator.wikimedia.org/T353691) (owner: 10Herron) [20:24:17] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/973/con" [puppet] - 10https://gerrit.wikimedia.org/r/984196 (https://phabricator.wikimedia.org/T353703) (owner: 10Kosta Harlan) [20:26:16] kostajh: the puppet code looks good, but the script does not seem to be present in wmf.9? [20:26:32] it's in wmf.10 [20:26:48] and wmf.10 has not yet been deployed everywhere [20:26:58] right. the dblist is only testwiki, for now. [20:27:01] is that ok? [20:28:03] that dblist file does not exist at all? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/dblists/ [20:28:33] not yet. linked via the depends-on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/984269 [20:28:41] and deploying that in ~30 minutes or so. [20:31:11] ah, missed that. it needs to be deployed before the puppet patch can be merged [20:33:18] ack [20:38:36] (03PS1) 10Andrew Bogott: wmcs_backup_volumes: exclude wikiwho volumes from backups [puppet] - 10https://gerrit.wikimedia.org/r/984276 [20:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [20:42:24] (03CR) 10Andrew Bogott: [C: 03+2] wmcs_backup_volumes: exclude wikiwho volumes from backups [puppet] - 10https://gerrit.wikimedia.org/r/984276 (owner: 10Andrew Bogott) [20:46:25] (03CR) 10BryanDavis: [C: 03+1] "Untested, but it seems like a reasonable change" [puppet] - 10https://gerrit.wikimedia.org/r/984258 (https://phabricator.wikimedia.org/T353589) (owner: 10Majavah) [20:46:45] (03CR) 10Majavah: [C: 03+2] dynamicproxy: tweak security header handling [puppet] - 10https://gerrit.wikimedia.org/r/984258 (https://phabricator.wikimedia.org/T353589) (owner: 10Majavah) [20:53:43] (03PS1) 10Ladsgroup: Disable listings extension in more wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/984277 (https://phabricator.wikimedia.org/T253216) [20:55:30] (03PS2) 10Ladsgroup: Disable listings extension in more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984277 (https://phabricator.wikimedia.org/T253216) [20:58:16] jouncebot: nowandnext [20:58:16] For the next 0 hour(s) and 1 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1900) [20:58:16] In 0 hour(s) and 1 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T2100) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T2100). [21:00:05] danisztls and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:12] hi [21:01:04] o/ [21:01:08] hi [21:02:27] if no one else is around, I can deploy [21:03:57] danisztls: is it typical to remove the entry for a wiki entirely from wmgUseQuickSurveys ? [21:04:40] put another way, is quicksurveys doing anything else on metawiki? or is it safe to disable? 
(setting "enabled: false" on that survey would be another option, afaict) [21:05:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984269 (https://phabricator.wikimedia.org/T353703) (owner: 10Kosta Harlan) [21:05:56] kostajh: let me check [21:06:17] (03Merged) 10jenkins-bot: MediaModeration: Add dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984269 (https://phabricator.wikimedia.org/T353703) (owner: 10Kosta Harlan) [21:06:30] kostajh: sure, in this case there were no prior surveys on meta [21:06:47] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:984269|MediaModeration: Add dblist (T353703)]] [21:07:06] T353703: Implement daily runs of updateMetrics on WMF wikis - https://phabricator.wikimedia.org/T353703 [21:07:16] danisztls: ok [21:08:16] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:984269|MediaModeration: Add dblist (T353703)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:27] danisztls: I'm asking because there are entries in wmgUseQuickSurveys (e.g. commonswiki) that don't have corresponding config in wgQuickSurveysConfig [21:08:52] !log kharlan@deploy2002 kharlan: Continuing with sync [21:09:20] but it seems safe enough [21:11:13] kostajh: yeah there are a few dangling configs from previous surveys [21:12:09] but typically I remove it after the survey is done [21:12:19] ok [21:12:26] waiting for the sync of the previous patch to finish [21:12:56] ok, thanks [21:14:01] it will not require testing [21:14:32] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:984269|MediaModeration: Add dblist (T353703)]] (duration: 07m 44s) [21:14:37] T353703: Implement daily runs of updateMetrics on WMF wikis - https://phabricator.wikimedia.org/T353703 [21:14:40] (03CR) 10Kosta Harlan: "I8dc02f7f5dff17b50d51ce7882fa13f4481cc67a is now deployed."
[puppet] - 10https://gerrit.wikimedia.org/r/984196 (https://phabricator.wikimedia.org/T353703) (owner: 10Kosta Harlan) [21:15:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983962 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:15:44] (03Merged) 10jenkins-bot: Undeploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983962 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:15:46] danisztls: as I'm unfamiliar with this, it would be nice if you could verify that it's undeployed, no errors occur when going through the steps where you'd normally see the survey, etc [21:16:08] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:983962|Undeploy Annual Plan Core Metrics survey (T351353)]] [21:16:12] T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353 [21:16:30] kostajh: ok [21:17:34] !log kharlan@deploy2002 kharlan and dani: Backport for [[gerrit:983962|Undeploy Annual Plan Core Metrics survey (T351353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:17:52] danisztls: ok, please have a look on mwdebug2001 or 2002 [21:18:44] danisztls: should I see the report on https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Reports/Core_Metrics_Q1 without this patch? doesn't seem to show up [21:19:09] kostajh: no you shouldn't because the page is blank :) [21:19:24] ah [21:20:00] the survey was deployed to a blank page so that it would be working when the report was published, but the survey was abandoned in favor of using the discussion page to get feedback instead [21:20:25] so it is difficult to test but I'm not seeing errors or things out of the ordinary [21:20:27] danisztls: I'll move ahead with syncing this, ok?
[21:20:32] kostajh: ok [21:20:35] yeah, logs look fine [21:20:38] !log kharlan@deploy2002 kharlan and dani: Continuing with sync [21:23:33] (03CR) 10Herron: verlib2: initial packaging (034 comments) [debs/python-verlib2] - 10https://gerrit.wikimedia.org/r/983468 (owner: 10Herron) [21:24:06] (03CR) 10Herron: [C: 04-2] "moving this to gitlab" [debs/python-verlib2] - 10https://gerrit.wikimedia.org/r/983468 (owner: 10Herron) [21:25:27] kostajh: thanks! [21:26:09] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:983962|Undeploy Annual Plan Core Metrics survey (T351353)]] (duration: 10m 00s) [21:26:12] !log UTC late deploys done [21:26:13] T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353 [21:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:25] danisztls: no problem! have a nice one [21:29:49] (03PS3) 10Ladsgroup: Disable listings extension in more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984277 (https://phabricator.wikimedia.org/T253216) [21:29:51] (03CR) 10Ladsgroup: [C: 03+2] Disable listings extension in more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984277 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [21:30:32] (03PS1) 10DDesouza: Reorganize QuickSurveys config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984286 [21:30:49] (03PS1) 10DDesouza: Undeploy Annual Plan Core Metrics beta survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984287 (https://phabricator.wikimedia.org/T351353) [21:30:52] (03Merged) 10jenkins-bot: Disable listings extension in more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984277 (https://phabricator.wikimedia.org/T253216) (owner: 10Ladsgroup) [21:31:27] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:984277|Disable listings extension in more wikis (T253216)]] [21:31:32] T253216: Undeploy Extension:Listings from
Wikimedia Production - https://phabricator.wikimedia.org/T253216 [21:32:53] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:984277|Disable listings extension in more wikis (T253216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:33:34] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [21:39:10] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:984277|Disable listings extension in more wikis (T253216)]] (duration: 07m 42s) [21:39:15] T253216: Undeploy Extension:Listings from Wikimedia Production - https://phabricator.wikimedia.org/T253216 [21:42:59] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@d5ac513]: (no justification provided) [21:43:26] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@d5ac513]: (no justification provided) (duration: 00m 27s) [21:43:47] !log mforns@deploy2002 Started deploy [airflow-dags/wmde@d5ac513]: (no justification provided) [21:43:58] !log mforns@deploy2002 Finished deploy [airflow-dags/wmde@d5ac513]: (no justification provided) (duration: 00m 11s) [21:50:48] (03PS1) 10Houseblaster: InitialiseSettings.php: Allow thanking bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/984288 (https://phabricator.wikimedia.org/T341388) [21:57:00] 10SRE, 10Bitu, 10Infrastructure-Foundations: Automatic detection of inactive LDAP account - https://phabricator.wikimedia.org/T335478 (10bd808) > Define what it means for an account to be inactive. What is the thinking that leads to an assumption that "inactive" is actually a valid state for a Developer acc... 
[22:12:27] (03PS1) 10Ryan Kemper: wdqs: bring wdqs10[17-21] online [puppet] - 10https://gerrit.wikimedia.org/r/984289 (https://phabricator.wikimedia.org/T351671) [22:13:29] (03PS2) 10Ryan Kemper: wdqs: bring wdqs10[17-21] online [puppet] - 10https://gerrit.wikimedia.org/r/984289 (https://phabricator.wikimedia.org/T351671) [22:13:37] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/984289 (https://phabricator.wikimedia.org/T351671) (owner: 10Ryan Kemper) [22:14:36] (03CR) 10Bking: [C: 03+1] wdqs: bring wdqs10[17-21] online [puppet] - 10https://gerrit.wikimedia.org/r/984289 (https://phabricator.wikimedia.org/T351671) (owner: 10Ryan Kemper) [22:19:19] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: bring wdqs10[17-21] online [puppet] - 10https://gerrit.wikimedia.org/r/984289 (https://phabricator.wikimedia.org/T351671) (owner: 10Ryan Kemper) [22:21:47] (03CR) 10Bking: [C: 03+2] wdqs: Monitor LDF endpoint (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [22:21:52] (03CR) 10Dreamy Jazz: [C: 03+1] Add MediaModeration module and support for running updateMetrics [puppet] - 10https://gerrit.wikimedia.org/r/984196 (https://phabricator.wikimedia.org/T353703) (owner: 10Kosta Harlan) [22:24:28] (SystemdUnitFailed) firing: (8) prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:26:14] (SystemdUnitFailed) firing: (8) prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:26:23] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime 
for 1:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671 [22:26:28] T351671: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 [22:26:41] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671 [22:53:47] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:54:56] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:55:14] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [23:03:33] (JobUnavailable) firing: Reduced availability for job jmx_query_service_streaming_updater in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:20:00] 10SRE, 10Bitu, 10Infrastructure-Foundations: Automatic detection of inactive LDAP account - https://phabricator.wikimedia.org/T335478 (10bking) Forgive the drive-by comment, but I happened to see this one scroll by in IRC and I wanted to share my (possibly relevant?) experience fighting compromised accounts... [23:26:39] PROBLEM - WDQS SPARQL on wdqs1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:26:43] PROBLEM - WDQS SPARQL on wdqs1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:53:56] (03CR) 10Dzahn: [C: 03+2] logspam: Consolidate Actor name can not be empty for 0 and... [puppet] - 10https://gerrit.wikimedia.org/r/984259 (https://phabricator.wikimedia.org/T307738) (owner: 10Ahmon Dancy)