[00:06:59] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:11:59] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[00:16:00] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:16:04] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:18:36] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:41] (PS2) Dreamy Jazz: Enable display of Client Hints data on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110)
[00:24:21] (CR) CI reject: [V: -1] Enable display of Client Hints data on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) (owner: Dreamy Jazz)
[00:25:00] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:25:04] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:30:32] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:48] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/963985
[00:38:54] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/963985 (owner: TrainBranchBot)
[00:42:52] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:02] (PS1) MusikAnimal: Enable UrlShortenerEnableQrCode on Beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/964611 (https://phabricator.wikimedia.org/T348487)
[00:43:46] (CR) CI reject: [V: -1] Enable UrlShortenerEnableQrCode on Beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/964611 (https://phabricator.wikimedia.org/T348487) (owner: MusikAnimal)
[00:44:42] (PS2) MusikAnimal: Enable UrlShortenerEnableQrCode on Beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/964611 (https://phabricator.wikimedia.org/T348487)
[00:45:56] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:16] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/963985 (owner: TrainBranchBot)
[01:03:50] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T348488 (phaultfinder)
[01:54:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:59:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T0200)
[02:07:20] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.30 [core] (wmf/1.41.0-wmf.30) - https://gerrit.wikimedia.org/r/964626 (https://phabricator.wikimedia.org/T347081)
[02:07:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.30 [core] (wmf/1.41.0-wmf.30) - https://gerrit.wikimedia.org/r/964626 (https://phabricator.wikimedia.org/T347081) (owner: TrainBranchBot)
[02:22:38] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.30 [core] (wmf/1.41.0-wmf.30) - https://gerrit.wikimedia.org/r/964626 (https://phabricator.wikimedia.org/T347081) (owner: TrainBranchBot)
[02:37:21] SRE, SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (Frostly) Pinging @SLyngshede-WMF as clinic duty
[02:38:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:13] (CR) Samwilson: [C: +1] Enable UrlShortenerEnableQrCode on Beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/964611 (https://phabricator.wikimedia.org/T348487) (owner: MusikAnimal)
[03:00:06] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T0300)
[03:01:26] (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/964620 (https://phabricator.wikimedia.org/T347081)
[03:01:28] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/964620 (https://phabricator.wikimedia.org/T347081) (owner: TrainBranchBot)
[03:02:12] (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/964620 (https://phabricator.wikimedia.org/T347081) (owner: TrainBranchBot)
[03:02:42] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.30 refs T347081
[03:02:46] T347081: 1.41.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T347081
[03:03:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:24:34] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:27:28] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:27:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:27:50] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:29:00] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:29:01] (NodeTextfileStale) firing: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:29:04] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:29:22] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:32:08] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:52:38] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.30 refs T347081 (duration: 49m 56s)
[03:52:42] T347081: 1.41.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T347081
[03:54:48] !log mwpresync@deploy2002 Pruned MediaWiki: 1.41.0-wmf.28 (duration: 02m 08s)
[04:28:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[04:55:33] SRE-Access-Requests: Deployment access for Search Platform SWE on Flink WDQS and Search pipelines - https://phabricator.wikimedia.org/T347560 (Aklapper)
[05:13:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:14:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:18:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:30:51] (CR) Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services (4 comments) [alerts] - https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[05:33:16] (CR) Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services (2 comments) [alerts] - https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[05:39:15] (PS1) Ilias Sarantopoulos: ml-services: enable base CORS headers policy for articlequality [deployment-charts] - https://gerrit.wikimedia.org/r/964625
[05:40:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[05:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T0600)
[06:00:05] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T0600).
[06:42:38] !log installing qemu security updates on bookworm
[06:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:06] (CR) Filippo Giunchedi: [C: +1] profile::prometheus::k8s: add k8s-pods-kserve config [puppet] - https://gerrit.wikimedia.org/r/964551 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[06:54:52] (PS1) Muehlenhoff: aptrepo::rsync: Don't setup rsync for empty list of secondary servers [puppet] - https://gerrit.wikimedia.org/r/964844 (https://phabricator.wikimedia.org/T331613)
[06:57:25] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/964844 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[06:57:59] (PS2) Ilias Sarantopoulos: ml-services: enable base CORS headers policy for articlequality [deployment-charts] - https://gerrit.wikimedia.org/r/964625
[06:58:40] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 222 probes of 710 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:00:05] Amir1, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T0700).
[07:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:21] hi
[07:04:02] (CR) Elukey: "Should be ready for a review, lemme know :)" [alerts] - https://gerrit.wikimedia.org/r/964534 (owner: Elukey)
[07:04:22] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:05:10] I will deploy my patch
[07:06:51] (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/964575 (owner: Kosta Harlan)
[07:07:32] (Merged) jenkins-bot: ReportIncident: Set developer mode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/964575 (owner: Kosta Harlan)
[07:08:20] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:964575|ReportIncident: Set developer mode to false]]
[07:09:38] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 57 probes of 710 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:09:42] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:964575|ReportIncident: Set developer mode to false]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:10:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[07:12:53] !log kharlan@deploy2002 kharlan: Continuing with sync
[07:18:38] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:964575|ReportIncident: Set developer mode to false]] (duration: 10m 17s)
[07:19:24] !log UTC morning deploys done
[07:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T343198)', diff saved to https://phabricator.wikimedia.org/P52875 and previous config saved to /var/cache/conftool/dbconfig/20231010-072327-arnaudb.json
[07:23:32] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[07:27:28] (PS1) KartikMistry: Update cxserver to 2023-10-05-093231-production [deployment-charts] - https://gerrit.wikimedia.org/r/964846 (https://phabricator.wikimedia.org/T344982)
[07:29:16] (NodeTextfileStale) firing: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:34:41] I am delaying the MediaWiki train by roughly half an hour due to a schedule conflict. So instead of 8:00 UTC sharp, I will start it at 8:30 UTC (10:30 CEST)
[07:35:37] (PS1) Elukey: services: upgrade Docker image for eventstreams services [deployment-charts] - https://gerrit.wikimedia.org/r/964848 (https://phabricator.wikimedia.org/T343511)
[07:38:04] (PS2) Elukey: services: upgrade Docker image for eventstreams services [deployment-charts] - https://gerrit.wikimedia.org/r/964848 (https://phabricator.wikimedia.org/T343511)
[07:38:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P52876 and previous config saved to /var/cache/conftool/dbconfig/20231010-073834-arnaudb.json
[07:45:07] (PS1) Slyngshede: C:prometheus::node_dpkg_success [puppet] - https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764)
[07:45:42] (PS1) Kevin Bazira: ml-services: enable the uwsgi master process for rec-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/964629 (https://phabricator.wikimedia.org/T347475)
[07:46:15] SRE, Infrastructure-Foundations: Further enhancements for nftables support in profile::firewall - https://phabricator.wikimedia.org/T348498 (MoritzMuehlenhoff)
[07:47:01] (CR) CI reject: [V: -1] C:prometheus::node_dpkg_success [puppet] - https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: Slyngshede)
[07:47:42] SRE, Infrastructure-Foundations: Monitoring check for nftables - https://phabricator.wikimedia.org/T348499 (MoritzMuehlenhoff)
[07:48:45] (CR) Elukey: [C: +1] "We can definitely test it, I am a little bit worried about cpu usage, but I think that leaving 2 cpus if fine for the moment, we can refin" [deployment-charts] - https://gerrit.wikimedia.org/r/964629 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[07:51:20] (PS2) Slyngshede: C:prometheus::node_dpkg_success [puppet] - https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764)
[07:53:17] (CR) CI reject: [V: -1] C:prometheus::node_dpkg_success [puppet] - https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: Slyngshede)
[07:53:30] (CR) Kevin Bazira: [C: +2] ml-services: enable the uwsgi master process for rec-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/964629 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[07:53:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P52877 and previous config saved to /var/cache/conftool/dbconfig/20231010-075340-arnaudb.json
[07:54:26] (Merged) jenkins-bot: ml-services: enable the uwsgi master process for rec-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/964629 (https://phabricator.wikimedia.org/T347475) (owner: Kevin Bazira)
[07:56:52] (PS1) Muehlenhoff: Add monitoring check for nftables [puppet] - https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499)
[07:58:01] (PS2) Muehlenhoff: Add monitoring check for nftables [puppet] - https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499)
[07:58:43] (PS3) Slyngshede: C:prometheus::node_dpkg_success [puppet] - https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764)
[08:00:04] hashar and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T0800)
[08:00:13] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[08:01:11] (CR) CI reject: [V: -1] Add monitoring check for nftables [puppet] - https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[08:02:16] (CR) Clément Goubert: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/964534 (owner: Elukey)
[08:04:22] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43969/console" [puppet] - https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: Slyngshede)
[08:06:45] SRE, SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (MatthewVernon) The original video file is gone; is there still an issue with the PDF collection? Thumbor has had problems handling PDFs (cf T337649), altern...
[08:08:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T343198)', diff saved to https://phabricator.wikimedia.org/P52878 and previous config saved to /var/cache/conftool/dbconfig/20231010-080847-arnaudb.json
[08:08:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[08:08:52] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[08:09:03] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[08:09:05] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[08:09:18] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[08:09:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T343198)', diff saved to https://phabricator.wikimedia.org/P52879 and previous config saved to /var/cache/conftool/dbconfig/20231010-080924-arnaudb.json
[08:11:09] (CR) Elukey: [C: +2] team-sre: make KubernetesAPILatency more lenient [alerts] - https://gerrit.wikimedia.org/r/964534 (owner: Elukey)
[08:11:23] (CR) Elukey: [C: +2] profile::prometheus::k8s: add k8s-pods-kserve config [puppet] - https://gerrit.wikimedia.org/r/964551 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[08:13:01] (PS3) Muehlenhoff: Add monitoring check for nftables [puppet] - https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499)
[08:23:02] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[08:24:41] !log wikitech-static: cleanup image archive directory: T348503
[08:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:45] T348503: wikitech-static is out of disk - https://phabricator.wikimedia.org/T348503
[08:26:37] (PS1) Volans: locking: load also ~/.etcdrc for the running user [software/spicerack] - https://gerrit.wikimedia.org/r/964852 (https://phabricator.wikimedia.org/T341973)
[08:27:16] (PS1) TrainBranchBot: group0 wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/964853 (https://phabricator.wikimedia.org/T347081)
[08:27:18] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/964853 (https://phabricator.wikimedia.org/T347081) (owner: TrainBranchBot)
[08:28:03] we are running the mediawiki train
[08:28:34] (Merged) jenkins-bot: group0 wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/964853 (https://phabricator.wikimedia.org/T347081) (owner: TrainBranchBot)
[08:31:23] (CR) CI reject: [V: -1] locking: load also ~/.etcdrc for the running user [software/spicerack] - https://gerrit.wikimedia.org/r/964852 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[08:34:09] (PS8) Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319)
[08:35:19] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.30 refs T347081
[08:35:23] T347081: 1.41.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T347081
[08:37:05] (CR) CI reject: [V: -1] sre.hosts.reimage: update to support puppetserver [cookbooks] - https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: Jbond)
[08:37:15] SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Antoine_Quhen)
[08:38:24] (PS1) Elukey: profile::prometheus::k8s: fix k8s-pods-kserve settings [puppet] - https://gerrit.wikimedia.org/r/964854
[08:38:30] SRE, SRE-Access-Requests, Data Engineering and Event Platform Team, Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (Antoine_Quhen) @Jelto done. wikitech username & email address checked. Thanks!
[08:38:36] (CR) Klausman: [C: +1] profile::prometheus::k8s: add k8s-pods-kserve config [puppet] - https://gerrit.wikimedia.org/r/964551 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[08:40:10] (PS2) Volans: locking: load also ~/.etcdrc for the running user [software/spicerack] - https://gerrit.wikimedia.org/r/964852 (https://phabricator.wikimedia.org/T341973)
[08:43:12] (CR) Klausman: [C: +1] profile::prometheus::k8s: fix k8s-pods-kserve settings [puppet] - https://gerrit.wikimedia.org/r/964854 (owner: Elukey)
[08:44:02] (CR) Elukey: [C: +2] profile::prometheus::k8s: fix k8s-pods-kserve settings [puppet] - https://gerrit.wikimedia.org/r/964854 (owner: Elukey)
[08:44:18] (PS8) Jbond: late_command.sh: Add logic to rerad puppet version from config-master [puppet] - https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319)
[08:44:25] (Abandoned) Jbond: install_server: add directory for host metadata [puppet] - https://gerrit.wikimedia.org/r/964524 (https://phabricator.wikimedia.org/T348319) (owner: Jbond)
[08:44:42] (CR) Jbond: "updated ready for another pass" [puppet] - https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: Jbond)
[08:45:50] (PS9) Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319)
[08:46:47] (CR) Jbond: "updated" [cookbooks] - https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: Jbond)
[08:49:41] there are burst of errors in the MediaWiki PageViewInfo log bucket since ~ 8:00 so that is before the train
[08:49:50] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/964852 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[08:51:42] ahh Failed fetching http://localhost:6011/wikimedia.org/v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/user/Mainstream_economics/daily/20230811/20231009: There was a problem during the HTTP request: 503 Service Unavailable
[08:52:32] (CR) Volans: [C: +2] locking: load also ~/.etcdrc for the running user [software/spicerack] - https://gerrit.wikimedia.org/r/964852 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[08:52:48] hashar: that's restbase
[08:55:22] claime: so essentially we can "ignore" it or should I file a task for whatever issue is going on to be investigated?
[08:55:31] (looks like there are less errors now
[08:56:00] Trying to find graphs for it
[08:56:21] But it looks like we don't graph 503s for metrics_pageviews
[08:57:15] (Merged) jenkins-bot: locking: load also ~/.etcdrc for the running user [software/spicerack] - https://gerrit.wikimedia.org/r/964852 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[09:00:37] (PS1) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/964855
[09:00:40] (CR) Jbond: [C: -1] "-1 is for the file location otherwise lgtm baring the comment about alertmanager" [puppet] - https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[09:01:01] claime: it's AQS actually
[09:01:09] >_>
[09:01:12] as in, restbase calls AQS 1.0 I think
[09:01:19] ask hugh, he should know
[09:04:35] (CR) Jbond: [C: +1] "lgtm but see q inline" [puppet] - https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: Jelto)
[09:05:05] (PS1) Volans: CHANGELOG: add changelogs for release v7.4.1 [software/spicerack] - https://gerrit.wikimedia.org/r/964857
[09:07:12] I'm seeing one 504 in logstash between 0800 and now for that path and that's it
[09:08:16] (PS4) Slyngshede: C:prometheus::node_dpkg_success [puppet] - https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764)
[09:08:49] That url looks like it's responding correctly from 2 different appservers (1 eqiad, 1 codfw)
[09:09:54] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43970/console" [puppet] - https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: Slyngshede)
[09:10:35] (CR) Muehlenhoff: [C: +2] buster updates [puppet] - https://gerrit.wikimedia.org/r/964855 (owner: Muehlenhoff)
[09:10:59] (PS1) FNegri: pdns_server: rename privilege for bookworm [puppet] - https://gerrit.wikimedia.org/r/964858 (https://phabricator.wikimedia.org/T341285)
[09:11:10] (CR) Jbond: ci: add Gerrit ssh key to ssh_known_hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: Hashar)
[09:11:25] (CR) CI reject: [V: -1] pdns_server: rename privilege for bookworm [puppet] - https://gerrit.wikimedia.org/r/964858 (https://phabricator.wikimedia.org/T341285) (owner: FNegri)
[09:11:46] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v7.4.1 [software/spicerack] - https://gerrit.wikimedia.org/r/964857 (owner: Volans)
[09:12:05] (PS1) Elukey: ml-services: add listener for mw-api in the rec-api-ng's config [deployment-charts] - https://gerrit.wikimedia.org/r/964859 (https://phabricator.wikimedia.org/T347475)
[09:12:08] (CR) Volans: [C: -1] "Couple of leftovers but LGTM otherwise" [puppet] - https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: Jbond)
[09:12:14] (PS2) Elukey: ml-services: add listener for mw-api in the rec-api-ng's config [deployment-charts] - https://gerrit.wikimedia.org/r/964859 (https://phabricator.wikimedia.org/T347475)
[09:12:57] (PS5) Slyngshede: C:prometheus::node_dpkg_success [puppet] - https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764)
[09:14:14] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43971/console" [puppet] - https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: Slyngshede)
[09:14:18] hashar: Do you have a logstash link with the errors?
[09:14:33] Mediawiki train done
[09:14:39] claime: let me look it up again :)
[09:14:47] (PS2) FNegri: pdns_server: rename privilege for bookworm [puppet] - https://gerrit.wikimedia.org/r/964858 (https://phabricator.wikimedia.org/T341285)
[09:15:08] there were some 503 from the http query
[09:15:20] (CR) CI reject: [V: -1] pdns_server: rename privilege for bookworm [puppet] - https://gerrit.wikimedia.org/r/964858 (https://phabricator.wikimedia.org/T341285) (owner: FNegri)
[09:15:34] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v7.4.1 [software/spicerack] - https://gerrit.wikimedia.org/r/964857 (owner: Volans)
[09:16:05] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/964844 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[09:16:13] claime: https://logstash.wikimedia.org/goto/a4ec8e4e80f00b4c3bca9ee554531298
[09:16:24] thanks
[09:16:37] 503 Service Unavailable from some localhost:6011 service
[09:16:52] Yeah, that localhost service is the envoy listener for restbase
[09:16:58] (CR) Klausman: [C: +1] ml-services: add listener for mw-api in the rec-api-ng's config [deployment-charts] - https://gerrit.wikimedia.org/r/964859 (https://phabricator.wikimedia.org/T347475) (owner: Elukey)
[09:17:19] (CR) Kevin Bazira: [C: +1] ml-services: add listener for mw-api in the rec-api-ng's config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/964859 (https://phabricator.wikimedia.org/T347475) (owner: Elukey)
[09:18:17] It's intermittent I think because I'm trying the most recent
examples and they don't error out [09:18:24] (03PS3) 10FNegri: pdns_server: rename privilege for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964858 (https://phabricator.wikimedia.org/T341285) [09:18:50] (03CR) 10CI reject: [V: 04-1] pdns_server: rename privilege for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964858 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [09:18:57] restbase logstash is of no help [09:19:08] (03PS1) 10Volans: Upstream release v7.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/964860 [09:19:11] https://logstash.wikimedia.org/goto/8171270124a58c09fa4c2953dc7de679 [09:19:14] Thanks. [09:19:47] Ah! [09:19:50] One is useful [09:20:01] Socket hangup from aqs [09:20:19] (03PS4) 10FNegri: pdns_server: rename privilege for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964858 (https://phabricator.wikimedia.org/T341285) [09:20:54] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964858 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [09:22:03] hnowlan: The cassandra-aqs logstash is not much use for this, is it? 
[09:22:14] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [09:22:25] (03PS1) 10Slyngshede: P:monitoring remove check_cpufreq check from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/964861 (https://phabricator.wikimedia.org/T332764) [09:22:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/964860 (owner: 10Volans) [09:23:06] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib: Add monkey patching [puppet] - 10https://gerrit.wikimedia.org/r/963299 (https://phabricator.wikimedia.org/T314776) (owner: 10Jbond) [09:23:17] (03PS9) 10Jbond: late_command.sh: Add logic to read puppet_version from local file [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) [09:23:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] pdns_server: rename privilege for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964858 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [09:24:08] (03CR) 10Volans: [C: 03+2] Upstream release v7.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/964860 (owner: 10Volans) [09:24:45] claime: I haven't looked at that before unfortunately. aqs1 is basically restbase again so I can have a look at the service to see if I can find anything [09:24:54] (03CR) 10FNegri: [C: 03+2] pdns_server: rename privilege for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964858 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [09:25:16] hnowlan: only trace I found relevant in the restbase dash is https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2023.41?id=PQuuGIsB2F9ZGV9ikxvd [09:25:19] (03PS1) 10Slyngshede: P:monitoring Cleanup after cpu_freq removal. 
[puppet] - 10https://gerrit.wikimedia.org/r/964862 (https://phabricator.wikimedia.org/T332764) [09:25:44] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:53] There's something fishy, either the mw logs are lying in some way, or wtf: "Failed fetching http://localhost:6011/wikimedia.org/v1/metrics/pageviews/per-article/it.wikipedia.org/all-access/user/Glucosio/daily/20230811/20231009: * Error fetching URL: Could not resolve host: localhost" [09:28:06] (03Merged) 10jenkins-bot: Upstream release v7.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/964860 (owner: 10Volans) [09:28:24] o_o [09:28:27] I can obviously curl that url just fine [09:30:05] (03CR) 10Volans: [C: 04-1] "Some minor issues to fix, the logic looks ok." [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [09:32:12] jouncebot: nowandnext [09:32:12] For the next 0 hour(s) and 27 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T0800) [09:32:12] In 0 hour(s) and 27 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1000) [09:32:32] (03CR) 10Volans: [C: 03+1] "LGTM, reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [09:33:25] !log uploaded spicerack_7.4.1 to apt.wikimedia.org bullseye-wikimedia [09:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:20] hnowlan: The messages start at 07:56, and I can find nothing related around that time [09:35:23] (03CR) 10Filippo Giunchedi: "Thank you for looking into this! 
See inline" [puppet] - 10https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [09:35:55] (03PS1) 10Slyngshede: P:monitoring Puppet runs are now monitored by Prometheus. [puppet] - 10https://gerrit.wikimedia.org/r/964869 (https://phabricator.wikimedia.org/T332764) [09:36:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/964861 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [09:40:58] claime: I can't find anything useful in aqs itself (it's pretty quiet log-wise) but that error seems pretty damning (and baffling) [09:41:26] (03PS1) 10Ladsgroup: Set pagelinks migration stage of cebwiki to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964870 (https://phabricator.wikimedia.org/T345732) [09:41:57] (03PS6) 10Slyngshede: C:prometheus::node_dpkg_success [puppet] - 10https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) [09:42:02] there's an initial burst at 07:49 for cawiki exclusively but that doesn't line up with anything [09:42:39] I can do a call for the url above from mw-on-k8s [09:43:01] jouncebot: nowandnext [09:43:01] For the next 0 hour(s) and 16 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T0800) [09:43:02] In 0 hour(s) and 16 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1000) [09:43:15] claime: can I deploy stuff? [09:43:44] (03PS1) 10Majavah: wiki-replicas: Update IP address for cloudcontrol1006 [puppet] - 10https://gerrit.wikimedia.org/r/964871 (https://phabricator.wikimedia.org/T347381) [09:43:52] Amir1: I mean, would it make things worse? 
idk [09:44:16] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [09:44:45] I don't think it would [09:45:31] hnowlan: I can't reproduce from a mwdebug host through mwscript eval.php either [09:45:48] (03CR) 10Slyngshede: C:prometheus::node_dpkg_success (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [09:46:00] Amir1: Then honestly go ahead [09:46:12] awesome. thanks. [09:46:16] (03CR) 10Ladsgroup: [C: 03+2] Set pagelinks migration stage of cebwiki to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964870 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [09:46:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964870 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [09:46:40] stupid question but why is this just happening for jobrunners? 
I don't really know what PageViewInfo is [09:46:57] (03Merged) 10jenkins-bot: Set pagelinks migration stage of cebwiki to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964870 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [09:47:22] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:964870|Set pagelinks migration stage of cebwiki to write both (T345732)]] [09:47:26] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [09:47:43] hnowlan: No, it's happening mostly for jobrunners, but there are similar errors for other servergroups https://logstash.wikimedia.org/goto/82a46c503b5ab06d1bcedf6fa2bd5b47 [09:48:27] Huh they don't have the same errors [09:48:34] Non-jobrunners get 503 [09:48:40] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:964870|Set pagelinks migration stage of cebwiki to write both (T345732)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:48:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [09:48:47] jobrunners get the weird localhost resolve issue [09:49:21] jobrunners are getting thousands more messages than non-jobrunners too [09:49:29] but good point though [09:50:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:50:53] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [09:52:54] claime: uhhh wut https://logstash.wikimedia.org/goto/e21e69d44d71ba4d4434c8c7724a777a [09:52:58] this happens every day? [09:53:19] o_O [09:53:27] different errors though maybe? 
just 503s [09:53:43] does this align with some kind of data import to cassandra I wonder [09:54:12] No, even if I take the data from 04/10, there are both 503s and the weird could not resolve localhost [09:54:45] re: import to cassandra I wouldn't know unfortunately [09:54:57] I've asked in -analytics [09:55:08] (03PS10) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [09:55:47] thanks [09:55:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:56:08] (03CR) 10Jbond: "thanks for the feedback ill perform a re-imaged of a puppet5 to puppet5 to ensure nothing is broken but i cant test the other pathwasy unt" [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [09:56:32] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:964870|Set pagelinks migration stage of cebwiki to write both (T345732)]] (duration: 09m 10s) [09:56:34] At least with this I think we're at bad-but-not-critical [09:56:36] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [09:56:36] jouncebot: nowandnext [09:56:36] For the next 0 hour(s) and 3 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T0800) [09:56:36] In 0 hour(s) and 3 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1000) [09:57:01] erk, we got a peak of close to 900k errors on 27/09 [09:57:08] (03PS5) 10Jelto: gitlab: use one sshkey for gitlab and remove suffix [puppet] - 
10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) [09:58:42] (03CR) 10Filippo Giunchedi: C:prometheus::node_dpkg_success (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [09:59:26] claime: doesn't seem like it's import-related (at least as far as timing is concerned) [09:59:33] (03CR) 10Jelto: gitlab: use one sshkey for gitlab and remove suffix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1000) [10:00:24] (03CR) 10Jbond: late_command.sh: Add logic to read puppet_version from local file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [10:00:26] (03PS7) 10Slyngshede: C:prometheus::node_dpkg_success [puppet] - 10https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) [10:00:28] (03CR) 10Jbond: [C: 03+2] late_command.sh: Add logic to read puppet_version from local file [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [10:00:34] (03PS10) 10Jbond: late_command.sh: Add logic to read puppet_version from local file [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) [10:00:42] y'all deploying/breaking things currently? 
:-) I'd like to +2 a beta-only config change [10:01:12] (03CR) 10Slyngshede: C:prometheus::node_dpkg_success (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [10:01:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [10:01:48] TheresNoTime: I'm not, we're debugging a strange issue with restbase/aqs, and I think Amir1's done with his backport [10:01:52] TheresNoTime: It's not breaking, It's testing how things react when not everything goes to plan [10:02:24] I am done ^_^ [10:02:46] :) going to quickly get that beta-only change done then [10:03:20] (03PS11) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [10:03:43] (03PS1) 10Majavah: P:spicerack: install python3-defusedxml on cloudcmin hosts [puppet] - 10https://gerrit.wikimedia.org/r/964873 [10:03:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964611 (https://phabricator.wikimedia.org/T348487) (owner: 10MusikAnimal) [10:04:13] hnowlan: looks like it's been happening more since 16/08 [10:04:25] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [10:04:35] (03Merged) 10jenkins-bot: Enable UrlShortenerEnableQrCode on Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964611 (https://phabricator.wikimedia.org/T348487) (owner: 10MusikAnimal) [10:05:08] * TheresNoTime done [10:05:24] It started on 12/07 from what I can see [10:05:43] and it's been happening at varied levels ever since [10:06:23] bbiab since it's not urgent [10:06:34] ack [10:08:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [10:09:45] (03PS1) 10Muehlenhoff: idp: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/964874 [10:10:47] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43972/console" [puppet] - 10https://gerrit.wikimedia.org/r/964873 (owner: 10Majavah) [10:16:19] (03CR) 10Jelto: [C: 03+2] gitlab: use one sshkey for gitlab and remove suffix [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [10:16:43] (03CR) 10Jelto: [C: 03+2] gitlab: use one sshkey for gitlab and remove suffix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [10:20:04] (03PS12) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [10:25:14] (03PS13) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [10:25:36] (03CR) 10Ladsgroup: [C: 03+1] "Generally, looks okay but you need to deploy them yourself." 
[puppet] - 10https://gerrit.wikimedia.org/r/964871 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [10:30:17] (03PS3) 10Dreamy Jazz: Enable display of Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) [10:31:00] (03CR) 10CI reject: [V: 04-1] Enable display of Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [10:33:13] (03PS14) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [10:35:52] (03CR) 10EoghanGaffney: [C: 03+1] gerrit : Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952457 (owner: 10Muehlenhoff) [10:36:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff) [10:37:20] (03PS4) 10Dreamy Jazz: Enable display of Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) [10:37:30] RECOVERY - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is OK: wikitech-static OK - wikitech and wikitech-static in sync (36617 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [10:39:09] 10SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Release-Engineering-Team: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (10hashar) The leftover files under `/home` got cleaned up: ` $ ssh pcc-worker1002.puppet-diffs.eqiad1.wikimedia.cloud df -ih / Filesystem... 
[10:41:52] (03PS5) 10Dreamy Jazz: Enable display of Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) [10:42:55] 10ops-codfw, 10decommission-hardware: decommission ores200*.codfw.wmnet - https://phabricator.wikimedia.org/T348514 (10klausman) [10:43:10] 10ops-eqiad, 10decommission-hardware: decommission ores100*.eqiad.wmnet - https://phabricator.wikimedia.org/T348515 (10klausman) [10:43:38] 10ops-codfw, 10decommission-hardware: decommission ores200*.codfw.wmnet - https://phabricator.wikimedia.org/T348514 (10klausman) [10:43:48] 10ops-eqiad, 10decommission-hardware: decommission ores100*.eqiad.wmnet - https://phabricator.wikimedia.org/T348515 (10klausman) [10:46:16] 10SRE, 10ops-codfw, 10decommission-hardware: decommission ores200*.codfw.wmnet - https://phabricator.wikimedia.org/T348514 (10klausman) [10:46:18] 10SRE, 10ops-codfw, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{2001..2009}.codfw.wmnet - https://phabricator.wikimedia.org/T348462 (10klausman) [10:46:44] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission ores100*.eqiad.wmnet - https://phabricator.wikimedia.org/T348515 (10klausman) [10:47:08] (03Abandoned) 10Hashar: Lower trickle_fsync_interval to 8mb [puppet] - 10https://gerrit.wikimedia.org/r/300100 (https://phabricator.wikimedia.org/T140825) (owner: 10GWicke) [10:47:26] 10SRE, 10ops-eqiad, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{1001..1009}.eqiad.wmnet - https://phabricator.wikimedia.org/T348144 (10klausman) [10:47:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff) [10:48:49] (03CR) 10Jbond: [C: 03+1] idp: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff) [10:50:35] (03CR) 10FNegri: [C: 03+1] "LGTM but I would wait for Volans to double check." 
[puppet] - 10https://gerrit.wikimedia.org/r/964873 (owner: 10Majavah) [10:51:32] (03CR) 10Slyngshede: [C: 03+2] P:monitoring remove check_cpufreq check from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/964861 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [10:52:39] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100.eqiad.wmnet'] [10:52:57] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1100.eqiad.wmnet'] [10:54:08] (03CR) 10Slyngshede: [C: 03+2] C:prometheus::node_dpkg_success [puppet] - 10https://gerrit.wikimedia.org/r/964850 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [10:54:38] (03PS2) 10Kamila Součková: benthos/mw_accesslog_metrics: add cluster label [puppet] - 10https://gerrit.wikimedia.org/r/962041 [10:54:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T343198)', diff saved to https://phabricator.wikimedia.org/P52880 and previous config saved to /var/cache/conftool/dbconfig/20231010-105443-arnaudb.json [10:54:48] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [10:54:51] (03PS1) 10EoghanGaffney: [ci/firewall] Add cumin+deploy hosts to CI http allow list [puppet] - 10https://gerrit.wikimedia.org/r/964881 (https://phabricator.wikimedia.org/T340788) [10:55:14] (03CR) 10Clément Goubert: [C: 03+1] benthos/mw_accesslog_metrics: add cluster label [puppet] - 10https://gerrit.wikimedia.org/r/962041 (owner: 10Kamila Součková) [10:57:35] (03CR) 10Kamila Součková: [C: 03+2] benthos/mw_accesslog_metrics: add cluster label [puppet] - 10https://gerrit.wikimedia.org/r/962041 (owner: 10Kamila Součková) [10:58:41] (03PS2) 10Slyngshede: P:monitoring Cleanup after cpu_freq removal. 
[puppet] - 10https://gerrit.wikimedia.org/r/964862 (https://phabricator.wikimedia.org/T332764) [11:02:21] (03PS1) 10Slyngshede: P:monitoring absent dpkg monitoring from icinga. [puppet] - 10https://gerrit.wikimedia.org/r/964882 (https://phabricator.wikimedia.org/T332764) [11:03:41] (03CR) 10Slyngshede: "Forgot to absent the nrpe::monitor_service for dpkg." [puppet] - 10https://gerrit.wikimedia.org/r/964882 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [11:04:23] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:06:00] PROBLEM - PHP opcache health on mw2427 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:06:28] Huh [11:08:24] That seems... not true https://grafana.wikimedia.org/goto/IMSBcoMSz?orgId=1 [11:09:12] PROBLEM - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache full on php 7.4. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:09:25] ok now I'm worried [11:09:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P52882 and previous config saved to /var/cache/conftool/dbconfig/20231010-110950-arnaudb.json [11:10:32] RECOVERY - PHP opcache health on mw2427 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:11:16] Ooookay [11:11:30] slyngs: I launched a recheck on mw2427 [11:11:38] Aah, okay [11:12:00] And my dashboard was on the wrong appserver of course [11:12:22] Yeah, they are both jobrunners [11:13:16] And yeah their opcache is pretty used up [11:13:44] Something happened with a deploy at 0345 last night [11:14:06] https://grafana.wikimedia.org/goto/Cs83cTMIz?orgId=1 [11:15:25] Yeah that's the train deployment [11:15:36] hashar: ^ [11:16:11] A few of the others have similar graphs, the opcache just isn't full yet [11:16:12] Something in the train is affecting opcache usage something hard [11:16:44] I think we need to rollback and find out [11:18:18] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Pull in local copy of Codex. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/964412 (owner: 10Slyngshede) [11:18:27] (03PS2) 10Muehlenhoff: idp: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/964874 [11:18:33] (03CR) 10Muehlenhoff: idp: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff) [11:21:46] PROBLEM - PHP opcache health on mw2282 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:21:57] joe: Sorry to ping you in the middle of SRECon but I need an adult [11:22:01] :p [11:22:25] claime: about what? [11:22:33] joe: opcache full following train [11:22:43] oh, where? 
just one machine? [11:22:49] That's 3 now [11:22:56] so jobrunners I gather? [11:22:59] yep [11:23:30] joe: pattern is this https://grafana.wikimedia.org/goto/Cs83cTMIz?orgId=1 [11:23:34] so yeah, the immediate solution is to do a rolling restart [11:23:46] PROBLEM - PHP opcache health on mw2446 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:23:50] ok [11:23:52] on it [11:24:26] but the issue is baffling a bit [11:24:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P52883 and previous config saved to /var/cache/conftool/dbconfig/20231010-112456-arnaudb.json [11:25:02] (03CR) 10Volans: "The patch is technically correct, but I left some questions inline on how we want to treat the more general problem." [puppet] - 10https://gerrit.wikimedia.org/r/964873 (owner: 10Majavah) [11:25:03] claime: the thing to check is if the periodic restarts are working or not [11:26:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff) [11:26:16] PROBLEM - PHP opcache health on mw2281 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:26:17] joe: Looking at one jobrunner it won't run until tomorrow [11:26:39] sudo cookbook sre.mediawiki.restart-appservers --datacenters codfw --clusters jobrunner -p 10 -- php7.4-fpm [11:26:43] looks good ? [11:27:07] Haven't I restarted fpm as part of the train deploy? 
[11:27:18] (03CR) 10Volans: sre.ganeti.addnode: Switch firewall check use an IP address (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/963951 (owner: 10Muehlenhoff) [11:28:10] over 15 days it looks like it is a regular pattern ( https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-site=All&var-cluster=jobrunner&var-node=mw2427&var-php_version=proxy:unix:%2Frun%2Fphp%2Ffpm-www.%2A&from=now-15d&to=now&viewPanel=96 ) [11:28:58] in the train there is a step to clean up old versions which deletes the files from disk, I guess that should also trigger a garbage collection (or whatever the term) of the php op cache [11:29:08] Hmm I'm not sure how the php-fpm restart from scap works but on one jobrunner I'm checking php-fpm master process has been running since 28/09 [11:29:47] Anyways, I'm launching the php-fpm roll restart on jobrunners codfw [11:29:54] 10SRE, 10SRE-Access-Requests: Deployment access for Search Platform SWE on Flink WDQS and Search pipelines - https://phabricator.wikimedia.org/T347560 (10MoritzMuehlenhoff) 05Open→03Stalled @gehel: When the list of applications is finalised, please update the task description, for now I'm setting this to S... 
[11:29:56] !log cgoubert@cumin1001 START - Cookbook sre.mediawiki.restart-appservers [11:30:16] well I guess the jobrunners don't get a restart :/ [11:30:28] RECOVERY - PHP opcache health on mw2353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:30:29] yeah, guess they don't [11:30:34] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.mediawiki.restart-appservers (exit_code=0) [11:30:52] RECOVERY - PHP opcache health on mw2281 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:30:52] RECOVERY - PHP opcache health on mw2282 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:31:05] And since the periodic restart runs once every 24h, it didn't catch the opcache being too full, so didn't autorestart php-fpm [11:31:22] RECOVERY - PHP opcache health on mw2446 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:31:45] I'll try and keep an eye on it to see if it fills back up, which would indicate an issue with something in the train [11:31:57] If not they just needed a restart I guess [11:32:29] I'm gonna do eqiad just as a precaution [11:32:54] !log cgoubert@cumin1001 START - Cookbook sre.mediawiki.restart-appservers [11:32:54] 10SRE, 10Infrastructure-Foundations, 10serviceops: etcd increased QGET traffic since January 2023 - https://phabricator.wikimedia.org/T348525 (10Volans) p:05Triage→03Medium [11:33:21] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.mediawiki.restart-appservers (exit_code=0) [11:33:50] !log installed spicerack 7.4.1 on the cumin hosts [11:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:20] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43973/console" [puppet] - 10https://gerrit.wikimedia.org/r/964881 (https://phabricator.wikimedia.org/T340788) (owner: 10EoghanGaffney) [11:34:43] (03CR) 10Muehlenhoff: Add monitoring check for nftables (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [11:37:02] (03CR) 10Hnowlan: [C: 03+2] thumbor: add imagemagick policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [11:37:54] (03Merged) 10jenkins-bot: thumbor: add imagemagick policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [11:38:57] (03PS4) 10Muehlenhoff: Add monitoring check for nftables [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) [11:39:23] There's something fishy with the dpkg check [11:40:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T343198)', diff saved to https://phabricator.wikimedia.org/P52884 and previous config saved to /var/cache/conftool/dbconfig/20231010-114002-arnaudb.json [11:40:05] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [11:40:10] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:40:18] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [11:40:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T343198)', diff saved to https://phabricator.wikimedia.org/P52885 and previous config saved to /var/cache/conftool/dbconfig/20231010-114024-arnaudb.json [11:41:49] Ah that's probably because the check was removed and is not yet removed 
from icinga [11:46:00] (03PS1) 10Slyngshede: P:monitoring remove systemd check [puppet] - 10https://gerrit.wikimedia.org/r/964886 (https://phabricator.wikimedia.org/T332764) [11:47:43] claime: https://gerrit.wikimedia.org/r/c/operations/puppet/+/964882 [11:48:06] (03CR) 10Clément Goubert: [C: 03+1] P:monitoring absent dpkg monitoring from icinga. [puppet] - 10https://gerrit.wikimedia.org/r/964882 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [11:48:20] Thanks :-) [11:48:29] (03CR) 10Slyngshede: [C: 03+2] P:monitoring absent dpkg monitoring from icinga. [puppet] - 10https://gerrit.wikimedia.org/r/964882 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [11:48:29] np [11:48:42] I just glanced at my karma favicon and it said 1k+ alerts [11:48:54] Then rabbitholed x) [11:49:09] That's a lot of alerts :-) [11:50:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [11:51:22] Let's see if it's sufficient to remove the check from Icinga or if I need to add everything back and remove it correctly [11:52:47] slyngs: Error: /Stage[main]/Profile::Monitoring/Nrpe::Plugin[check_dpkg]/File[/usr/local/lib/nagios/plugins/check_dpkg]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/profile/monitoring/check_dpkg.sh [11:52:59] Then it moves on [11:53:25] Yeah, I removed that script, but forgot to absent the service [11:53:26] I'll do another run to see if the error goes away [11:53:36] (it should) [11:53:41] (tm) [11:53:41] the source attribute needs to be removed from the nrpe::plugin definition [11:53:56] yeah [11:54:28] (WidespreadPuppetFailure) firing: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:54:46] the problem is that the nrpe config gets removed when puppet runs on a node, but the icinga check only gets removed on the next puppet run on the alert servers, so you're going to have some time when icinga checks will fail, but that'll go away on its own [11:56:14] before the last patch it had removed the script, but the config was still there because the nrpe::monitor_service wasn't absented, so everything goes unknown and it doesn't go away with a run on the alerting host [11:56:34] first it = puppet [11:58:22] (03PS1) 10Slyngshede: P:monitoring handle Puppet error on missing source. [puppet] - 10https://gerrit.wikimedia.org/r/964887 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1200) [12:00:29] taavi: claime https://gerrit.wikimedia.org/r/c/operations/puppet/+/964887 [12:00:41] (03CR) 10Clément Goubert: [C: 03+1] P:monitoring handle Puppet error on missing source. [puppet] - 10https://gerrit.wikimedia.org/r/964887 (owner: 10Slyngshede) [12:01:25] taavi: This is what you meant correct? https://gerrit.wikimedia.org/r/c/operations/puppet/+/964887 [12:01:59] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [12:02:19] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [12:02:27] (03CR) 10Slyngshede: [C: 03+2] P:monitoring handle Puppet error on missing source. 
[puppet] - 10https://gerrit.wikimedia.org/r/964887 (owner: 10Slyngshede) [12:04:48] WidespreadPuppetFailure obviously caused by this, for anyone wondering [12:06:21] slyngs: fix ok :) [12:06:27] I think so [12:08:21] (03PS2) 10Majavah: P:spicerack: parametrize cookbook dependencies [puppet] - 10https://gerrit.wikimedia.org/r/964873 [12:08:42] slyngs: sorry I missed those pings, yes that's what I meant [12:09:17] Thanks, and thank you for pointing out the error :-) [12:10:46] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43974/console" [puppet] - 10https://gerrit.wikimedia.org/r/964873 (owner: 10Majavah) [12:11:18] (03CR) 10Majavah: [V: 03+1] P:spicerack: parametrize cookbook dependencies (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964873 (owner: 10Majavah) [12:13:44] (03PS1) 10Hnowlan: thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/964889 (https://phabricator.wikimedia.org/T344233) [12:16:47] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo::rsync: Don't setup rsync for empty list of secondary servers [puppet] - 10https://gerrit.wikimedia.org/r/964844 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [12:17:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/964869 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [12:18:45] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100.eqiad.wmnet'] [12:19:15] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1100.eqiad.wmnet'] [12:21:06] variable @min-width-desktop-wide is undefined in file /srv/mediawiki/php-1.41.0-wmf.30/skins/Vector/skinStyles/ext.echo.styles.alert.less in ext.echo.styles.alert.less on line 22, column 23 [12:21:08] joy :) [12:24:47] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: route edit-,editor- and page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/964044 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [12:25:37] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:26:14] (03Merged) 10jenkins-bot: rest-gateway: route edit-,editor- and page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/964044 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [12:27:41] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [12:28:19] (03CR) 10Jbond: [C: 03+1] P:monitoring Cleanup after cpu_freq removal. 
[puppet] - 10https://gerrit.wikimedia.org/r/964862 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [12:30:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/964873 (owner: 10Majavah) [12:31:09] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:spicerack: parametrize cookbook dependencies [puppet] - 10https://gerrit.wikimedia.org/r/964873 (owner: 10Majavah) [12:32:54] (03CR) 10Volans: "LGTM, couple of final nits inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:34:01] (NodeTextfileStale) firing: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:34:28] (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:36:09] (03PS1) 10Jbond: sre.hardware.upgrade-cookbook: check we get drivers from dell [cookbooks] - 10https://gerrit.wikimedia.org/r/964890 [12:36:37] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [12:37:51] claime: Puppet has been unbroken [12:37:56] (03CR) 10Hashar: ci: add Gerrit ssh key to ssh_known_hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: 10Hashar) [12:38:11] (03CR) 10Volans: sre.hardware.upgrade-cookbook: check we get 
drivers from dell (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/964890 (owner: 10Jbond) [12:38:13] 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): HDD failure in cloudvirt2004-dev - https://phabricator.wikimedia.org/T348531 (10RhinosF1) [12:39:40] (03CR) 10Slyngshede: [C: 03+2] P:monitoring Cleanup after cpu_freq removal. [puppet] - 10https://gerrit.wikimedia.org/r/964862 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [12:39:52] (03CR) 10Jbond: [C: 03+1] idp: Avoid Ferm-specific syntax (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff) [12:43:28] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43975/console" [puppet] - 10https://gerrit.wikimedia.org/r/964869 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [12:44:27] 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): HDD failure in cloudvirt2004-dev - https://phabricator.wikimedia.org/T348531 (10fnegri) [12:44:52] (03PS2) 10Jbond: sre.hardware.upgrade-cookbook: check we get drivers from dell [cookbooks] - 10https://gerrit.wikimedia.org/r/964890 (https://phabricator.wikimedia.org/T348036) [12:44:55] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43976/console" [puppet] - 10https://gerrit.wikimedia.org/r/964869 (https://phabricator.wikimedia.org/T332764) (owner: 10Slyngshede) [12:45:00] (03CR) 10Jbond: sre.hardware.upgrade-cookbook: check we get drivers from dell (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/964890 (https://phabricator.wikimedia.org/T348036) (owner: 10Jbond) [12:47:41] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/964890 (https://phabricator.wikimedia.org/T348036) (owner: 10Jbond) [12:48:27] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-cookbook: 
check we get drivers from dell [cookbooks] - 10https://gerrit.wikimedia.org/r/964890 (https://phabricator.wikimedia.org/T348036) (owner: 10Jbond) [12:48:39] 10SRE, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10fnegri) [12:50:41] (03CR) 10Ssingh: [C: 03+1] "Thanks! Once we merge this, we can rollout the ns1 change today." [homer/public] - 10https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041) (owner: 10Cathal Mooney) [12:57:10] (03PS15) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [12:57:29] (03PS3) 10Muehlenhoff: idp: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/964874 [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1300) [13:00:05] Urbanecm: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] 10SRE, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10fnegri) This is not urgent and can wait a few days if necessary. [13:01:07] (03CR) 10Ayounsi: "Lgtm!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774) (owner: 10Cathal Mooney) [13:01:12] (03CR) 10Ayounsi: [C: 03+1] Interface automation: skip import of existing int IPs and VIPs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774) (owner: 10Cathal Mooney) [13:02:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/964874 (owner: 10Muehlenhoff) [13:02:18] !log fnegri@cumin1001 START - Cookbook sre.dns.netbox [13:02:32] 10SRE, 10Infrastructure-Foundations, 10netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10ayounsi) [13:05:10] i'm here [13:05:14] but didn't see the ping [13:05:16] let's deploy! [13:05:28] (03PS2) 10Urbanecm: growth: Enable section-image recommendations on 10 new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960545 (https://phabricator.wikimedia.org/T345940) [13:05:32] (03CR) 10Urbanecm: [C: 03+2] growth: Enable section-image recommendations on 10 new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960545 (https://phabricator.wikimedia.org/T345940) (owner: 10Urbanecm) [13:06:09] 10SRE, 10Infrastructure-Foundations, 10netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10ayounsi) Some of our transits like Lumen use MEDs so we need to make sure that a global knob doesn't impact those negatively. Another idea is to use BG... 
[13:06:16] (03Merged) 10jenkins-bot: growth: Enable section-image recommendations on 10 new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960545 (https://phabricator.wikimedia.org/T345940) (owner: 10Urbanecm) [13:07:18] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:960545|growth: Enable section-image recommendations on 10 new wikis (T345940)]] [13:07:30] T345940: Section-level "add an image" task: Scale to all Wikipedias that have the Article-level "add an image" task - https://phabricator.wikimedia.org/T345940 [13:08:30] (03PS1) 10Elukey: ml-services: add kserve annotations to isvc services [deployment-charts] - 10https://gerrit.wikimedia.org/r/964899 (https://phabricator.wikimedia.org/T348456) [13:08:38] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:960545|growth: Enable section-image recommendations on 10 new wikis (T345940)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:09:15] (03PS1) 10Muehlenhoff: Add dummy keytabs for apt1002/apt2002 [labs/private] - 10https://gerrit.wikimedia.org/r/964900 (https://phabricator.wikimedia.org/T331613) [13:09:40] (03CR) 10Klausman: [C: 03+1] ml-services: add kserve annotations to isvc services [deployment-charts] - 10https://gerrit.wikimedia.org/r/964899 (https://phabricator.wikimedia.org/T348456) (owner: 10Elukey) [13:11:30] !log urbanecm@deploy2002 urbanecm: Continuing with sync [13:11:38] (03PS2) 10Urbanecm: cswiki: Remove engineer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963843 (https://phabricator.wikimedia.org/T348279) [13:11:41] (03CR) 10Urbanecm: [C: 03+2] cswiki: Remove engineer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963843 (https://phabricator.wikimedia.org/T348279) (owner: 10Urbanecm) [13:12:20] (03CR) 10Elukey: [C: 03+2] ml-services: add kserve annotations to isvc services [deployment-charts] - 10https://gerrit.wikimedia.org/r/964899 (https://phabricator.wikimedia.org/T348456) (owner: 10Elukey) [13:12:22] 
(03Merged) 10jenkins-bot: cswiki: Remove engineer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963843 (https://phabricator.wikimedia.org/T348279) (owner: 10Urbanecm) [13:13:57] (03PS1) 10Jforrester: ExtensionDistributor: Add REL1_41 as the development snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964910 (https://phabricator.wikimedia.org/T346929) [13:15:28] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T348488 (10Papaul) 05Open→03Resolved a:03Papaul Same server we already worked on this [13:15:46] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:16:42] 10SRE, 10ops-codfw, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{2001..2009}.codfw.wmnet - https://phabricator.wikimedia.org/T348462 (10Papaul) a:03Jhancock.wm [13:16:52] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:17:04] some deployment spam incoming, apologies :) [13:17:17] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:960545|growth: Enable section-image recommendations on 10 new wikis (T345940)]] (duration: 09m 59s) [13:17:21] T345940: Section-level "add an image" task: Scale to all Wikipedias that have the Article-level "add an image" task - https://phabricator.wikimedia.org/T345940 [13:17:21] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:18:02] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:963843|cswiki: Remove engineer group (T348279)]] [13:18:06] T348279: Disable Engineer (technical administrator) user group @cswiki - https://phabricator.wikimedia.org/T348279 [13:18:35] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Do we need ping offload servers at all POPs? 
- https://phabricator.wikimedia.org/T345809 (10ayounsi) Thanks for the task and feedback. If the issue is abuse from a limited number of providers (like in {T163312} it seems better to filter out that kin... [13:19:21] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:963843|cswiki: Remove engineer group (T348279)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:19:43] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:19:54] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [13:19:54] !log urbanecm@deploy2002 urbanecm: Continuing with sync [13:20:09] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:21:17] (03PS2) 10Urbanecm: Growth: Enable Welcome survey user research for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964013 (https://phabricator.wikimedia.org/T342353) [13:21:28] (03CR) 10Urbanecm: [C: 03+2] Growth: Enable Welcome survey user research for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964013 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [13:21:39] * Lucas_WMDE also here now but probably not needed ^^ [13:22:05] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:22:22] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:22:31] Lucas_WMDE: if you really want to deploy something, i'm not opposed to that. 
but planning on finishing my patches :)) [13:22:50] nah, I don’t think I have anything to deploy ^^ [13:22:56] (03Merged) 10jenkins-bot: Growth: Enable Welcome survey user research for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964013 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [13:23:17] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:23:50] okay :). i originally meant my patch though (as in, if you really want to deploy but don't have a patch) :D [13:24:04] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:24:43] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:25:26] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:963843|cswiki: Remove engineer group (T348279)]] (duration: 07m 24s) [13:25:30] T348279: Disable Engineer (technical administrator) user group @cswiki - https://phabricator.wikimedia.org/T348279 [13:26:09] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:26:51] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:964013|Growth: Enable Welcome survey user research for enwiki (T342353)]] [13:26:59] T342353: enable opt-in checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T342353 [13:27:32] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/964889 (https://phabricator.wikimedia.org/T344233) (owner: 10Hnowlan) [13:27:51] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye [13:27:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye [13:28:10] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:964013|Growth: Enable Welcome survey user research for enwiki (T342353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:28:30] (03Merged) 10jenkins-bot: thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/964889 (https://phabricator.wikimedia.org/T344233) (owner: 10Hnowlan) [13:29:41] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:31:39] urbanecm: ah, okay :D [13:31:43] I didn’t know how many patches you had ^^ [13:31:49] this is the last one :D [13:31:50] but now I’m in a meeting, so keep going w^ [13:31:51] *^^ [13:31:57] happy meeting then :) [13:32:08] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:32:41] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:33:14] !log urbanecm@deploy2002 urbanecm: Continuing with sync [13:34:34] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox 
has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:35:08] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:36:37] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:37:33] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:39:20] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:40:11] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:964013|Growth: Enable Welcome survey user research for enwiki (T342353)]] (duration: 13m 19s) [13:40:14] T342353: enable opt-in checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T342353 [13:40:16] * urbanecm done [13:42:27] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:44:00] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:44:45] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:47:12] (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: rename site label [puppet] - 10https://gerrit.wikimedia.org/r/964914 [13:48:06] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:49:16] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:49:29] 10SRE, 10ops-eqiad: Broken disk on ganeti1022 - https://phabricator.wikimedia.org/T348429 (10Jclark-ctr) @MoritzMuehlenhoff Replaced drive. 
server still has an error for fan but it is not overheating Idrac is showing an error for fan. System Board Fan1A Standard Performance N/A 0 [13:50:39] (03PS1) 10AikoChou: ml-services: test kserve batcher to revertrisk-multilingual in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/964915 (https://phabricator.wikimedia.org/T348536) [13:50:52] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:52:00] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:52:24] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [13:52:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [13:53:10] (03PS1) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) [13:54:15] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:54:17] 10SRE, 10ops-eqiad: Broken disk on ganeti1022 - https://phabricator.wikimedia.org/T348429 (10Jclark-ctr) a:03Jclark-ctr [13:54:30] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[13:54:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:54:57] (03CR) 10CI reject: [V: 04-1] Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney) [13:54:59] (03PS16) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [13:55:01] (03PS1) 10Jbond: sre.hosts.reimage: remove the call to destroy [cookbooks] - 10https://gerrit.wikimedia.org/r/964917 (https://phabricator.wikimedia.org/T348319) [13:56:25] (03PS2) 10Kamila Součková: benthos/mw_accesslog_metrics: rename the dc label [puppet] - 10https://gerrit.wikimedia.org/r/964914 [13:57:31] (03PS1) 10Ssingh: hiera: announce ns1 IP from bird (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) [13:57:37] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[13:57:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1063.eqiad.wmnet with OS bullseye [13:57:56] (03Abandoned) 10Ssingh: bird: rename ACAST_PS_ADVERTISE to BIRD_IP{4,6}_ADVERTISE [puppet] - 10https://gerrit.wikimedia.org/r/963385 (https://phabricator.wikimedia.org/T348174) (owner: 10Ssingh) [13:58:06] (03PS2) 10AikoChou: ml-services: test kserve batcher for revertrisk-multilingual in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/964915 (https://phabricator.wikimedia.org/T348536) [13:58:15] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:58:23] (03CR) 10Ayounsi: [C: 03+1] "I don't get what's up with Jenkins but lgtm once jenkins is happy." [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney) [13:58:53] (03PS3) 10EoghanGaffney: [gitlab/failover] Handle runner pausing exceptions [cookbooks] - 10https://gerrit.wikimedia.org/r/964523 [13:58:55] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43977/console" [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [13:59:29] (03CR) 10EoghanGaffney: [gitlab/failover] Handle runner pausing exceptions (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/964523 (owner: 10EoghanGaffney) [13:59:47] (03CR) 10Clément Goubert: [C: 03+1] benthos/mw_accesslog_metrics: rename the dc label [puppet] - 10https://gerrit.wikimedia.org/r/964914 (owner: 10Kamila Součková) [14:00:14] (03CR) 10Kamila Součková: [C: 03+2] benthos/mw_accesslog_metrics: rename the dc label [puppet] - 10https://gerrit.wikimedia.org/r/964914 (owner: 10Kamila Součková) [14:00:51] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Remove static routes for ns[01] and replace their 
announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) [14:02:02] 10SRE, 10Traffic, 10Patch-For-Review: Rename ACAST_PS_ADVERTISE in bird and anycast-healthchecker to BIRD_IP_ADVERTISE - https://phabricator.wikimedia.org/T348174 (10ssingh) 05Open→03Declined As mentioned above, abandoning this rename pursuit. There is a lot of stuff to rename and we won't get to it all,... [14:02:40] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [14:02:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye [14:04:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:04:53] (03PS1) 10Muehlenhoff: Add Daniel de Souza to deployers [puppet] - 10https://gerrit.wikimedia.org/r/964919 (https://phabricator.wikimedia.org/T348209) [14:05:24] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:05:37] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:06:02] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . 
[14:06:20] (03CR) 10Gehel: wdqs: Set up graph_split hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [14:06:50] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [14:09:19] (03CR) 10Jelto: [ci/firewall] Add cumin+deploy hosts to CI http allow list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964881 (https://phabricator.wikimedia.org/T340788) (owner: 10EoghanGaffney) [14:10:25] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bullseye [14:12:41] (03CR) 10Ayounsi: "The diff in https://puppet-compiler.wmflabs.org/output/964918/43977/dns2004.wikimedia.org/index.html removes "10.3.0.1" from multiple loca" [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [14:13:10] (03CR) 10Ssingh: [V: 03+1] hiera: announce ns1 IP from bird (codfw) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [14:13:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) on cloudvirt1064 during install i am getting when you reboot the server on console you get the server login prompt but since the system didn't comp... 
[14:14:14] (03PS2) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) [14:15:13] (03PS2) 10Ssingh: hiera: announce ns1 IP from bird (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) [14:15:29] (03CR) 10CI reject: [V: 04-1] Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney) [14:15:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye [14:15:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**) -... [14:17:29] (03CR) 10Ssingh: "Removing the skip_looback for ns1-v4 results in `Duplicate declaration: Augeas[lo_208.80.153.231/32] is already declared`, which is expect" [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [14:17:36] (03CR) 10Gehel: "just adding a question for my own education..." [puppet] - 10https://gerrit.wikimedia.org/r/963964 (https://phabricator.wikimedia.org/T348315) (owner: 10Brouberol) [14:18:05] (03PS3) 10Ssingh: hiera: announce ns1 IP from bird (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) [14:19:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) @papaul Just got same error on db1229 Execution of preseeded command "wget -O /tmp/late_command │ │ │ │ http://apt.wikimedia.org/autoinstall/scripts/la... 
[14:19:17] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43979/console" [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [14:21:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) @BTullis did you have any update on Partitioning/Raid section? [14:23:34] (03PS4) 10Ssingh: hiera: announce ns1 IP from bird (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) [14:24:41] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43980/console" [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [14:25:40] 10SRE, 10ops-eqiad: Broken disk on ganeti1022 - https://phabricator.wikimedia.org/T348429 (10MoritzMuehlenhoff) >>! In T348429#9238826, @Jclark-ctr wrote: > Idrac is showing an error for fan. System Board Fan1A Standard Performance N/A 0 Thanks! We'll replace the server in Q4 with new hardware, hopefully... [14:27:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10BTullis) [14:30:13] (03CR) 10Ssingh: [V: 03+1] "This is still broken: https://puppet-compiler.wmflabs.org/output/964918/43980/dns3003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [14:30:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10BTullis) >>! 
In T342454#9239028, @Jclark-ctr wrote: > @BTullis did you have any update on Partitioning/Raid section? Hi @Jclark-ctr - Apologie... [14:31:48] (03CR) 10Ssingh: [V: 03+1] "The problem here is that skip_loopback: true removes it from all sites whereas we are only setting up ns1 in codfw. This is because the an" [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [14:34:40] (03CR) 10Ayounsi: [C: 03+2] Add ns0 and ns1 /32 routes to anycast_prefixes list [homer/public] - 10https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041) (owner: 10Cathal Mooney) [14:35:25] (03Merged) 10jenkins-bot: Add ns0 and ns1 /32 routes to anycast_prefixes list [homer/public] - 10https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041) (owner: 10Cathal Mooney) [14:37:25] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [14:37:27] (03PS1) 10Hnowlan: service: change state to production for {edit,editor,page}-analytics [puppet] - 10https://gerrit.wikimedia.org/r/964923 (https://phabricator.wikimedia.org/T336391) [14:38:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:02] (03CR) 10Brouberol: [C: 03+2] Install kafka-kit-prometheus-metricsfetcher on kafka brokers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963964 (https://phabricator.wikimedia.org/T348315) (owner: 10Brouberol) [14:41:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:42:34] (03PS3) 10Cathal Mooney: Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) [14:43:30] 10SRE, 10ops-codfw, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{2001..2009}.codfw.wmnet - https://phabricator.wikimedia.org/T348462 (10Jhancock.wm) 05Open→03Resolved [14:43:35] (03CR) 10Ahmon Dancy: [C: 03+1] "Confirming that we received the email notification for successful train pre-sync today. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) (owner: 10Clément Goubert) [14:43:46] (03CR) 10CI reject: [V: 04-1] Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney) [14:44:18] (03CR) 10Clément Goubert: [C: 03+2] P:mw::deployment::server: Don't alert for train-presync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953200 (https://phabricator.wikimedia.org/T342755) (owner: 10Clément Goubert) [14:44:50] 10sre-alert-triage, 10Release-Engineering-Team, 10Patch-For-Review: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert [14:46:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:48:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:51:03] (03CR) 10Ssingh: [C: 03+1] service: change state to production for {edit,editor,page}-analytics [puppet] - 10https://gerrit.wikimedia.org/r/964923 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [14:51:34] (03CR) 10Volans: sre.hosts.reimage: update to support puppetserver (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [14:53:27] (03CR) 10Volans: [C: 03+1] "LGTM if it does what it says :D" [cookbooks] - 10https://gerrit.wikimedia.org/r/964917 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [14:55:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/964523 (owner: 10EoghanGaffney) [15:00:05] eoghan, jelto, and arnoldokoth: How many deployers does it take to do SRE Collaboration Services office hours deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1500). 
[15:06:43] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100'] [15:08:18] (03PS1) 10Arturo Borrero Gonzalez: aborrero: drop access [labs/private] - 10https://gerrit.wikimedia.org/r/964926 [15:08:23] (03CR) 10Hnowlan: [C: 03+2] service: change state to production for {edit,editor,page}-analytics [puppet] - 10https://gerrit.wikimedia.org/r/964923 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [15:14:48] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:23:56] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [15:23:59] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [15:26:14] (03PS2) 10Ebernhardson: admin: Add cirrus-streaming-updater namespace to flink operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/964567 (https://phabricator.wikimedia.org/T347075) [15:26:16] (03PS1) 10Ebernhardson: cirrus-streaming-updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/964928 [15:28:06] (03PS17) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [15:30:30] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink frontend 14th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964929 [15:32:40] 10SRE, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10Jhancock.wm) @fnegri I'm also seeing a potentially failed DIMM. is it safe to power down the server for trou... 
[15:33:42] (03CR) 10Cwhite: [C: 03+2] opensearch: disable shard size check on logging opensearch [puppet] - 10https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) (owner: 10Cwhite) [15:34:56] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1100'] [15:35:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [15:38:15] (03PS4) 10Ejegg: Allow FundraiseUp scripts in Donatewiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) [15:39:13] (03CR) 10Ejegg: "@SBassett, we're hoping to go live with this in one week. Will it be possible to get this settings change +2ed soon?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: 10Ejegg) [15:39:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (10Jclark-ctr) [15:39:52] 10SRE, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10fnegri) @Jhancock.wm yes you can power it down. 
[15:40:26] 10SRE, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10fnegri) [15:41:07] (03PS1) 10Physikerwelt: mathoid: update version [deployment-charts] - 10https://gerrit.wikimedia.org/r/964932 (https://phabricator.wikimedia.org/T137787) [15:42:14] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 2 others: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (10Gehel) [15:42:55] (03PS1) 10DCausse: rdf-streaming-udpater: restrict space usage alert from 1TiB to 50GiB [alerts] - 10https://gerrit.wikimedia.org/r/964934 [15:46:17] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [15:46:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [15:46:52] 10SRE, 10DC-Ops, 10Infrastructure-Foundations: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036 (10Jhancock.wm) My two cents is to fix the issues so that we can stick to the original standard. I agree with Volans. 
[15:49:53] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) [15:52:18] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [15:52:22] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [15:53:50] (03PS1) 10Arturo Borrero Gonzalez: aborrero: remove user [puppet] - 10https://gerrit.wikimedia.org/r/964940 [15:54:01] (NodeTextfileStale) resolved: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:54:19] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [15:54:22] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [15:54:38] (03CR) 10CI reject: [V: 04-1] aborrero: remove user [puppet] - 10https://gerrit.wikimedia.org/r/964940 (owner: 10Arturo Borrero Gonzalez) [15:55:16] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/964941 (https://phabricator.wikimedia.org/T338334) [15:55:40] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "don't merge this without testing this first in codfw1dev." 
[puppet] - 10https://gerrit.wikimedia.org/r/964941 (https://phabricator.wikimedia.org/T338334) (owner: 10Arturo Borrero Gonzalez) [15:58:03] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [15:58:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:00:07] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1600). [16:00:07] No Gerrit patches in the queue for this window AFAICS. [16:00:17] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:00:21] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:00:42] 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T348550 (10phaultfinder) [16:02:01] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:02:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:03:23] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:03:40] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:04:28] (03PS7) 10Andrea Denisse: webperf: Move navtiming metrics to statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791) [16:05:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:05:17] (03PS8) 10Andrea Denisse: webperf: Move navtiming stats to statsd-exporter [puppet] - 
10https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791) [16:05:32] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:06:26] (03CR) 10Andrea Denisse: webperf: Move navtiming stats to statsd-exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791) (owner: 10Andrea Denisse) [16:06:34] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:09:05] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:09:09] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Ahoelzl) Approved. Thanks. [16:09:24] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:11:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:12:05] (03CR) 10SBassett: Allow FundraiseUp scripts in Donatewiki CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: 10Ejegg) [16:14:45] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [16:16:53] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:17:00] 10SRE, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10Jhancock.wm) Re: DIMM I've swapped B1 and B7. if the error recurs in B7, it is the stick. 
If it recurs in B1... [16:17:05] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cp1101 - jclark@cumin1001" [16:18:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cp1101 - jclark@cumin1001" [16:18:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:18:35] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:20:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:20:47] (03PS1) 10Hnowlan: trafficserver: route pageviews to page-analytics [puppet] - 10https://gerrit.wikimedia.org/r/964946 (https://phabricator.wikimedia.org/T336391) [16:21:01] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:21:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1101.mgmt.eqiad.wmnet with reboot policy FORCED [16:22:38] 10SRE, 10MW-on-K8s, 10MediaWiki-Platform-Team, 10MediaWiki-extensions-CentralAuth, and 4 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (10Reedy) [16:22:49] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:25:39] (03CR) 10Majavah: Revert "admin: Temporarily disable legoktm's access" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964174 (owner: 10Legoktm) [16:25:43] (03PS2) 10Majavah: Revert "admin: Temporarily disable legoktm's access" [puppet] - 
10https://gerrit.wikimedia.org/r/964174 (owner: 10Legoktm) [16:26:39] (03CR) 10Majavah: [C: 03+2] Revert "admin: Temporarily disable legoktm's access" [puppet] - 10https://gerrit.wikimedia.org/r/964174 (owner: 10Legoktm) [16:29:23] (03CR) 10Xcollazo: [C: 03+1] Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:32:03] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bullseye [16:32:07] (03CR) 10Jbond: aborrero: remove user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964940 (owner: 10Arturo Borrero Gonzalez) [16:33:26] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend 15th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964949 (https://phabricator.wikimedia.org/T308141) [16:33:45] (03CR) 10Xcollazo: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:33:50] (03PS2) 10Sergio Gimeno: GrowthExperiments: enable AddLink frontend 14th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964929 (https://phabricator.wikimedia.org/T308139) [16:34:29] (03CR) 10Xcollazo: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:34:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:38:40] (03CR) 10Ebernhardson: Pull some flink config down into the chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson) [16:39:42] (03CR) 10Physikerwelt: "I was trying to find a deployment windows on https://wikitech.wikimedia.org/wiki/Deployments to establish and document a mathoid deploymen" [deployment-charts] - 10https://gerrit.wikimedia.org/r/964932 (https://phabricator.wikimedia.org/T137787) (owner: 10Physikerwelt) [16:42:33] 10SRE, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10Jhancock.wm) new error popped up after rebooting T348550 [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1700) [17:04:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:06:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:06:16] (03PS1) 10Kamila Součková: kube-state-metrics: create image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/964950 (https://phabricator.wikimedia.org/T343801) [17:06:49] (03PS1) 10MusikAnimal: diffs: add line number headings to inline diffs [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964599 (https://phabricator.wikimedia.org/T346460) [17:09:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:09:57] (03PS5) 10Ssingh: hiera: announce ns1 IP from bird (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) [17:10:56] 10SRE, 10DNS, 10Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10ssingh) Hi @NMariano-WMF: thanks for the request. We have some questions about this task, specifically related to some of the records requested here, that is better suited for a call. Is it fine if we se... [17:11:11] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43981/console" [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [17:12:28] (03CR) 10Ssingh: [V: 03+1] "PCC looks OK, NOOP on non-codfw hosts. If someone has better ideas on how to skip loopback for the unicast address, I am all ears!" 
[puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [17:13:58] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add eqiad new row switches - cmooney@cumin1001" [17:14:18] 10SRE, 10Fundraising-Backlog, 10SRE Observability: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (10Jgreen) [17:14:25] PROBLEM - OSPF status on ssw1-e1-eqiad.mgmt is CRITICAL: OSPFv2: 10/12 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:14:25] PROBLEM - OSPF status on ssw1-f1-eqiad.mgmt is CRITICAL: OSPFv2: 10/12 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:14:48] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add eqiad new row switches - cmooney@cumin1001" [17:14:52] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add eqiad new row switches - cmooney@cumin1001" [17:15:59] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add eqiad new row switches - cmooney@cumin1001" [17:21:08] (03CR) 10CI reject: [V: 04-1] diffs: add line number headings to inline diffs [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964599 (https://phabricator.wikimedia.org/T346460) (owner: 10MusikAnimal) [17:21:53] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e7-eqiad [17:21:59] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e7-eqiad [17:22:31] (03CR) 10MusikAnimal: "recheck" [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964599 (https://phabricator.wikimedia.org/T346460) (owner: 10MusikAnimal) [17:22:58] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f7-eqiad [17:23:04] !log 
cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f7-eqiad [17:27:34] (03CR) 10BBlack: [C: 03+1] "LGTM! There's some risk of temporary conflict on initial puppetization of the codfw hosts (some timing or other conflict between the remov" [puppet] - 10https://gerrit.wikimedia.org/r/964918 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [17:28:01] (03PS1) 10Cathal Mooney: Remove Leaf devices in E1 and F1 from BGP RR List [homer/public] - 10https://gerrit.wikimedia.org/r/964952 (https://phabricator.wikimedia.org/T322937) [17:28:03] (03PS1) 10Cathal Mooney: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/homer/public [homer/public] - 10https://gerrit.wikimedia.org/r/964953 [17:28:24] (03Abandoned) 10Cathal Mooney: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/homer/public [homer/public] - 10https://gerrit.wikimedia.org/r/964953 (owner: 10Cathal Mooney) [17:28:50] (03PS2) 10Cathal Mooney: Remove Leaf devices in E1 and F1 from BGP RR List [homer/public] - 10https://gerrit.wikimedia.org/r/964952 (https://phabricator.wikimedia.org/T322937) [17:29:51] (03Abandoned) 10Cathal Mooney: Remove Leaf devices in E1 and F1 from BGP RR List [homer/public] - 10https://gerrit.wikimedia.org/r/964952 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [17:36:54] 10SRE, 10Fundraising-Backlog, 10SRE Observability: Simplify and fix icinga fr-tech user configuration - https://phabricator.wikimedia.org/T348559 (10Jgreen) [17:39:22] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e5-eqiad [17:41:38] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e5-eqiad [17:44:34] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e6-eqiad [17:46:49] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e6-eqiad [17:50:30] !log 
cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f5-eqiad [17:51:11] 10SRE, 10DNS, 10Traffic: Update DNS records for Greenhouse - https://phabricator.wikimedia.org/T348335 (10NMariano-WMF) Hi @Lhiraide would you be ok with meeting with @ssingh since this is your request to have DNS updated for Greenhouse? [17:52:45] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f5-eqiad [17:56:26] !log disable BGP RR_CLIENT peerings on lsw1-e1-eqiad [17:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:55] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:59:35] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:00:07] hashar and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1800). 
[18:00:11] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Connect - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:00:11] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Connect - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:00:11] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:00:25] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Connect - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:00:27] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Connect - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:01:13] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:01:17] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:01:21] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:01:21] PROBLEM - BFD status on lsw1-f3-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:01:43] ^^ these are ok, bringing down those sessions should have downtimed [18:06:41] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 18 hosts with reason: changing bgp rr config [18:07:08] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 18 hosts with reason: changing bgp rr config [18:07:18] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 
(10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=60fd6a7d-c8e6-49a7-96ff-ccbed13297a2) set by cmooney@cumin1001 f... [18:08:31] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:08:39] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:08:39] RECOVERY - BFD status on lsw1-f3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:08:53] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:08:55] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:08:55] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:09:09] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:09:11] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:09:43] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: changing bgp rr config [18:10:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: changing bgp rr config [18:10:11] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence 
(ID=01394557-10ca-4b57-b8c9-c263e86708ec) set by cmooney@cumin1001 f... [18:11:29] (03PS1) 10Jbond: late_command: update puppet installation logic [puppet] - 10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) [18:13:35] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [18:15:15] (03PS2) 10Jbond: late_command: update puppet installation logic [puppet] - 10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) [18:15:18] !log brion running TimedMediaHandler requeueTranscodes.php batch jobs on mwmaint2002. expect many deletions & new file stores on swift [18:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:25] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [18:17:13] (03CR) 10Muehlenhoff: "Thanks, I'll take care of your offboarding, I'll fix up the patch myself." [puppet] - 10https://gerrit.wikimedia.org/r/964940 (owner: 10Arturo Borrero Gonzalez) [18:17:37] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [18:19:00] (03PS3) 10Jbond: late_command: update puppet installation logic [puppet] - 10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) [18:21:13] (03CR) 10Muehlenhoff: [C: 03+2] "SSH key was validated out of band as well, merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/964919 (https://phabricator.wikimedia.org/T348209) (owner: 10Muehlenhoff) [18:22:01] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [18:28:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (10MoritzMuehlenhoff) [18:31:27] (03CR) 10JHathaway: [C: 03+1] late_command: update puppet installation logic [puppet] - 10https://gerrit.wikimedia.org/r/964959 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [18:44:09] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1005 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [18:48:33] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:56:21] (03CR) 10Gehel: Support configuring the spark3 defaults with the default shuffler (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [18:57:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T343198)', diff saved to https://phabricator.wikimedia.org/P52886 and previous config saved to /var/cache/conftool/dbconfig/20231010-185730-arnaudb.json [18:57:35] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [18:58:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install 
cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) @Papaul I see cloudelasticservers in site.pp it was added by Bking previously node /^cloudelastic1... [19:12:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P52887 and previous config saved to /var/cache/conftool/dbconfig/20231010-191236-arnaudb.json [19:14:03] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5001.eqsin.wmnet with OS bookworm [19:14:13] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bookworm [19:18:39] jouncebot: nowandnext [19:18:40] For the next 0 hour(s) and 41 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T1800) [19:18:40] In 0 hour(s) and 41 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T2000) [19:19:58] (03PS4) 10Jforrester: wikifunctions: Use function-orchestrator image with better logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/962650 (https://phabricator.wikimedia.org/T346264) [19:20:01] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Use function-orchestrator image with better logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/962650 (https://phabricator.wikimedia.org/T346264) (owner: 10Jforrester) [19:21:32] (03Merged) 10jenkins-bot: wikifunctions: Use function-orchestrator image with better logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/962650 (https://phabricator.wikimedia.org/T346264) (owner: 10Jforrester) [19:22:07] (03Abandoned) 10Jforrester: wikifunctions: Drop lgeacy main evaluator [deployment-charts] - 
10https://gerrit.wikimedia.org/r/962719 (https://phabricator.wikimedia.org/T343388) (owner: 10Jforrester) [19:22:51] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:22:56] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:23:41] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:24:24] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:25:03] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [19:26:02] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [19:26:05] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [19:26:58] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [19:27:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P52888 and previous config saved to /var/cache/conftool/dbconfig/20231010-192742-arnaudb.json [19:29:05] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: changing bgp rr config [19:29:12] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: changing bgp rr config [19:29:20] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1acb901c-b161-4437-8a77-d11252fb6315) set by cmooney@cumin1001 for 2:00:00 on 6 host(s... 
[19:29:20] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 18 hosts with reason: changing bgp rr config [19:29:28] (03PS2) 10Jforrester: mathoid: update version [deployment-charts] - 10https://gerrit.wikimedia.org/r/964932 (https://phabricator.wikimedia.org/T137787) (owner: 10Physikerwelt) [19:29:32] (03CR) 10Jforrester: [C: 03+2] mathoid: update version [deployment-charts] - 10https://gerrit.wikimedia.org/r/964932 (https://phabricator.wikimedia.org/T137787) (owner: 10Physikerwelt) [19:29:36] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 18 hosts with reason: changing bgp rr config [19:29:42] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7e1738b5-8479-4892-843b-26ddc9d964ea) set by cmooney@cumin1001 for 2:00:00 on 18 host(... [19:30:26] (03Merged) 10jenkins-bot: mathoid: update version [deployment-charts] - 10https://gerrit.wikimedia.org/r/964932 (https://phabricator.wikimedia.org/T137787) (owner: 10Physikerwelt) [19:31:46] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply [19:32:08] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [19:32:29] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [19:33:03] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [19:33:07] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply [19:33:40] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [19:34:49] (03CR) 10HMonroy: [C: 03+2] diffs: add line number headings to inline diffs [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964599 (https://phabricator.wikimedia.org/T346460) (owner: 10MusikAnimal) [19:37:04] (03PS6) 
10Eevans: cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) [19:41:27] (03CR) 10Eevans: cassandra: add utility wrapper & instance symlinks for sstableutil (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [19:42:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T343198)', diff saved to https://phabricator.wikimedia.org/P52889 and previous config saved to /var/cache/conftool/dbconfig/20231010-194249-arnaudb.json [19:42:51] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [19:42:55] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [19:43:05] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [19:43:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T343198)', diff saved to https://phabricator.wikimedia.org/P52890 and previous config saved to /var/cache/conftool/dbconfig/20231010-194311-arnaudb.json [19:47:18] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T345744 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [19:48:17] (03Merged) 10jenkins-bot: diffs: add line number headings to inline diffs [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964599 (https://phabricator.wikimedia.org/T346460) (owner: 10MusikAnimal) [19:49:00] !log hmonroy@deploy2002 Started scap: Backport for [[gerrit:964599|diffs: add line number headings to inline diffs (T346460)]] [19:49:05] T346460: Confusing diff for spatially disparate changes - https://phabricator.wikimedia.org/T346460 [19:50:31] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10Sustainability 
(Incident Followup): cp3050 seemd more affected then otheres in recent incident - https://phabricator.wikimedia.org/T330682 (10BCornwall) 05Open→03Stalled [19:55:24] 10SRE, 10Cassandra, 10Data-Persistence: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231010T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:01:04] nothing in the calendar [20:01:14] hmonroy: ping me when you're done with your backport please? [20:01:58] taavi: will do [20:07:24] !log hmonroy@deploy2002 musikanimal and hmonroy: Backport for [[gerrit:964599|diffs: add line number headings to inline diffs (T346460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:07:33] T346460: Confusing diff for spatially disparate changes - https://phabricator.wikimedia.org/T346460 [20:07:58] !log hmonroy@deploy2002 musikanimal and hmonroy: Continuing with sync [20:13:45] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ncredir5001.eqsin.wmnet with OS bookworm [20:13:55] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bookworm executed with errors: - ncredir5001 (**FAIL**) - Downt... 
[20:14:04] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5001.eqsin.wmnet with OS bookworm [20:14:14] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bookworm [20:19:26] !log hmonroy@deploy2002 Finished scap: Backport for [[gerrit:964599|diffs: add line number headings to inline diffs (T346460)]] (duration: 30m 26s) [20:19:30] T346460: Confusing diff for spatially disparate changes - https://phabricator.wikimedia.org/T346460 [20:20:05] taavi: My backporting ended finally :) [20:20:38] thanks [20:24:50] (03PS1) 10Jdlrobson: Fixes Echo skin style for user message bar [skins/Vector] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964600 (https://phabricator.wikimedia.org/T348530) [20:38:53] * taavi got distracted and finally starts deploying [20:39:05] (03PS3) 10Majavah: Set READ_NEW for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963388 (https://phabricator.wikimedia.org/T242031) [20:39:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963388 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [20:39:53] (03Merged) 10jenkins-bot: Set READ_NEW for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963388 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [20:40:16] !log taavi@deploy2002 Started scap: Backport for [[gerrit:963388|Set READ_NEW for CA wikis on OATHAuth multiple devices (T242031)]] [20:40:20] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [20:40:36] 10SRE, 10Infrastructure-Foundations, 10netops: Change EPVN RR setup to use different cluster ID on each host - 
https://phabricator.wikimedia.org/T348583 (10cmooney) p:05Triage→03Low [20:41:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:41:40] !log taavi@deploy2002 taavi: Backport for [[gerrit:963388|Set READ_NEW for CA wikis on OATHAuth multiple devices (T242031)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:42:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:43:01] !log taavi@deploy2002 taavi: Continuing with sync [20:45:51] 10SRE-swift-storage, 10Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (10Don-vip) [20:48:41] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:963388|Set READ_NEW for CA wikis on OATHAuth multiple devices (T242031)]] (duration: 08m 24s) [20:48:45] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [20:48:49] * taavi done [20:51:08] taavi: a summary of status on that task (in the description) would be very helpful :) [20:51:11] or some sort of checklist of progres [21:09:21] Reedy: I am well aware of that fact, it's just that I was too lazy to make one :P but added one [21:09:27] <3 [21:25:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:25:45] (03PS1) 10Cathal Mooney: Change EVPN IBGP to a single group and use separate RR cluster IDs [homer/public] - 10https://gerrit.wikimedia.org/r/964983 
(https://phabricator.wikimedia.org/T348583) [21:27:06] (03PS2) 10Cathal Mooney: Change EVPN IBGP to a single group and use separate RR cluster IDs [homer/public] - 10https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583) [21:30:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:33:14] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncredir5001.eqsin.wmnet with OS bookworm [21:33:27] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bookworm executed with errors: - ncredir5001 (**FAIL**) - Remov... 
[21:34:47] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir5001.eqsin.wmnet with OS bookworm [21:34:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bookworm [21:37:53] (03PS1) 10Jdlrobson: Move @font-size-base into mediawiki.skin.variables.less [skins/Vector] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/964601 (https://phabricator.wikimedia.org/T348572) [21:42:44] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:35] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f6-eqiad [21:45:50] RECOVERY - OSPF status on ssw1-e1-eqiad.mgmt is OK: OSPFv2: 10/10 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:45:51] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f6-eqiad [21:46:22] RECOVERY - OSPF status on ssw1-f1-eqiad.mgmt is OK: OSPFv2: 10/10 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:53:50] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 236.99 ms [21:54:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (10cmooney) Thanks @Jclark-ctr, I can confirm things look good (including light levels and pings I've not added here). ` cmooney@ssw1-f1-eqiad> show int... 
[21:55:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Papaul) @Jclark-ctr ok then the only thing left is to change it in netbox to use the public VLAN [21:59:30] (03PS2) 10Cathal Mooney: YAML config for EVPN top-of-rack switches in new eqiad racks [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) [22:00:45] (03PS3) 10Cathal Mooney: YAML config for EVPN top-of-rack switches in new eqiad racks [homer/public] - 10https://gerrit.wikimedia.org/r/940181 (https://phabricator.wikimedia.org/T334230) [22:06:47] (03PS3) 10Cathal Mooney: Change EVPN IBGP to a single group and use separate RR cluster IDs [homer/public] - 10https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583) [22:09:50] (03PS4) 10Cathal Mooney: Change EVPN IBGP to a single group and use separate RR cluster IDs [homer/public] - 10https://gerrit.wikimedia.org/r/964983 (https://phabricator.wikimedia.org/T348583) [22:13:01] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [22:13:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [22:13:21] (03PS2) 10EoghanGaffney: [ci/firewall] Add cumin+deploy hosts to CI http allow list [puppet] - 10https://gerrit.wikimedia.org/r/964881 (https://phabricator.wikimedia.org/T340788) [22:17:32] 10SRE, 10MW-on-K8s, 10MediaWiki-Platform-Team, 10MediaWiki-extensions-CentralAuth, and 4 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - 
https://phabricator.wikimedia.org/T342201 (10matmarex) [22:20:01] 10SRE, 10MW-on-K8s, 10MediaWiki-Platform-Team, 10MediaWiki-extensions-CentralAuth, and 4 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (10matmarex) See also... [22:23:17] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (10matmarex) The Kubernetes work so far has caused problems with cross-wiki Echo notifications (see T223413, T342201). Please help resolve this before... [22:26:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) @MoritzMuehlenhoff i was getting the error above on cloudvirt1064 and wanted to drop in the virtual console to see the syslog but when i restart th... 
[22:37:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Change EPVN RR setup to use different cluster ID on each host - https://phabricator.wikimedia.org/T348583 (10cmooney) [22:38:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Change EPVN RR setup to use single BGP group and different cluster ID on every RR - https://phabricator.wikimedia.org/T348583 (10cmooney) [22:41:24] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bullseye [22:41:33] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Change EPVN RR setup to use single BGP group and different cluster ID on every RR - https://phabricator.wikimedia.org/T348583 (10cmooney) [22:41:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) @cmooney @ayounsi i check the virtual console on clouvirt1064 to see the reason i was getting the 2 above errors. it end up being the server is not... [22:45:36] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncredir5001.eqsin.wmnet with OS bookworm [22:45:47] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncredir5001.eqsin.wmnet with OS bookworm executed with errors: - ncredir5001 (**FAIL**) - Remov... [22:46:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) looking at the gerrit history about the late command i see also that there where some changes made today @jbond @Volans can you please also see if... 
[22:48:33] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:07:46] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:17:46] (Storage /var over 50%) firing: (4) Alert for device lsw1-a3-codfw.mgmt.codfw.wmnet - Storage /var over 50% got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [23:22:46] (Storage /var over 50%) firing: (10) Alert for device lsw1-a2-codfw.mgmt.codfw.wmnet - Storage /var over 50% got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [23:37:46] (Storage /var over 50%) resolved: (4) Alert for device lsw1-a3-codfw.mgmt.codfw.wmnet - Storage /var over 50% got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [23:41:32] PROBLEM - puppet last run on puppetboard2003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:44:34] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:00] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state