[00:12:32] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[00:16:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:27:02] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[00:38:37] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/963973
[00:38:43] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/963973 (owner: TrainBranchBot)
[00:42:04] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:48] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:12] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:38] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 174 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[00:52:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[00:55:06] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[00:55:32] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/963973 (owner: TrainBranchBot)
[00:56:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:16:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:23:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[01:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:38:32] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:47:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:54:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:59:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:03:32] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:18:02] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:18:16] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:00] (NodeTextfileStale) firing: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:22:59] (SwaggerProbeHasFailures) firing:
Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[03:27:59] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[03:32:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:42:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:15:12] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:20:46] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:44:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) -
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:49:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:01:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:23:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[05:45:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:50:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads -
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:17:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:22:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:29:33] (CR) Ayounsi: [C: +2] Change ganeti-test2004's role to ganeti_test [puppet] - https://gerrit.wikimedia.org/r/964036 (https://phabricator.wikimedia.org/T345602) (owner: Ayounsi)
[06:37:30] (CR) Elukey: [C: +2] team-sre: improve k8s high api latency monitor [alerts] - https://gerrit.wikimedia.org/r/964025 (owner: Elukey)
[06:38:56] (CR) Elukey: [V: +1 C: +2] icinga/nagios: remove check_ores* (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960567 (https://phabricator.wikimedia.org/T347278) (owner: Elukey)
[06:46:09] (CR) Elukey: team-ml: add alert for memory spike in inf services (4 comments) [alerts] - https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[06:48:39] (CR) Elukey: Support configuring the spark3 defaults with the default shuffler (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[06:56:34] (CR) Muehlenhoff: [C: +2] testreduce: Auto-restart parsoid-rt server/client and mariadb on failures [puppet] -
https://gerrit.wikimedia.org/r/963996 (https://phabricator.wikimedia.org/T345220) (owner: Muehlenhoff)
[07:00:06] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231009T0700)
[07:00:06] No Gerrit patches in the queue for this window AFAICS.
[07:01:51] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on db2109.codfw.wmnet with reason: investigating db2109
[07:02:05] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on db2109.codfw.wmnet with reason: investigating db2109
[07:02:17] (CR) Elukey: "Quick question to understand - what happens if I run /usr/local/bin/sstable-util-instance directly? I know it is not meant to be run, but " [puppet] - https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: Eevans)
[07:03:32] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:06:35] !log kill stuck updateSpecialPages.php process on mwmaint2002 which was trying to re-connect to an unreachable db host
[07:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:00] (PS2) Muehlenhoff: Set KbdInteractiveAuthentication/ChallengeResponseAuthentication per OS [puppet] - https://gerrit.wikimedia.org/r/961775
[07:11:02] SRE, ops-eqiad: Broken disk on ganeti1022 - https://phabricator.wikimedia.org/T348429 (MoritzMuehlenhoff)
[07:11:52] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961775 (owner: Muehlenhoff)
[07:13:08] (PS1) Slyngshede: Pull in local copy of Codex.
[software/bitu] - https://gerrit.wikimedia.org/r/964412
[07:14:34] SRE, Infrastructure-Foundations, Patch-For-Review, User-aborrero: Add support for nftables in profile::firewall - https://phabricator.wikimedia.org/T336497 (ayounsi) Small regression: iptables logs are written to disk in `/var/log/ulogd/syslog.log` to not flood the main syslog.log files. But nfta...
[07:17:54] (PS3) Muehlenhoff: Set KbdInteractiveAuthentication/ChallengeResponseAuthentication per OS [puppet] - https://gerrit.wikimedia.org/r/961775
[07:19:11] (PS2) Slyngshede: Pull in local copy of Codex. [software/bitu] - https://gerrit.wikimedia.org/r/964412
[07:19:16] (NodeTextfileStale) firing: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:19:30] (CR) Slyngshede: [V: +2 C: +2] Pull in local copy of Codex. [software/bitu] - https://gerrit.wikimedia.org/r/964412 (owner: Slyngshede)
[07:22:38] (CR) Muehlenhoff: Pull in local copy of Codex.
(1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/964412 (owner: Slyngshede)
[07:24:54] (CR) Elukey: team-ml: add alert for Kafka consumer lag for ores extension (4 comments) [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[07:27:08] (PS1) Slyngshede: Add CSS license information [software/bitu] - https://gerrit.wikimedia.org/r/964415
[07:28:08] (CR) Slyngshede: "Note that Codex is actually GPL version 2" [software/bitu] - https://gerrit.wikimedia.org/r/964415 (owner: Slyngshede)
[07:53:15] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961775 (owner: Muehlenhoff)
[08:03:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:04] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Volans) @cmooney adding a note here to not forget. We'll need to check how it will work for Ganeti VMs, in particular the makevm cookbook has a knowledge of DCs that hav...
[08:10:04] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (Volans) For the record as Giuseppe is out, I had a chat with @CDanis going over the plan and numbers and we didn't find anything worrisome or...
[08:12:43] SRE: On Wikispecies and other wikis, the data in Special:LonelyPages has not been updated since 25 September - https://phabricator.wikimedia.org/T348433 (Peachey88)
[08:15:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:37] SRE: On Wikispecies and other wikis, the data in Special:LonelyPages has not been updated since 25 September - https://phabricator.wikimedia.org/T348433 (Korg)
[08:18:30] (PS1) Cathal Mooney: Update IP to use in ECS field when making dns requests for 'esams' [cookbooks] - https://gerrit.wikimedia.org/r/964417 (https://phabricator.wikimedia.org/T344579)
[08:22:52] SRE, Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (MoritzMuehlenhoff)
[08:23:10] (CR) Btullis: Support configuring the spark3 defaults with the default shuffler (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[08:23:27] !log rebuilt bullseye d-i image for the Bullseye 11.8 point release T348327
[08:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:31] T348327: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327
[08:23:39] (CR) Volans: [C: +1] "LGTM, thx" [cookbooks] - https://gerrit.wikimedia.org/r/964417 (https://phabricator.wikimedia.org/T344579) (owner: Cathal Mooney)
[08:24:12] (CR) Cathal Mooney: [C: +2] Update IP to use in ECS field when making dns requests for 'esams' [cookbooks] - https://gerrit.wikimedia.org/r/964417 (https://phabricator.wikimedia.org/T344579) (owner: Cathal Mooney)
[08:25:24] SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[08:26:48] (Merged) jenkins-bot: Update IP to use
in ECS field when making dns requests for 'esams' [cookbooks] - https://gerrit.wikimedia.org/r/964417 (https://phabricator.wikimedia.org/T344579) (owner: Cathal Mooney)
[08:30:52] (PS5) Volans: locking: add new module for distributed locking [software/spicerack] - https://gerrit.wikimedia.org/r/938822 (https://phabricator.wikimedia.org/T341973)
[08:31:09] (PS1) Slyngshede: Implement feedback from design team. [software/bitu] - https://gerrit.wikimedia.org/r/964420
[08:39:47] SRE, serviceops-radar, Patch-For-Review: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (Clement_Goubert)
[08:41:32] SRE, DBA: Error connecting to db2109 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection refused - https://phabricator.wikimedia.org/T348419 (ABran-WMF) script has been killed by @taavi https://phabricator.wikimedia.org/T348428 host has been downtimed.
[08:47:38] (CR) Volans: [C: +2] locking: add new module for distributed locking [software/spicerack] - https://gerrit.wikimedia.org/r/938822 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[08:50:16] (CR) Btullis: [V: +1] Support configuring the spark3 defaults with the default shuffler (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[08:52:20] (Merged) jenkins-bot: locking: add new module for distributed locking [software/spicerack] - https://gerrit.wikimedia.org/r/938822 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[08:53:01] (PS5) Volans: cookbook: add --no-locks CLI argument [software/spicerack] - https://gerrit.wikimedia.org/r/938823 (https://phabricator.wikimedia.org/T341973)
[08:53:21] !log rebuilt bookworm d-i image for the Bookworm 12.2 point release T348326
[08:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:25] T348326: Integrate Bookworm 12.2 point update -
https://phabricator.wikimedia.org/T348326
[08:53:47] SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[08:55:00] (CR) Brouberol: [C: +1] Deploy multiple spark shuffler services to the test cluster [puppet] - https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[08:55:08] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host analytics1070.eqiad.wmnet
[08:55:09] (CR) Brouberol: [C: +2] Deploy multiple spark shuffler services to the test cluster [puppet] - https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[08:55:53] (CR) Muehlenhoff: [C: +1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/964415 (owner: Slyngshede)
[09:01:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics1070.eqiad.wmnet
[09:01:38] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host analytics1071.eqiad.wmnet
[09:04:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:06:01] (PS19) Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[09:07:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics1071.eqiad.wmnet
[09:07:29] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host analytics1072.eqiad.wmnet
[09:07:47] (CR) CI reject:
[V: -1] team-ml: add alert for Kafka consumer lag for ores extension [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[09:08:29] (CR) Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension (1 comment) [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[09:09:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:13:31] (CR) Volans: [C: +2] cookbook: add --no-locks CLI argument [software/spicerack] - https://gerrit.wikimedia.org/r/938823 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[09:14:04] (CR) Slyngshede: [V: +2 C: +2] Add CSS license information [software/bitu] - https://gerrit.wikimedia.org/r/964415 (owner: Slyngshede)
[09:14:26] (CR) Slyngshede: [V: +2 C: +2] Implement feedback from design team. [software/bitu] - https://gerrit.wikimedia.org/r/964420 (owner: Slyngshede)
[09:17:30] (Merged) jenkins-bot: cookbook: add --no-locks CLI argument [software/spicerack] - https://gerrit.wikimedia.org/r/938823 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[09:18:20] (CR) Hashar: "That is for the Jenkins agents in the WMCS `integration` project.
There is at least one job pushing build artifacts to Gerrit for human re" [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: Hashar)
[09:23:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[09:25:55] (CR) Ilias Sarantopoulos: "recheck" [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[09:26:46] SRE, Infrastructure-Foundations, cloud-services-team, netops: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero) This is the patch to enable the single NIC setup on ceph nodes: https://gerrit.wikimedia.org/r/c/operations/puppet/+/856675/ Is marked as abando...
[09:28:28] PROBLEM - SSH on analytics1072 is CRITICAL: connect to address 10.64.21.116 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:28:47] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro) Unfortunately, it seems that the cluster has grown in the last few days :/, as draining the last 21 osd d...
[09:29:17] (CR) JMeybohm: Pull some flink config down into the chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: Ebernhardson)
[09:33:33] SRE, DBA: Error connecting to db2109 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection refused - https://phabricator.wikimedia.org/T348419 (Ladsgroup) The host was depooled and given that the script is taking really long, it didn't update its config. I think there are two long-term questio...
[09:39:41] (CR) Filippo Giunchedi: [V: +1] prometheus: add 'cloud' instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) (owner: Filippo Giunchedi)
[09:45:23] (PS3) Samtar: .well-known: Add F-Droid signature to assetlinks.json [mediawiki-config] - https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951)
[09:46:32] PROBLEM - Host releases2003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:47:38] PROBLEM - Host kafkamon2003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:47:48] PROBLEM - Host mx2001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:47:54] PROBLEM - Host poolcounter2004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:47:57] SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[09:48:34] (ProbeDown) firing: Service releases2003:443 has failed probes (http_releases_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:48:35] SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[09:49:05] SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[09:49:44] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host analytics1072.eqiad.wmnet
[09:49:47] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host analytics1073.eqiad.wmnet
[09:52:48] jouncebot: nowandnext
[09:52:48] No deployments scheduled for the next 0 hour(s) and 7 minute(s)
[09:52:49] In 0 hour(s) and 7 minute(s): MediaWiki infrastucture (UTC mid-day)
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231009T1000)
[09:53:14] (CR) Ladsgroup: [C: +2] Set virtual domain mapping for url shortener [mediawiki-config] - https://gerrit.wikimedia.org/r/963837 (https://phabricator.wikimedia.org/T330590) (owner: Ladsgroup)
[09:53:32] (JobUnavailable) firing: (6) Reduced availability for job burrow in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:54:11] (Merged) jenkins-bot: Set virtual domain mapping for url shortener [mediawiki-config] - https://gerrit.wikimedia.org/r/963837 (https://phabricator.wikimedia.org/T330590) (owner: Ladsgroup)
[09:55:06] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:963837|Set virtual domain mapping for url shortener (T330590)]]
[09:55:15] T330590: External LBs should not be exposed to developers - https://phabricator.wikimedia.org/T330590
[09:55:48] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:56:44] Hej guys, can see status is up, any issues? I'm trying to revert an edit on Commons and getting a critical exception message
[10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231009T1000)
[10:00:59] odder: 5xx rates look good, could you share the details of the exception message?
[10:02:20] Yeah, the error message is: [7d8aad20-276e-4d31-995d-eac887244a0a] 2023-10-09 09:59:51: Krytyczny wyjątek typu „Wikimedia\Rdbms\DBUnexpectedError”
[10:03:12] Trying to revert the last IP edit from https://commons.wikimedia.org/w/index.php?title=File:Pl-Piotr_Zieli%C5%84ski.ogg&action=history
[10:03:50] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:03:56] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:963837|Set virtual domain mapping for url shortener (T330590)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:04:00] !log lucaswerkmeister-wmde@mwdebug1002:~$ sudo -u mwdeploy sh -c 'rm /srv/mediawiki/php-1.40.0-wmf.17/cache/l10n/l10n_cache-*.cdb && rmdir /srv/mediawiki/php-1.40.0-wmf.17/cache/l10n/ /srv/mediawiki/php-1.40.0-wmf.17/cache/ /srv/mediawiki/php-1.40.0-wmf.17/ # clean up old l10n cache'
[10:04:00] T330590: External LBs should not be exposed to developers - https://phabricator.wikimedia.org/T330590
[10:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:04] Seems to be https://phabricator.wikimedia.org/T348375
[10:05:02] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:05:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:05:39] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[10:05:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
[10:06:14] RECOVERY - Check systemd state on netbox1002 is OK: OK - running:
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:32] vgutierrez: Think I best add it to that bug report on Phabricator? [10:10:08] PROBLEM - SSH on analytics1073 is CRITICAL: connect to address 10.64.21.117 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:10:09] odder: indeed, I was pinging some folks that could take a look [10:10:42] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:963837|Set virtual domain mapping for url shortener (T330590)]] (duration: 15m 35s) [10:10:45] T330590: External LBs should not be exposed to developers - https://phabricator.wikimedia.org/T330590 [10:11:05] vgutierrez: Okay, will add it in a moment, thank you [10:11:13] fpm restarts mostly failed [10:11:16] https://www.irccloud.com/pastebin/EsSrlfDL/ [10:11:22] 10:10:41 php-fpm-restart: 100% (in-flight: 0; ok: 166; fail: 148; left: 0) [10:11:38] RECOVERY - SSH on analytics1073 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:11:39] ouch [10:12:20] poolcounter2004 got decommissioned on 2023-10-05 [10:12:28] (per SAL) [10:13:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics1073.eqiad.wmnet [10:13:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host analytics1074.eqiad.wmnet [10:13:42] (03PS1) 10Clément Goubert: trafficserver: move 15% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964447 (https://phabricator.wikimedia.org/T348122) [10:13:44] (03PS1) 10Clément Goubert: trafficserver: move 20% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964448 (https://phabricator.wikimedia.org/T348122) [10:13:46] (03PS1) 10Clément Goubert: trafficserver: move 25% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964449 (https://phabricator.wikimedia.org/T348122) [10:14:18] RECOVERY - Check 
unit status of netbox_ganeti_codfw_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:14:21] (JobUnavailable) firing: (6) Reduced availability for job burrow in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:14:23] hmm forget it.. orespoolcounter2004 != poolcounter2004 :) [10:15:17] poolcounter2004 looks in a bad way [10:15:36] RECOVERY - Host kafkamon2003 is UP: PING OK - Packet loss = 0%, RTA = 31.84 ms [10:16:02] RECOVERY - Host mx2001 is UP: PING OK - Packet loss = 0%, RTA = 31.91 ms [10:16:36] RECOVERY - Host releases2003 is UP: PING OK - Packet loss = 0%, RTA = 31.95 ms [10:17:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [10:17:17] odder: that's https://phabricator.wikimedia.org/T348375#9233921 [10:17:41] I can't reach poolcounter2004 directly or via mgmt interface [10:17:57] Amir1_: Yes, I'm adding a comment there now with my error ID [10:18:37] (JobUnavailable) firing: (6) Reduced availability for job burrow in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:18:43] claime: it should be back in ~1 min [10:18:52] (ProbeDown) resolved: Service releases2003:443 has failed probes (http_releases_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:19:34] RECOVERY - Host poolcounter2004 is UP: PING OK - Packet loss = 0%, RTA = 31.70 ms [10:20:01] moritzm: :? 
[10:20:20] got bitten by T273026, I fixed up the interface name over the "serial" console in Ganeti [10:20:21] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [10:20:42] so a timing issue [10:20:56] between instance maintenance and the train [10:21:10] and poolcounter2004 ran on ganeti2031, which deadlocked with a DRBD kernel error [10:22:24] Thanks for your help Amir1_ & vgutierrez - left my comment with the error hash there [10:22:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics1074.eqiad.wmnet [10:22:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host analytics1075.eqiad.wmnet [10:23:33] (JobUnavailable) firing: (6) Reduced availability for job burrow in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:32] RECOVERY - SSH on analytics1072 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:24:57] I made a patch that fixes it [10:25:31] Yeah, I saw, thanks :-) [10:26:50] 10SRE, 10DBA: Error connecting to db2109 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection refused - https://phabricator.wikimedia.org/T348419 (10Ladsgroup) I got my answer in T348428 sigh [10:29:01] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics1075.eqiad.wmnet [10:29:04] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host analytics1076.eqiad.wmnet [10:29:16] !log installing Linux 6.1.55 on Bookworm hosts [10:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics1076.eqiad.wmnet [10:34:50] !log 
btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host analytics1077.eqiad.wmnet [10:39:20] (03PS1) 10Elukey: ml-services: add prometheus annotations for revscoring goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/964456 [10:39:28] (03PS1) 10Clément Goubert: mw-api-ext, mw-web: Raise replicas 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/964457 (https://phabricator.wikimedia.org/T348122) [10:40:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics1077.eqiad.wmnet [10:47:29] (03CR) 10Elukey: [C: 03+2] ml-services: add prometheus annotations for revscoring goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/964456 (owner: 10Elukey) [10:48:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1078.eqiad.wmnet [10:48:41] (03CR) 10Klausman: [C: 03+1] ml-services: add prometheus annotations for revscoring goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/964456 (owner: 10Elukey) [10:50:21] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[10:52:27] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: add prometheus annotations for revscoring goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/964456 (owner: 10Elukey) [10:53:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1078.eqiad.wmnet [10:53:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [10:53:57] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1079.eqiad.wmnet [10:58:52] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [10:59:13] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [10:59:16] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [11:00:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1079.eqiad.wmnet [11:00:26] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1080.eqiad.wmnet [11:01:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:04:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [11:04:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [11:06:18] (KubernetesAPILatency) resolved: 
High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:06:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1080.eqiad.wmnet [11:06:54] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1081.eqiad.wmnet [11:12:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1081.eqiad.wmnet [11:12:31] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1082.eqiad.wmnet [11:13:24] (03PS1) 10Volans: CHANGELOG: add changelogs for release v7.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/964464 [11:18:17] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1082.eqiad.wmnet [11:18:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1083.eqiad.wmnet [11:19:07] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v7.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/964464 (owner: 10Volans) [11:23:05] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v7.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/964464 (owner: 10Volans) [11:23:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [11:24:01] (NodeTextfileStale) firing: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:26:32] 
(03CR) 10Muehlenhoff: late_command.sh: Add logic to rerad puppet version from config-master (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [11:26:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1083.eqiad.wmnet [11:26:43] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1084.eqiad.wmnet [11:31:51] (03PS2) 10Muehlenhoff: Mark mediawiki-testers as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/960546 (https://phabricator.wikimedia.org/T276465) [11:32:17] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1084.eqiad.wmnet [11:32:20] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1085.eqiad.wmnet [11:34:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/960546 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [11:35:43] (03Abandoned) 10Muehlenhoff: scap:ferm: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff) [11:36:30] (03PS2) 10Muehlenhoff: openldap: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945554 [11:36:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945554 (owner: 10Muehlenhoff) [11:47:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1085.eqiad.wmnet [11:47:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1086.eqiad.wmnet [11:47:22] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [11:47:35] (03PS4) 10Jbond: late_command.sh: Add logic to rerad puppet version from config-master [puppet] - 10https://gerrit.wikimedia.org/r/964014 
(https://phabricator.wikimedia.org/T348319) [11:50:09] (03CR) 10Muehlenhoff: [C: 03+2] openldap: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945554 (owner: 10Muehlenhoff) [11:51:26] !log restart k8s-aux in eqiad to pick up new certs - T343529 [11:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:29] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529 [11:53:30] (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:54:29] (03PS1) 10Filippo Giunchedi: netops: remove pingoffload alert from esams [alerts] - 10https://gerrit.wikimedia.org/r/964521 (https://phabricator.wikimedia.org/T345743) [11:54:47] (03CR) 10Majavah: [C: 04-1] "This is a good start, we can go from here. Couple of things inline. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [11:56:02] (03CR) 10Filippo Giunchedi: [C: 03+2] netops: remove pingoffload alert from esams [alerts] - 10https://gerrit.wikimedia.org/r/964521 (https://phabricator.wikimedia.org/T345743) (owner: 10Filippo Giunchedi) [11:57:54] (03PS1) 10Muehlenhoff: debmonitor: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/964522 [11:59:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/964522 (owner: 10Muehlenhoff) [12:03:59] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1086.eqiad.wmnet [12:04:02] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1087.eqiad.wmnet [12:04:36] (03PS1) 10EoghanGaffney: [gitlab/failover] Handle runner pausing exceptions [cookbooks] - 10https://gerrit.wikimedia.org/r/964523 [12:05:44] (03PS3) 10Filippo Giunchedi: prometheus: add 'cloud' instance [puppet] - 10https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) [12:06:03] (03CR) 10Filippo Giunchedi: prometheus: add 'cloud' instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [12:06:53] (03CR) 10CI reject: [V: 04-1] [gitlab/failover] Handle runner pausing exceptions [cookbooks] - 10https://gerrit.wikimedia.org/r/964523 (owner: 10EoghanGaffney) [12:07:27] (03PS5) 10Jbond: late_command.sh: Add logic to rerad puppet version from config-master [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) [12:07:43] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/964522 (owner: 10Muehlenhoff) [12:09:49] (03PS7) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 
(https://phabricator.wikimedia.org/T348319) [12:10:14] (03CR) 10Muehlenhoff: late_command.sh: Add logic to rerad puppet version from config-master (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:10:29] (03CR) 10Majavah: [C: 03+1] "let's give this a try" [puppet] - 10https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [12:10:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1087.eqiad.wmnet [12:10:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1088.eqiad.wmnet [12:10:47] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/964522 (owner: 10Muehlenhoff) [12:16:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1088.eqiad.wmnet [12:16:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1089.eqiad.wmnet [12:20:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:20:46] (03PS6) 10Jbond: late_command.sh: Add logic to rerad puppet version from config-master [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) [12:20:48] (03PS1) 10Jbond: install_server: add directory for host metadata [puppet] - 10https://gerrit.wikimedia.org/r/964524 (https://phabricator.wikimedia.org/T348319) [12:22:39] 10SRE, 10Cloud-Services, 10User-aborrero: cloudservices1006 using 10. 
address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10MoritzMuehlenhoff) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.or... [12:23:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1089.eqiad.wmnet [12:23:09] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1090.eqiad.wmnet [12:23:28] (03CR) 10CI reject: [V: 04-1] install_server: add directory for host metadata [puppet] - 10https://gerrit.wikimedia.org/r/964524 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:24:38] 10SRE, 10Cloud-VPS, 10cloud-services-team, 10User-aborrero: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10taavi) [12:25:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:26:06] (03PS2) 10Jbond: install_server: add directory for host metadata [puppet] - 10https://gerrit.wikimedia.org/r/964524 (https://phabricator.wikimedia.org/T348319) [12:26:08] (03PS7) 10Jbond: late_command.sh: Add logic to rerad puppet version from config-master [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) [12:26:24] (03CR) 10Jbond: late_command.sh: Add logic to rerad puppet version from config-master (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:28:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1090.eqiad.wmnet [12:28:46] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1091.eqiad.wmnet [12:34:52] (03PS1) 10Volans: Upstream release v7.4.0 
[software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/964525 [12:35:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1091.eqiad.wmnet [12:35:14] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1092.eqiad.wmnet [12:36:17] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add 'cloud' instance [puppet] - 10https://gerrit.wikimedia.org/r/963987 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [12:36:30] (03CR) 10Volans: [C: 03+1] "Looks sane to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [12:39:02] 10SRE, 10Infrastructure-Foundations, 10netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) p:05Triage→03Low [12:39:35] (03CR) 10Jbond: [C: 03+2] sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [12:40:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:40:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1092.eqiad.wmnet [12:40:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1093.eqiad.wmnet [12:41:16] (03CR) 10Volans: "Question about the plan inline" [puppet] - 10https://gerrit.wikimedia.org/r/964524 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:42:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks, but we need a Puppet patch to create the metadata directory in C:install_server::preseed_server" [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:42:17] (03Merged) 10jenkins-bot: 
sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [12:42:52] (03CR) 10Filippo Giunchedi: [C: 04-1] webperf: Move navtiming logs to statsd-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791) (owner: 10Andrea Denisse) [12:44:00] (03CR) 10Muehlenhoff: [C: 03+2] Mark mediawiki-testers as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/960546 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [12:45:26] (03CR) 10Jbond: sre.hosts.reimage: update to support puppetserver (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:46:15] (03CR) 10Muehlenhoff: [C: 03+2] Set KbdInteractiveAuthentication/ChallengeResponseAuthentication per OS [puppet] - 10https://gerrit.wikimedia.org/r/961775 (owner: 10Muehlenhoff) [12:46:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1093.eqiad.wmnet [12:46:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1094.eqiad.wmnet [12:46:42] (03CR) 10Jbond: install_server: add directory for host metadata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964524 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:46:49] (03CR) 10Volans: late_command.sh: Add logic to rerad puppet version from config-master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:48:41] (03CR) 10Jbond: sre.hosts.reimage: update to support puppetserver (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:49:54] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:51:05] (03CR) 10Volans: sre.hosts.reimage: update to support puppetserver (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:51:17] 10SRE, 10Infrastructure-Foundations, 10netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) [12:51:57] 10SRE, 10Infrastructure-Foundations, 10netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) [12:52:06] (03PS2) 10Volans: Upstream release v7.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/964525 [12:52:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1094.eqiad.wmnet [12:52:38] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1095.eqiad.wmnet [12:52:48] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:53:10] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:53:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:56:03] (03CR) 10Jbond: sre.hosts.reimage: update to support puppetserver (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:57:20] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/964525 (owner: 10Volans) [12:57:52] (03CR) 10Volans: [C: 03+2] Upstream 
release v7.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/964525 (owner: 10Volans) [12:58:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1095.eqiad.wmnet [12:58:25] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1097.eqiad.wmnet [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231009T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:19] (03PS2) 10Majavah: Set READ_NEW for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963388 (https://phabricator.wikimedia.org/T242031) [13:00:31] I'll deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/963388 [13:00:34] ok [13:00:44] I’d like to test something on mwdebug afterwards, but it’s not urgent :) [13:00:48] thank you jouncebot. i'm a bit surprised though. https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar claims no deploys should be done on Oct 09? [13:00:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963388 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [13:00:57] * Lucas_WMDE looks [13:01:00] does it? [13:01:04] * taavi removes the +2 [13:01:04] it does [13:01:11] hm, indeed it does [13:01:15] a-ha [13:01:19] but we did bunch of deploys anyway, and calendar isn't removed. so...maybe releng changed it. [13:01:27] didn’t we have this earlier this year already? US holiday vs. global holiday? [13:01:56] I feel like I've complained before about the calendar not showing no-deploy days.. 
[13:02:19] i have a bash quote on that, even: https://bash.toolforge.org/quip/EjdsfngB1jz_IcWuk0b- [13:02:26] I think https://wm-bot.wmcloud.org/browser/index.php?start=06%2F19%2F2023&end=06%2F19%2F2023&display=%23wikimedia-operations is the one I remembered [13:02:41] urbanecm: lol [13:02:57] (03Merged) 10jenkins-bot: Upstream release v7.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/964525 (owner: 10Volans) [13:05:13] doesn’t look like there’s a proper phab task for automatically adding holidays as non-deploy days to the calendar, closest I found is this comment https://phabricator.wikimedia.org/T293101#7423699 [13:05:34] i'll file one [13:05:58] ty [13:06:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1097.eqiad.wmnet [13:06:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1098.eqiad.wmnet [13:06:47] thanks [13:07:32] T348447 [13:07:33] T348447: Deployment calendar shows deployments windows on no-deploy days - https://phabricator.wikimedia.org/T348447 [13:08:37] (03CR) 10Muehlenhoff: [C: 03+1] late_command.sh: Add logic to rerad puppet version from config-master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [13:09:31] taavi: i LOVE your steps to reproduce. [13:10:02] :D [13:10:24] the browser version is also a very important thing in the template that I made sure to fill out [13:10:31] yes, I thought so too [13:10:55] (03CR) 10Muehlenhoff: [C: 03+2] debmonitor: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/964522 (owner: 10Muehlenhoff) [13:11:08] (and i have US holidays imported into my work calendars for a reason :D) [13:11:37] I was going to test https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/960564/ on an mwdebug server and see how it behaves – do you think that’s acceptable on a no-deploy day? 
[13:11:38] (03PS20) 10Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:12:30] (03PS1) 10Filippo Giunchedi: prometheus: use openstack_exporter_host in prometheus cloud [puppet] - 10https://gerrit.wikimedia.org/r/964530 (https://phabricator.wikimedia.org/T336854) [13:13:41] (03CR) 10Volans: sre.hosts.reimage: update to support puppetserver (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [13:15:05] (03CR) 10Ilias Sarantopoulos: "Fixed!" [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:16:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1098.eqiad.wmnet [13:16:15] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1099.eqiad.wmnet [13:16:27] (03PS1) 10Muehlenhoff: visualdiff: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/964531 [13:17:05] (03CR) 10Volans: late_command.sh: Add logic to rerad puppet version from config-master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964014 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [13:17:31] ok, I’ll try my change on mwdebug unless someone tells me not to – I think it should be safe enough [13:20:13] (03CR) 10Majavah: [C: 03+1] prometheus: use openstack_exporter_host in prometheus cloud [puppet] - 10https://gerrit.wikimedia.org/r/964530 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [13:20:28] * Lucas_WMDE testing on mwdebug2002 [13:23:00] (03CR) 10Volans: sre.hosts.reimage: update to support puppetserver (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [13:23:06] Lucas_WMDE: +1 please do [13:23:59] 
!log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1099.eqiad.wmnet [13:24:02] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1100.eqiad.wmnet [13:26:20] Lucas_WMDE: and maybe that can be hidden behind a feature flag? [13:26:32] * Lucas_WMDE looks up how to restart php-fpm again [13:27:22] hmph [13:27:33] neither `sudo -u mwdeploy restart-php7.4-fpm` nor `sudo restart-php7.4-fpm`, apparently… [13:27:57] claime: ^ any clue how one can restart php fpm? :) [13:28:37] Lucas_WMDE: try `sudo -iu mwdeploy restart-php7.4-fpm`? [13:28:41] (the one with -u mwdeploy spit out some permission errors and possibly left the host in a bad state, but presumably nothing I couldn’t fix with `scap pull` – I just don’t want to nuke my changes already ^^) [13:28:43] /usr/local/sbin/restart-php7.4-fpm [13:28:55] or `sudo -i restart-php7.4-fpm`, actually [13:29:00] Yeah, sudo -i [13:29:12] <3 [13:29:26] hm, same error still [13:29:36] what about with the full path? 
[13:29:56] `sudo -i /usr/local/sbin/restart-php7.4-fpm` [13:30:00] that was the right combination apparently [13:30:04] thanks all [13:30:25] too many *bin/ directories [13:30:54] Yeah the sudo rule is on the full path and the script needs to be run as a login shell because of some environment shenanigans [13:30:57] (CR) Filippo Giunchedi: [C: +2] prometheus: use openstack_exporter_host in prometheus cloud [puppet] - https://gerrit.wikimedia.org/r/964530 (https://phabricator.wikimedia.org/T336854) (owner: Filippo Giunchedi) [13:31:10] modules/admin/data/data.yaml L165 [13:32:07] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1100.eqiad.wmnet [13:32:10] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1101.eqiad.wmnet [13:32:11] (PS1) Elukey: team-sre: make KubernetesAPILatency more lenient [alerts] - https://gerrit.wikimedia.org/r/964534 [13:33:24] (CR) CI reject: [V: -1] team-sre: make KubernetesAPILatency more lenient [alerts] - https://gerrit.wikimedia.org/r/964534 (owner: Elukey) [13:35:32] okay, test successful, the parser cache seems to behave as described in the commit message in production too [13:35:44] !log uploaded spicerack_7.4.0 to apt.wikimedia.org bullseye-wikimedia [13:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:47] I still don’t understand *why* but I think that’s probably good enough to try it in production [13:36:02] `scap pull`ed on mwdebug2002 [13:36:24] and my changes seem to be gone from the files too [13:36:28] so I think that’s it :) [13:36:28] * Lucas_WMDE done [13:40:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1101.eqiad.wmnet [13:40:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1102.eqiad.wmnet [13:40:28] claime: I added the command to 
https://wikitech.wikimedia.org/w/index.php?title=Service_restarts&diff=prev&oldid=2118397, I hope that’s okay [13:40:42] (https://wikitech.wikimedia.org/wiki/Application_servers/Runbook also has it tucked away somewhere but that part didn’t feel very linkable to me) [13:40:45] Lucas_WMDE: congratulations :) [13:41:50] Lucas_WMDE: Sure, thanks! [13:43:01] (PS1) Muehlenhoff: Update name for ignore lintian tag [puppet] - https://gerrit.wikimedia.org/r/964539 [13:43:03] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1001.eqiad.wmnet [13:43:03] (PS2) Elukey: team-sre: make KubernetesAPILatency more lenient [alerts] - https://gerrit.wikimedia.org/r/964534 [13:44:10] (CR) Volans: [C: +1] "LGTM, thx" [puppet] - https://gerrit.wikimedia.org/r/964539 (owner: Muehlenhoff) [13:44:41] (CR) Elukey: [C: -1] "Need to work a bit more on it" [alerts] - https://gerrit.wikimedia.org/r/964534 (owner: Elukey) [13:44:43] (PS1) Majavah: P:prometheus::cloud: make openstack deployment a parameter [puppet] - https://gerrit.wikimedia.org/r/964540 (https://phabricator.wikimedia.org/T336854) [13:45:10] (CR) CI reject: [V: -1] P:prometheus::cloud: make openstack deployment a parameter [puppet] - https://gerrit.wikimedia.org/r/964540 (https://phabricator.wikimedia.org/T336854) (owner: Majavah) [13:46:00] (PS2) Majavah: P:prometheus::cloud: make openstack deployment a parameter [puppet] - https://gerrit.wikimedia.org/r/964540 (https://phabricator.wikimedia.org/T336854) [13:46:04] (CR) Jelto: [C: -1] "the gerrit ssh key is hardcoded in profile::zuul:merger and profile::gerrit already:" [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: Hashar) [13:46:19] (CR) CI reject: [V: -1] P:prometheus::cloud: make openstack deployment a parameter [puppet] - https://gerrit.wikimedia.org/r/964540 
(https://phabricator.wikimedia.org/T336854) (owner: Majavah) [13:46:39] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1001.eqiad.wmnet [13:46:43] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1001.eqiad.wmnet [13:46:50] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1001.eqiad.wmnet [13:46:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1102.eqiad.wmnet [13:46:54] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1103.eqiad.wmnet [13:46:58] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1001.eqiad.wmnet [13:47:35] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1001.eqiad.wmnet [13:47:47] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1001.eqiad.wmnet [13:48:25] (PS3) Majavah: P:prometheus::cloud: make openstack deployment a parameter [puppet] - https://gerrit.wikimedia.org/r/964540 (https://phabricator.wikimedia.org/T336854) [13:48:30] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1001.eqiad.wmnet [13:48:33] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1001.eqiad.wmnet [13:48:55] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1001.eqiad.wmnet [13:51:33] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43963/console" [puppet] - https://gerrit.wikimedia.org/r/964540 
(https://phabricator.wikimedia.org/T336854) (owner: Majavah) [13:52:14] (CR) Muehlenhoff: [C: +2] Update name for ignore lintian tag [puppet] - https://gerrit.wikimedia.org/r/964539 (owner: Muehlenhoff) [13:54:57] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1103.eqiad.wmnet [13:54:58] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:54:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1104.eqiad.wmnet [13:55:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [13:55:46] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [13:57:58] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:58:18] (CR) Cathal Mooney: Interface automation: skip import of existing int IPs and VIPs (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774) (owner: Cathal Mooney) [13:58:25] (CR) Muehlenhoff: [C: +1] sre.hosts.reimage: update to support puppetserver (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: Jbond) [13:58:44] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
an-worker1104.eqiad.wmnet [14:02:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1105.eqiad.wmnet [14:05:17] (03PS1) 10Clément Goubert: team-sre/mediawiki: Raise parsoid alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/964542 (https://phabricator.wikimedia.org/T348231) [14:08:52] (03CR) 10Jelto: [V: 03+1] "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/960034 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [14:10:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1105.eqiad.wmnet [14:10:38] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1106.eqiad.wmnet [14:17:50] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (10cmooney) @Jclark-ctr let me know when you have time to look at this now that the optics have been received. 
thanks :) [14:17:59] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1106.eqiad.wmnet [14:18:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1107.eqiad.wmnet [14:24:21] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:25:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1107.eqiad.wmnet [14:25:29] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1108.eqiad.wmnet [14:25:56] (03CR) 10Hashar: ci: add Gerrit ssh key to ssh_known_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: 10Hashar) [14:26:38] (03CR) 10Subramanya Sastry: [C: 03+1] visualdiff: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/964531 (owner: 10Muehlenhoff) [14:29:42] (03CR) 10Filippo Giunchedi: "Idea LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/964540 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [14:29:59] (03PS1) 10Dreamy Jazz: Enable display of Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) [14:30:30] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10lmata) [14:30:40] (03CR) 10CI reject: [V: 04-1] Enable display of Client Hints data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [14:31:58] (03CR) 10AOkoth: [C: 03+1] gitlab: install warning banner only on replicas when 
doing a restore [puppet] - 10https://gerrit.wikimedia.org/r/964003 (https://phabricator.wikimedia.org/T345531) (owner: 10Jelto) [14:32:01] (03CR) 10Dreamy Jazz: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964545 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [14:32:14] (03CR) 10Filippo Giunchedi: "LGTM, let's give it a try" [alerts] - 10https://gerrit.wikimedia.org/r/964542 (https://phabricator.wikimedia.org/T348231) (owner: 10Clément Goubert) [14:32:21] (03CR) 10Filippo Giunchedi: [C: 03+1] team-sre/mediawiki: Raise parsoid alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/964542 (https://phabricator.wikimedia.org/T348231) (owner: 10Clément Goubert) [14:32:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1108.eqiad.wmnet [14:32:40] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1109.eqiad.wmnet [14:33:46] (03CR) 10Clément Goubert: [C: 03+2] team-sre/mediawiki: Raise parsoid alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/964542 (https://phabricator.wikimedia.org/T348231) (owner: 10Clément Goubert) [14:34:59] (03Merged) 10jenkins-bot: team-sre/mediawiki: Raise parsoid alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/964542 (https://phabricator.wikimedia.org/T348231) (owner: 10Clément Goubert) [14:38:32] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1109.eqiad.wmnet [14:40:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1110.eqiad.wmnet [14:45:01] (03CR) 10Jelto: [C: 03+2] gitlab: install warning banner only on replicas 
when doing a restore [puppet] - 10https://gerrit.wikimedia.org/r/964003 (https://phabricator.wikimedia.org/T345531) (owner: 10Jelto) [14:46:55] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10Jelto) [14:47:15] 10SRE, 10MW-on-K8s, 10MediaWiki-Platform-Team, 10MediaWiki-extensions-CentralAuth, and 4 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (10Krinkle) p:05Tria... [14:47:45] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1110.eqiad.wmnet [14:47:47] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1111.eqiad.wmnet [14:48:32] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:51:35] (03PS1) 10Elukey: profile::prometheus::k8s: add k8s-pods-kserve-metrics config [puppet] - 10https://gerrit.wikimedia.org/r/964551 [14:52:46] (03PS2) 10Elukey: profile::prometheus::k8s: add k8s-pods-kserve-metrics config [puppet] - 10https://gerrit.wikimedia.org/r/964551 [14:55:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1111.eqiad.wmnet [14:55:21] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1112.eqiad.wmnet [14:57:47] (03PS3) 10Elukey: profile::prometheus::k8s: add k8s-pods-kserve-metrics config [puppet] - 10https://gerrit.wikimedia.org/r/964551 [14:58:18] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:58:26] (03PS3) 10Cathal Mooney: Interface automation: skip import of existing int IPs and VIPs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774) [14:59:06] (03CR) 10Cathal Mooney: Interface automation: skip import of existing int IPs and VIPs (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774) (owner: 10Cathal Mooney) [15:00:06] (03PS4) 10Majavah: P:prometheus::cloud: make openstack deployment a parameter [puppet] - 10https://gerrit.wikimedia.org/r/964540 (https://phabricator.wikimedia.org/T336854) [15:00:44] (03CR) 10Elukey: "This change is ready for review." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/963683 (owner: 10Klausman) [15:01:24] (03CR) 10Elukey: "of course I added a comment and it is not wip anymore, sorry :(" [puppet] - 10https://gerrit.wikimedia.org/r/963683 (owner: 10Klausman) [15:01:49] (03CR) 10Majavah: P:prometheus::cloud: make openstack deployment a parameter (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964540 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [15:02:34] (03PS4) 10Elukey: profile::prometheus::k8s: add k8s-pods-kserve config [puppet] - 10https://gerrit.wikimedia.org/r/964551 [15:02:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1112.eqiad.wmnet [15:02:51] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1113.eqiad.wmnet [15:03:18] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:03:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nicely done!" [puppet] - 10https://gerrit.wikimedia.org/r/964540 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [15:05:36] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43966/console" [puppet] - 10https://gerrit.wikimedia.org/r/964540 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [15:07:45] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43967/console" [puppet] - 10https://gerrit.wikimedia.org/r/964551 (owner: 10Elukey) [15:08:13] !log installing nftables bugfix updates from Bookworm point release [15:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:38] !log installed spicerack 7.4.0 to cumin2002 [15:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:48] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:12:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1113.eqiad.wmnet [15:12:34] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1114.eqiad.wmnet [15:13:20] (03PS1) 10Muehlenhoff: Add library hint for nftables [puppet] - 10https://gerrit.wikimedia.org/r/964555 [15:13:39] 10SRE, 10Infrastructure-Foundations, 
10Patch-For-Review: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [15:14:48] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:17:03] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for nftables [puppet] - 10https://gerrit.wikimedia.org/r/964555 (owner: 10Muehlenhoff) [15:17:19] (03CR) 10Muehlenhoff: [C: 03+2] visualdiff: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/964531 (owner: 10Muehlenhoff) [15:18:16] 10sre-alert-triage: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (10Aklapper) > Please also tag the alert with your team name if not already done. cron/systemd set up originally in rOPUP04312588dddea2d53b79c2ad476... [15:19:57] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1114.eqiad.wmnet [15:20:00] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1115.eqiad.wmnet [15:21:00] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:08] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (10thcipriani) Approved! Reason for access makes sense. [15:21:32] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10thcipriani) Approved! 
Reason for access makes sense. [15:24:16] (NodeTextfileStale) firing: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:26:34] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:26:58] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1115.eqiad.wmnet [15:27:50] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1116.eqiad.wmnet [15:28:28] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:29:34] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:30:05] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231009T1530). 
[15:30:55] (03PS5) 10Elukey: profile::prometheus::k8s: add k8s-pods-kserve config [puppet] - 10https://gerrit.wikimedia.org/r/964551 (https://phabricator.wikimedia.org/T348456) [15:31:38] !log installing qemu security updates on bookworm [15:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1116.eqiad.wmnet [15:34:29] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1117.eqiad.wmnet [15:37:46] 10ops-eqiad, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{1001..1009}.eqiad.wmnet - https://phabricator.wikimedia.org/T348144 (10elukey) [15:38:40] 10ops-codfw, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{2001..2009}.codfw.wmnet - https://phabricator.wikimedia.org/T348462 (10elukey) [15:40:02] 10ops-codfw, 10Machine-Learning-Team, 10decommission-hardware: decommission ores{2001..2009}.codfw.wmnet - https://phabricator.wikimedia.org/T348462 (10elukey) The decom cookbook was run as part of T348144 [15:40:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:41:57] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1117.eqiad.wmnet [15:42:00] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1118.eqiad.wmnet [15:45:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:49:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1118.eqiad.wmnet [15:49:24] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1119.eqiad.wmnet [15:49:26] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:50:36] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:52:02] 10SRE-Sprint-Week-Sustainability-March2023, 10Znuny, 10collaboration-services, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10LSobanski) 05Open→03Resolved [15:52:04] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:52:24] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:55:50] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1119.eqiad.wmnet [15:55:52] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1120.eqiad.wmnet [15:58:51] (03PS1) 10Subramanya Sastry: parsoid-rt-client: Increase worker pool to 20 clients [puppet] - 10https://gerrit.wikimedia.org/r/964560 (https://phabricator.wikimedia.org/T345220) [16:03:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1120.eqiad.wmnet [16:03:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
an-worker1121.eqiad.wmnet [16:04:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.797335169667635s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:09:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.1653050097578785s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:11:08] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:11:13] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:11:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1121.eqiad.wmnet [16:11:43] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1122.eqiad.wmnet [16:16:02] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 130 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [16:17:52] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1122.eqiad.wmnet [16:18:58] !log btullis@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host an-worker1123.eqiad.wmnet [16:21:30] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [16:26:09] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1123.eqiad.wmnet [16:26:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1124.eqiad.wmnet [16:26:30] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:26:42] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:27:36] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2023/2024-Q2), 10User-fgiunchedi: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10lmata) [16:27:43] 10SRE, 10serviceops-radar, 10SRE Observability (FY2023/2024-Q2), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10lmata) [16:28:15] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2023/2024-Q2), 10Sustainability (Incident Followup): Alert when no data is received from Prometheus in a certain amount of time - https://phabricator.wikimedia.org/T336448 (10lmata) [16:29:30] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:29:40] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:32:33] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1124.eqiad.wmnet [16:32:35] !log btullis@cumin1001 
START - Cookbook sre.hosts.reboot-single for host an-worker1125.eqiad.wmnet [16:34:52] (03PS1) 10Ebernhardson: admin: Add cirrus-streaming-updater namespace to flink operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/964567 (https://phabricator.wikimedia.org/T347075) [16:38:25] (03CR) 10DCausse: [C: 03+1] admin: Add cirrus-streaming-updater namespace to flink operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/964567 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [16:39:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1125.eqiad.wmnet [16:40:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1126.eqiad.wmnet [16:41:24] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:prometheus::cloud: make openstack deployment a parameter [puppet] - 10https://gerrit.wikimedia.org/r/964540 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [16:47:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1126.eqiad.wmnet [16:47:29] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1127.eqiad.wmnet [16:56:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1127.eqiad.wmnet [16:56:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1128.eqiad.wmnet [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231009T1700) [17:00:04] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231009T1700). 
[17:03:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1128.eqiad.wmnet [17:04:00] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1129.eqiad.wmnet [17:05:06] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:05:38] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:11:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1129.eqiad.wmnet [17:11:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1130.eqiad.wmnet [17:13:14] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964570 [17:14:33] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964570 (owner: 10Ladsgroup) [17:14:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964570 (owner: 10Ladsgroup) [17:15:12] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/964570 (owner: 10Ladsgroup) [17:15:29] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:964570|Update interwiki cache]] [17:20:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1130.eqiad.wmnet [17:20:43] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1131.eqiad.wmnet [17:21:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:24:03] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:964570|Update interwiki cache]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:25:32] (CR) Muehlenhoff: [C: +2] parsoid-rt-client: Increase worker pool to 20 clients [puppet] - https://gerrit.wikimedia.org/r/964560 (https://phabricator.wikimedia.org/T345220) (owner: Subramanya Sastry) [17:26:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1131.eqiad.wmnet [17:27:57] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1132.eqiad.wmnet [17:35:00] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1132.eqiad.wmnet [17:35:03] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1133.eqiad.wmnet [17:40:56] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:41:30] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:41:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host an-worker1133.eqiad.wmnet [17:42:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1134.eqiad.wmnet [17:49:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1134.eqiad.wmnet [17:49:39] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1135.eqiad.wmnet [17:58:46] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1135.eqiad.wmnet [17:58:49] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1136.eqiad.wmnet [18:08:01] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1136.eqiad.wmnet [18:08:04] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1137.eqiad.wmnet [18:15:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1137.eqiad.wmnet [18:15:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1138.eqiad.wmnet [18:24:56] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1138.eqiad.wmnet [18:24:58] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1139.eqiad.wmnet [18:32:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1139.eqiad.wmnet [18:32:13] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1140.eqiad.wmnet [18:35:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:20] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@c334eaf]: (no justification provided) [18:36:33] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@c334eaf]: (no 
justification provided) (duration: 01m 12s) [18:39:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1140.eqiad.wmnet [18:39:31] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1141.eqiad.wmnet [18:45:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1141.eqiad.wmnet [18:46:40] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1142.eqiad.wmnet [18:48:33] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:49:02] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [18:51:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 37.5% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:54:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1142.eqiad.wmnet [18:54:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1143.eqiad.wmnet [18:55:37] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:964570|Update interwiki cache]] (duration: 100m 07s) [18:56:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.09% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:56:54] > (duration: 100m 07s) [18:56:54] That's me being an idiot and forgetting I'm at middle of deploy and did other work and it was stuck at mwdebug for 
at least half an hour [18:57:50] PROBLEM - php7.4-fpm service on mw1371 is CRITICAL: CRITICAL - Expecting active but unit php7.4-fpm is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:58:36] PROBLEM - Check systemd state on mw1371 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1143.eqiad.wmnet [19:01:43] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1144.eqiad.wmnet [19:08:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1144.eqiad.wmnet [19:08:42] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1145.eqiad.wmnet [19:16:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1145.eqiad.wmnet [19:16:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1146.eqiad.wmnet [19:23:56] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1146.eqiad.wmnet [19:23:58] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1147.eqiad.wmnet [19:24:16] (NodeTextfileStale) firing: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:27:32] RECOVERY - php7.4-fpm service on mw1371 is OK: OK - php7.4-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:28:18] RECOVERY - Check systemd state on mw1371 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:59] 
!log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [19:32:13] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [19:32:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T343198)', diff saved to https://phabricator.wikimedia.org/P52868 and previous config saved to /var/cache/conftool/dbconfig/20231009-193219-arnaudb.json [19:32:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1147.eqiad.wmnet [19:32:23] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1148.eqiad.wmnet [19:32:26] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [19:34:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1148.eqiad.wmnet [19:40:09] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1149.eqiad.wmnet [19:46:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:40] PROBLEM - WDQS SPARQL on wdqs2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:47:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1149.eqiad.wmnet [19:47:27] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1150.eqiad.wmnet [19:49:30] RECOVERY - WDQS SPARQL on wdqs2015 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes 
in 0.190 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:51:05] (PS1) Kosta Harlan: ReportIncident: Set developer mode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/964575 [19:54:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1150.eqiad.wmnet [19:54:24] (PS2) EoghanGaffney: [gitlab/failover] Handle runner pausing exceptions [cookbooks] - https://gerrit.wikimedia.org/r/964523 [19:54:25] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1151.eqiad.wmnet [19:57:00] (CR) CI reject: [V: -1] [gitlab/failover] Handle runner pausing exceptions [cookbooks] - https://gerrit.wikimedia.org/r/964523 (owner: EoghanGaffney) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231009T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:15] hi, I'd like to add https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/964575 [20:00:24] I will add it to the calendar [20:00:30] today is a no-deploy day unfortunately due to the US holiday [20:00:36] oh right [20:00:42] then, I will leave it for tomorrow :) [20:02:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1151.eqiad.wmnet [20:02:23] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1152.eqiad.wmnet [20:09:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1152.eqiad.wmnet [20:09:26] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1153.eqiad.wmnet [20:17:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1153.eqiad.wmnet [20:17:14] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1154.eqiad.wmnet [20:26:17] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1154.eqiad.wmnet [20:26:20] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1155.eqiad.wmnet [20:34:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1155.eqiad.wmnet [20:34:37] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1156.eqiad.wmnet [20:42:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1156.eqiad.wmnet [20:49:26] PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:52:14] RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:57:09] (PHPFPMTooBusy) firing: 
Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:00:06] Reedy, sbassett, Maryum, and manfredi: Dear deployers, time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231009T2100). [21:02:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:02:24] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:02:39] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:18:52] (CR) Volans: [gitlab/failover] Handle runner pausing exceptions (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/964523 (owner: EoghanGaffney) [21:59:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:08:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T343198)', diff saved to https://phabricator.wikimedia.org/P52869 and previous config saved to /var/cache/conftool/dbconfig/20231009-220848-arnaudb.json [22:08:53] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [22:15:18] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:15:24] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:18:20] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:21:54] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:23:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P52870 and previous config saved to /var/cache/conftool/dbconfig/20231009-222354-arnaudb.json [22:39:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P52871 and previous config saved to /var/cache/conftool/dbconfig/20231009-223900-arnaudb.json [22:48:33] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:54:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T343198)', diff saved to https://phabricator.wikimedia.org/P52872 and previous config saved to /var/cache/conftool/dbconfig/20231009-225407-arnaudb.json [22:54:09] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [22:54:12] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [22:54:23] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [22:54:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T343198)', diff saved to https://phabricator.wikimedia.org/P52873 and previous config saved to /var/cache/conftool/dbconfig/20231009-225429-arnaudb.json [23:09:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:10:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:14:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:29:01] (NodeTextfileStale) firing: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale