[00:09:33] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1063.eqiad.wmnet with OS bullseye
[00:30:46] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (Papaul) @cmooney hey i am working on 2 nodes cloudvirt1063 and 64 same rack E4 getting the message below. can you please see whu those nodes can not the...
[00:38:52] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965141
[00:38:58] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965141 (owner: TrainBranchBot)
[00:43:21] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:12] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965141 (owner: TrainBranchBot)
[01:02:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:05:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:10:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:31:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:58:33] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:05:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[02:10:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[02:38:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:39] (PS7) KartikMistry: Add Akan language [mediawiki-config] - https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: Srishakatux)
[03:03:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:03:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:03] (ProbeDown) firing: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:09:01] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:13:03] (ProbeDown) resolved: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:14:50] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T348706 (phaultfinder)
[03:19:07] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:19:49] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T348706 (phaultfinder)
[03:27:49] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:37:57] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:46:37] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[04:00:43] PROBLEM - cassandra-a CQL 10.192.48.68:9042 on restbase2012 is CRITICAL: connect to address 10.192.48.68 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[04:01:07] PROBLEM - cassandra-a SSL 10.192.48.68:7000 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[05:52:11] (CR) Elukey: [V: +1 C: +2] profile::prometheus::k8s: drop unused labels for k8s-pods-kserve [puppet] - https://gerrit.wikimedia.org/r/965178 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T0600)
[06:00:05] kormat, marostegui, and Amir1: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T0600).
[06:04:18] (Abandoned) Andrea Denisse: webperf: Move navtiming stats to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791) (owner: Andrea Denisse)
[06:28:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[06:46:09] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[06:52:25] (PS1) Muehlenhoff: Remove access for Arturo [puppet] - https://gerrit.wikimedia.org/r/965390
[06:56:21] (PS1) Elukey: profile::prometheus::k8s: fix drop label rule for k8s-pods-kserve [puppet] - https://gerrit.wikimedia.org/r/965392 (https://phabricator.wikimedia.org/T348456)
[06:56:25] (CR) Muehlenhoff: [C: +2] Remove access for Arturo [puppet] - https://gerrit.wikimedia.org/r/965390 (owner: Muehlenhoff)
[06:58:34] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Arturo Borrero Gonzalez out of all services on: 2156 hosts
[06:59:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Arturo Borrero Gonzalez out of all services on: 2156 hosts
[07:00:06] Amir1, apergos, and jnuche: May I have your attention please! UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T0700)
[07:00:17] morning! no trainees have signed up to learn about deployment this morning, and there's no patches scheduled for deployment in any case. so... I'll just wish you a pleasant day and a good rest of your week, see you next time!
[07:00:36] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44008/console" [puppet] - https://gerrit.wikimedia.org/r/965392 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[07:02:39] (CR) Elukey: [V: +1 C: +2] profile::prometheus::k8s: fix drop label rule for k8s-pods-kserve [puppet] - https://gerrit.wikimedia.org/r/965392 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[07:03:38] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:07:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.3530610490707105s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:07:28] (PS1) Muehlenhoff: Remove Arturo from Icinga config [puppet] - https://gerrit.wikimedia.org/r/965394
[07:08:07] (PS1) Elukey: Revert "profile::prometheus::k8s: fix drop label rule for k8s-pods-kserve" [puppet] - https://gerrit.wikimedia.org/r/965216
[07:08:20] (PS1) MPGuy2824: InitialiseSettings-labs: Set values for renamed PageTriage variable [mediawiki-config] - https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595)
[07:08:38] sorry folks I temporary made upset one or two prometheus k8s masters
[07:08:44] going to revert and fix in a sec
[07:11:27] (CR) Elukey: [C: +2] Revert "profile::prometheus::k8s: fix drop label rule for k8s-pods-kserve" [puppet] - https://gerrit.wikimedia.org/r/965216 (owner: Elukey)
[07:12:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.591758853494321s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:13:38] (CR) Muehlenhoff: [C: +2] Remove Arturo from Icinga config [puppet] - https://gerrit.wikimedia.org/r/965394 (owner: Muehlenhoff)
[07:14:58] (CR) Jelto: [V: +1 C: +2] gitlab_runner: block dockerhub on Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/965157 (https://phabricator.wikimedia.org/T320730) (owner: Jelto)
[07:20:29] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:21:55] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:23:23] (PS1) Elukey: profile::prometheus::k8s: fix (part 2) k8s-pods-kserve's label drop [puppet] - https://gerrit.wikimedia.org/r/965399 (https://phabricator.wikimedia.org/T348456)
[07:24:59] (PS1) Majavah: P:toolforge: provision root sudo policy via here [puppet] - https://gerrit.wikimedia.org/r/965400
[07:26:15] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:27:43] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:37:47] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:38:37] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[07:42:05] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:44:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.036001430466314s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:47:15] (PS1) Stevemunene: Add dummy keytabs for new druid101[0-1] [labs/private] - https://gerrit.wikimedia.org/r/965460 (https://phabricator.wikimedia.org/T336042)
[07:49:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.036001430466314s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:49:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:50:20] (PS1) Majavah: security: use concat to construct access.conf [puppet] - https://gerrit.wikimedia.org/r/965461
[07:50:45] (CR) CI reject: [V: -1] security: use concat to construct access.conf [puppet] - https://gerrit.wikimedia.org/r/965461 (owner: Majavah)
[07:51:16] (PS2) Majavah: security: use concat to construct access.conf [puppet] - https://gerrit.wikimedia.org/r/965461
[07:53:31] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44009/console" [puppet] - https://gerrit.wikimedia.org/r/965461 (owner: Majavah)
[07:54:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:54:59] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:57:51] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:58:45] (PS4) Volans: svc records: add missing comments for reserved IPs [dns] - https://gerrit.wikimedia.org/r/965119
[07:59:31] (CR) Hashar: ci: add Gerrit ssh key to ssh_known_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: Hashar)
[07:59:53] (CR) Volans: svc records: add missing comments for reserved IPs (3 comments) [dns] - https://gerrit.wikimedia.org/r/965119 (owner: Volans)
[08:00:06] hashar and jeena: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T0800).
[08:02:27] o/
[08:02:32] I am checking the logs
[08:03:28] (PS5) Volans: cookbooks: acquire lock for each cookbook run [software/spicerack] - https://gerrit.wikimedia.org/r/938824 (https://phabricator.wikimedia.org/T341973)
[08:07:47] (PS1) Muehlenhoff: Remove Arturo from Icinga contact list [puppet] - https://gerrit.wikimedia.org/r/965463
[08:08:04] (PS2) Elukey: profile::prometheus::k8s: fix (part 2) k8s-pods-kserve's label drop [puppet] - https://gerrit.wikimedia.org/r/965399 (https://phabricator.wikimedia.org/T348456)
[08:08:33] (CR) Muehlenhoff: [C: +2] Remove Arturo from Icinga contact list [puppet] - https://gerrit.wikimedia.org/r/965463 (owner: Muehlenhoff)
[08:09:21] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:11:55] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44010/console" [puppet] - https://gerrit.wikimedia.org/r/965399 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[08:14:39] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ayounsi) @Papaul can you ping us when you're around so we can look into it? Did you check the vlan config on the switch? Is it not able to reach anything e...
[08:15:06] (PS3) Elukey: profile::prometheus::k8s: fix (part 2) k8s-pods-kserve's label drop [puppet] - https://gerrit.wikimedia.org/r/965399 (https://phabricator.wikimedia.org/T348456)
[08:15:51] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:18:29] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44011/console" [puppet] - https://gerrit.wikimedia.org/r/965399 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[08:19:19] (PS3) KartikMistry: Update cxserver to 2023-10-12-080927-production [deployment-charts] - https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982)
[08:19:59] logs look good beside a couple issues that are corner cases due to user input
[08:20:01] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[08:20:03] so will proceed with the train
[08:21:18] (PS1) TrainBranchBot: group2 wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/965465 (https://phabricator.wikimedia.org/T347081)
[08:21:20] (CR) TrainBranchBot: [C: +2] group2 wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/965465 (https://phabricator.wikimedia.org/T347081) (owner: TrainBranchBot)
[08:22:03] (Merged) jenkins-bot: group2 wikis to 1.41.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/965465 (https://phabricator.wikimedia.org/T347081) (owner: TrainBranchBot)
[08:25:35] (CR) JMeybohm: [C: +1] "Thanks! LGTM apart from the typo" [puppet] - https://gerrit.wikimedia.org/r/965229 (owner: Majavah)
[08:26:31] (PS2) Majavah: helmfile: Cleanup chart pull timer [puppet] - https://gerrit.wikimedia.org/r/965229
[08:27:13] (CR) Majavah: [C: +2] helmfile: Cleanup chart pull timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965229 (owner: Majavah)
[08:28:04] (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44012/console" [puppet] - https://gerrit.wikimedia.org/r/965229 (owner: Majavah)
[08:28:39] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.30 refs T347081
[08:28:44] T347081: 1.41.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T347081
[08:30:43] (CR) JMeybohm: [C: +2] Add appserver, api and jobrunner SANs to mw deployments [deployment-charts] - https://gerrit.wikimedia.org/r/965227 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[08:31:21] [895b24ef6778d6a75007f783] [no req] Error: Typed property GrowthExperiments\NewcomerTasks\AddLink\LinkRecommendationUpdater::$linkRecommendationTaskType must not be accessed before initialization
[08:31:26] so well, rolling back
[08:31:34] (Merged) jenkins-bot: Add appserver, api and jobrunner SANs to mw deployments [deployment-charts] - https://gerrit.wikimedia.org/r/965227 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[08:32:23] (CR) Filippo Giunchedi: [C: +1] profile::prometheus::k8s: fix (part 2) k8s-pods-kserve's label drop [puppet] - https://gerrit.wikimedia.org/r/965399 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[08:34:39] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:35:52] !log add 200G to prometheus/ops in eqiad
[08:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:18] (CR) Ayounsi: [C: +1] hiera: announce ns0 IP from bird (eqiad) [puppet] - https://gerrit.wikimedia.org/r/965187 (https://phabricator.wikimedia.org/T348041) (owner: Ssingh)
[08:38:00] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[08:38:17] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[08:38:25] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[08:38:41] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[08:39:24] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 38195
[08:40:14] (PS1) Hashar: Revert "group2 wikis to 1.41.0-wmf.30" [mediawiki-config] - https://gerrit.wikimedia.org/r/965467 (https://phabricator.wikimedia.org/T347081)
[08:40:17] (CR) Hashar: [C: +2] Revert "group2 wikis to 1.41.0-wmf.30" [mediawiki-config] - https://gerrit.wikimedia.org/r/965467 (https://phabricator.wikimedia.org/T347081) (owner: Hashar)
[08:40:43] that rollback is somehow taking ages :-\
[08:40:48] * hashar blames Docker
[08:40:50] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 38195
[08:41:13] (Merged) jenkins-bot: Revert "group2 wikis to 1.41.0-wmf.30" [mediawiki-config] - https://gerrit.wikimedia.org/r/965467 (https://phabricator.wikimedia.org/T347081) (owner: Hashar)
[08:41:40] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 38195
[08:42:11] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:23] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:48] looks like something is depending on bandwith
[08:42:51] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:42:55] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:44:03] hashar: could it be that it has to rebuild i18n for wmf.29 since it was dropped when moving all wikis to wmf.30?
[08:44:19] (PS1) David Caro: cloud_management: add am profile for silences [puppet] - https://gerrit.wikimedia.org/r/965468 (https://phabricator.wikimedia.org/T347490)
[08:44:19] the network / disk is too slow
[08:44:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 38195
[08:44:37] the rollback involves pushing a 8GBytes image at 5MBytes/seconds
[08:45:28] real 0m4.431s
[08:45:35] (CR) Majavah: [C: -1] "profile::alertmanager::api::rw is a hiera key, not a profile: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/head" [puppet] - https://gerrit.wikimedia.org/r/965468 (https://phabricator.wikimedia.org/T347490) (owner: David Caro)
[08:45:44] 08:44:26 Finished build-and-push-container-images (duration: 10m 21s)
[08:45:44] :)
[08:45:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 56099
[08:46:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:49:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 56099
[08:49:36] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/965170 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[08:51:27] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:34] (CR) Muehlenhoff: [C: +2] Extend acmechief config for new apt hosts [puppet] - https://gerrit.wikimedia.org/r/965170 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[08:51:41] (PS2) Muehlenhoff: Extend acmechief config for new apt hosts [puppet] - https://gerrit.wikimedia.org/r/965170 (https://phabricator.wikimedia.org/T331613)
[08:53:07] (PS2) David Caro: cloud_management: add cloudcumins to am api rw [puppet] - https://gerrit.wikimedia.org/r/965468 (https://phabricator.wikimedia.org/T347490)
[08:53:12] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: Revert "group2 wikis to 1.41.0-wmf.30" # T347081
[08:53:16] T347081: 1.41.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T347081
[08:53:56] (CR) David Caro: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44015/console" [puppet] - https://gerrit.wikimedia.org/r/965468 (https://phabricator.wikimedia.org/T347490) (owner: David Caro)
[08:56:19] (CR) David Caro: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44016/console" [puppet] - https://gerrit.wikimedia.org/r/965468 (https://phabricator.wikimedia.org/T347490) (owner: David Caro)
[08:57:13] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:57:37] (CR) David Caro: [V: +1] cloud_management: add cloudcumins to am api rw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965468 (https://phabricator.wikimedia.org/T347490) (owner: David Caro)
[08:58:02] (PS1) Elukey: profile::statistics::explorer:ml: expand model_upload.sh [puppet] - https://gerrit.wikimedia.org/r/965469 (https://phabricator.wikimedia.org/T347838)
[08:58:39] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:58:41] (CR) Filippo Giunchedi: [C: +1] cloud_management: add cloudcumins to am api rw [puppet] - https://gerrit.wikimedia.org/r/965468 (https://phabricator.wikimedia.org/T347490) (owner: David Caro)
[09:00:08] (PS1) Elukey: profile::statistics::explorer::ml: rename script to model-upload [puppet] - https://gerrit.wikimedia.org/r/965470
[09:01:31] (CR) FNegri: [C: +1] cloud_management: add cloudcumins to am api rw [puppet] - https://gerrit.wikimedia.org/r/965468 (https://phabricator.wikimedia.org/T347490) (owner: David Caro)
[09:03:27] (CR) Elukey: [V: +1 C: +2] profile::prometheus::k8s: fix (part 2) k8s-pods-kserve's label drop [puppet] - https://gerrit.wikimedia.org/r/965399 (https://phabricator.wikimedia.org/T348456) (owner: Elukey)
[09:05:53] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:09:39] (PS2) Elukey: profile::statistics::explorer:ml: expand model_upload.sh [puppet] - https://gerrit.wikimedia.org/r/965469 (https://phabricator.wikimedia.org/T347838)
[09:09:41] (PS2) Elukey: profile::statistics::explorer::ml: rename script to model-upload [puppet] - https://gerrit.wikimedia.org/r/965470
[09:10:48] (CR) Klausman: [C: +1] profile::statistics::explorer:ml: expand model_upload.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965469 (https://phabricator.wikimedia.org/T347838) (owner: Elukey)
[09:11:37] RECOVERY - HTTPS on apt1002 is OK: SSL OK - OCSP staple validity for apt.wikimedia.org has 460102 seconds left:Certificate apt.wikimedia.org valid until 2023-11-30 23:21:52 +0000 (expires in 49 days) https://wikitech.wikimedia.org/wiki/APT_repository
[09:11:41] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:11:51] RECOVERY - HTTP on apt1002 is OK: HTTP OK: HTTP/1.1 302 Moved Temporarily - 365 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/APT_repository
[09:15:59] PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:16:59] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on an-master1002.eqiad.wmnet with reason: Rebooting misbehaving an-master1002
[09:17:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on an-master1002.eqiad.wmnet with reason: Rebooting misbehaving an-master1002
[09:18:00] (CR) Ilias Sarantopoulos: [C: +1] profile::statistics::explorer:ml: expand model_upload.sh [puppet] - https://gerrit.wikimedia.org/r/965469 (https://phabricator.wikimedia.org/T347838) (owner: Elukey)
[09:18:24] (CR) Ilias Sarantopoulos: [C: +1] "I agree!" [puppet] - https://gerrit.wikimedia.org/r/965470 (owner: Elukey)
[09:19:19] (PS1) JMeybohm: service_proxy: Add mw-wikifunctions-ro listener [puppet] - https://gerrit.wikimedia.org/r/965471 (https://phabricator.wikimedia.org/T347544)
[09:20:02] (PS1) Muehlenhoff: aptrepo: Create /srv/private/junos in Puppet [puppet] - https://gerrit.wikimedia.org/r/965472
[09:20:19] RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:23:26] (CR) Ayounsi: aptrepo: Create /srv/private/junos in Puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965472 (owner: Muehlenhoff)
[09:23:33] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:24:06] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44017/console" [puppet] - https://gerrit.wikimedia.org/r/965471 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[09:26:57] (CR) Majavah: [C: +1] cloud_management: add cloudcumins to am api rw [puppet] - https://gerrit.wikimedia.org/r/965468 (https://phabricator.wikimedia.org/T347490) (owner: David Caro)
[09:28:49] (CR) Muehlenhoff: aptrepo: Create /srv/private/junos in Puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965472 (owner: Muehlenhoff)
[09:30:30] (PS2) Muehlenhoff: aptrepo: Create /srv/private/junos in Puppet [puppet] - https://gerrit.wikimedia.org/r/965472
[09:31:18] !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for an-master1002.eqiad.wmnet
[09:31:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-master1002.eqiad.wmnet
[09:31:32] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1002.eqiad.wmnet
[09:33:44] (PS1) JMeybohm: wikifunctions: Switch to use mw-wikifunctions for API calls [deployment-charts] - https://gerrit.wikimedia.org/r/965473 (https://phabricator.wikimedia.org/T347544)
[09:34:37] (CR) CI reject: [V: -1] wikifunctions: Switch to use mw-wikifunctions for API calls [deployment-charts] - https://gerrit.wikimedia.org/r/965473 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[09:35:44] (PS2) JMeybohm: wikifunctions: Switch to use mw-wikifunctions for API calls [deployment-charts] - https://gerrit.wikimedia.org/r/965473 (https://phabricator.wikimedia.org/T347544)
[09:36:38] (CR) CI reject: [V: -1] wikifunctions: Switch to use mw-wikifunctions for API calls [deployment-charts] - https://gerrit.wikimedia.org/r/965473 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[09:37:08] (PS3) Elukey: profile::statistics::explorer:ml: expand model_upload.sh [puppet] - https://gerrit.wikimedia.org/r/965469 (https://phabricator.wikimedia.org/T347838)
[09:37:10] (PS3) Elukey: profile::statistics::explorer::ml: rename script to model-upload [puppet] - https://gerrit.wikimedia.org/r/965470
[09:37:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1002.eqiad.wmnet
[09:37:32] (CR) Elukey: profile::statistics::explorer:ml: expand model_upload.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965469 (https://phabricator.wikimedia.org/T347838) (owner: Elukey)
[09:38:50] (CR) JMeybohm: "CI fails because the listener does not exist yet" [deployment-charts] - https://gerrit.wikimedia.org/r/965473 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[09:40:20] !log repooling cp4040 (depooled for T347837 and forgot)
[09:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:24] T347837: Deploy new purged version with UDS feature - https://phabricator.wikimedia.org/T347837
[09:42:35] (CR) Hnowlan: [C: +1] service_proxy: Add mw-wikifunctions-ro listener [puppet] - https://gerrit.wikimedia.org/r/965471 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[09:42:37] (CR) Elukey: [C: +2] profile::statistics::explorer:ml: expand model_upload.sh [puppet] - https://gerrit.wikimedia.org/r/965469 (https://phabricator.wikimedia.org/T347838) (owner: Elukey)
[09:42:44] (CR) Elukey: [C: +2] profile::statistics::explorer::ml: rename script to model-upload [puppet] - https://gerrit.wikimedia.org/r/965470 (owner: Elukey)
[09:46:42] (PS1) Elukey: profile::statistics::explorer::ml: change owner of published dir [puppet] - https://gerrit.wikimedia.org/r/965474 (https://phabricator.wikimedia.org/T347838)
[09:49:43] (CR) Ayounsi: [C: +1] aptrepo: Create /srv/private/junos in Puppet [puppet] - https://gerrit.wikimedia.org/r/965472 (owner: Muehlenhoff)
[09:50:44] (CR) Hnowlan: [C: +1] wikifunctions: Switch to use mw-wikifunctions for API calls [deployment-charts] - https://gerrit.wikimedia.org/r/965473 (https://phabricator.wikimedia.org/T347544) (owner: JMeybohm)
[09:58:48] (CR) Elukey: APIGW: add entry for llm langid LW isvc (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/965191 (https://phabricator.wikimedia.org/T340507) (owner: Ilias Sarantopoulos)
[09:59:26] (CR) Elukey: [C: +2] services: upgrade Docker image for eventstreams services [deployment-charts] - https://gerrit.wikimedia.org/r/964848 (https://phabricator.wikimedia.org/T343511) (owner: Elukey)
[10:00:05] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T1000).
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T1000) [10:01:02] (03PS1) 10David Caro: metricsinfra.alertmanager: add victorops and paging route [puppet] - 10https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T320973) [10:03:20] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: sync [10:12:25] (03PS2) 10David Caro: metricsinfra.alertmanager: add victorops and paging route [puppet] - 10https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T320973) [10:12:56] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo: Create /srv/private/junos in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/965472 (owner: 10Muehlenhoff) [10:13:27] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: sync [10:15:57] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: sync [10:20:33] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [10:24:32] (03CR) 10David Caro: [V: 03+1 C: 03+2] cloud_management: add cloudcumins to am api rw [puppet] - 10https://gerrit.wikimedia.org/r/965468 (https://phabricator.wikimedia.org/T347490) (owner: 10David Caro) [10:25:18] 10SRE, 10Infrastructure-Foundations: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730 (10MoritzMuehlenhoff) [10:25:20] (03PS1) 10Hnowlan: rest-gateway: fix remaining rewrite path [deployment-charts] - 10https://gerrit.wikimedia.org/r/965478 (https://phabricator.wikimedia.org/T347027) [10:25:41] 10SRE, 10Infrastructure-Foundations: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:25:51] (03PS2) 10Ilias Sarantopoulos: ml-services: add langid in llm namespace [deployment-charts] - 
10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) [10:26:04] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: sync [10:26:43] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: sync [10:26:47] (03CR) 10Ilias Sarantopoulos: "just realized I wrote the responses yesterday without pushing the new patch. Sorry 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965189 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [10:26:54] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: sync [10:27:31] (03PS2) 10Ilias Sarantopoulos: service: Add entry for llm langid for Lift Wing in the api-gw config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965191 (https://phabricator.wikimedia.org/T340507) [10:27:46] (03CR) 10Ilias Sarantopoulos: service: Add entry for llm langid for Lift Wing in the api-gw config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965191 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [10:28:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:34:18] (03PS1) 10Ilias Sarantopoulos: ml-services: update revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) [10:34:44] (03PS6) 10Majavah: dnsrecursor: remove need to run labs-ip-alias-dump twice [puppet] - 10https://gerrit.wikimedia.org/r/960164 [10:35:26] (03CR) 10Majavah: [C: 03+2] dnsrecursor: remove need to run labs-ip-alias-dump twice (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/960164 (owner: 10Majavah) [10:36:34] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-aborrero: Add support for nftables in profile::firewall - https://phabricator.wikimedia.org/T336497 (10MoritzMuehlenhoff) 05Open→03Resolved The ganeti test cluster, cloudgw and the sretest hosts are using nftables. This completes the initi... [10:37:17] (03PS1) 10Elukey: services: force ipv4 for eventstreams when using the local tls proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/965481 (https://phabricator.wikimedia.org/T347477) [10:39:13] (03CR) 10David Caro: "@godog I tested this in our current setup, and it paged correctly, but what should happen when I ack the alert on alertmanager side? shoul" [puppet] - 10https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T320973) (owner: 10David Caro) [10:43:25] (03CR) 10Majavah: [C: 04-1] metricsinfra.alertmanager: add victorops and paging route (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T320973) (owner: 10David Caro) [10:44:29] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) It seems some of these files are available i... 
[10:46:18] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10KOfori) [10:46:25] (03CR) 10Elukey: [C: 03+2] services: force ipv4 for eventstreams when using the local tls proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/965481 (https://phabricator.wikimedia.org/T347477) (owner: 10Elukey) [10:46:58] 10SRE, 10Infrastructure-Foundations: Port defs_from_etcd logic to nftables - https://phabricator.wikimedia.org/T348734 (10MoritzMuehlenhoff) [10:47:08] 10SRE, 10Infrastructure-Foundations: Port defs_from_etcd logic to nftables - https://phabricator.wikimedia.org/T348734 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:48:22] 10SRE, 10Infrastructure-Foundations: Make notrack available for nftables - https://phabricator.wikimedia.org/T348735 (10MoritzMuehlenhoff) [10:48:29] 10SRE, 10Infrastructure-Foundations: Make notrack available for nftables - https://phabricator.wikimedia.org/T348735 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:49:45] 10SRE, 10Infrastructure-Foundations: Adapt firewall logging for nftables - https://phabricator.wikimedia.org/T348736 (10MoritzMuehlenhoff) [10:49:55] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: sync [10:50:23] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: sync [10:51:56] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: sync [10:52:06] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: sync [10:56:38] (03CR) 10Elukey: [C: 03+2] profile::statistics::explorer::ml: change owner of published dir [puppet] - 10https://gerrit.wikimedia.org/r/965474 (https://phabricator.wikimedia.org/T347838) (owner: 10Elukey) [10:57:36] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] service_proxy: Add mw-wikifunctions-ro listener [puppet] - 10https://gerrit.wikimedia.org/r/965471 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) 
[10:58:21] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965473 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [10:59:39] (03PS1) 10Muehlenhoff: Remove obsolete Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/965484 (https://phabricator.wikimedia.org/T156955) [11:09:22] (03PS9) 10Majavah: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - 10https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [11:09:24] (03PS11) 10Majavah: P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) [11:13:39] 10SRE, 10Infrastructure-Foundations, 10netops: Tighter control on exported BGP routes from MRs - https://phabricator.wikimedia.org/T348739 (10cmooney) p:05Triage→03Low [11:18:29] (03CR) 10Majavah: [C: 03+2] dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - 10https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [11:19:02] (03CR) 10JMeybohm: [C: 03+2] wikifunctions: Switch to use mw-wikifunctions for API calls [deployment-charts] - 10https://gerrit.wikimedia.org/r/965473 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [11:19:49] (03Merged) 10jenkins-bot: wikifunctions: Switch to use mw-wikifunctions for API calls [deployment-charts] - 10https://gerrit.wikimedia.org/r/965473 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [11:20:48] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:21:17] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:27:25] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [11:27:38] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [11:28:55] (03PS1) 10Cathal Mooney: Tighter control of BGP IP export from management routers [homer/public] - 10https://gerrit.wikimedia.org/r/965491 (https://phabricator.wikimedia.org/T348739) [11:29:53] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:11] ^ that's me, silencing, running some tests [11:30:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: testing [11:30:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: testing [11:32:47] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:29] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [11:34:29] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [11:36:21] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [11:37:13] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [11:37:21] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 28): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44018/console" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:37:34] (03CR) 10Majavah: [C: 03+2] P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [11:40:09] (03PS3) 10Stevemunene: druid: Add druid druid10[09-11] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/962250 
(https://phabricator.wikimedia.org/T336042) [11:40:11] (03PS1) 10Stevemunene: Switch druid1004 zookeeper node with druid1009 [puppet] - 10https://gerrit.wikimedia.org/r/965499 (https://phabricator.wikimedia.org/T336042) [11:40:13] (03PS1) 10Stevemunene: Switch druid1005 zookeeper node with druid1010 [puppet] - 10https://gerrit.wikimedia.org/r/965500 (https://phabricator.wikimedia.org/T336042) [11:40:15] (03PS1) 10Stevemunene: Switch druid1006 zookeeper node with druid1011 [puppet] - 10https://gerrit.wikimedia.org/r/965501 (https://phabricator.wikimedia.org/T336042) [11:41:01] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, and 2 others: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10JMeybohm) [11:49:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:49:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:49:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [11:49:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T1200) [12:05:06] (03PS2) 10Ilias Sarantopoulos: ml-services: update revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) [12:05:22] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44019/console" [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [12:06:31] (03PS3) 10Ilias Sarantopoulos: ml-services: update 
revscoring and enable articlequality mp [deployment-charts] - 10https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) [12:10:16] (03PS1) 10Phuedx: LinkRecommendationUpdater: Update $linkRecommendationTaskType declaration [extensions/GrowthExperiments] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965217 (https://phabricator.wikimedia.org/T348719) [12:15:33] 10SRE, 10Infrastructure-Foundations: Port defs_from_etcd logic to nftables - https://phabricator.wikimedia.org/T348734 (10Volans) Mentioning T348525 too to avoid duplicate work. [12:16:06] !log disable puppet on A:cp-text - T347544 [12:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:10] T347544: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 [12:16:40] starting decommission of restbase2012-b — T328490 [12:16:41] T328490: restbase cluster: decommission end-of-life hosts - https://phabricator.wikimedia.org/T328490 [12:17:27] (03CR) 10Ilias Sarantopoulos: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [12:18:32] !log disable puppet on A:cp - T347544 [12:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:07] (03CR) 10JMeybohm: [C: 03+2] wikifunctions: Add routing to separate mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965056 (https://phabricator.wikimedia.org/T347544) (owner: 10Clément Goubert) [12:26:48] !log re-enable puppet on A:cp - T347544 [12:26:52] (03CR) 10Btullis: [C: 03+1] Change the kafka-jumbo bootstrap host for karapace [puppet] - 10https://gerrit.wikimedia.org/r/965159 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:53] T347544: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 
[12:27:09] (03CR) 10Btullis: [C: 03+1] Change the kafka-jumbo bootstrap host for the analytics cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/965160 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:27:40] (03CR) 10Btullis: [C: 03+1] Change the underlying host for the kafka-jumbo-canary cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/965161 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:28:53] (03CR) 10Brouberol: [C: 03+2] Change the underlying host for the kafka-jumbo-canary cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/965161 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:29:20] (03CR) 10Brouberol: [C: 03+2] Change the kafka-jumbo bootstrap host for the analytics cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/965160 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:29:22] (03CR) 10Brouberol: [C: 03+2] Change the kafka-jumbo bootstrap host for karapace [puppet] - 10https://gerrit.wikimedia.org/r/965159 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:47:01] PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:07] (03CR) 10Klausman: [C: 03+1] ml-services: update revscoring and enable articlequality mp [deployment-charts] - 10https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [12:47:56] (03CR) 10Klausman: [C: 03+1] service: Add entry for llm langid for Lift Wing in the api-gw config [deployment-charts] - 10https://gerrit.wikimedia.org/r/965191 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [12:50:09] (03CR) 10Hashar: [C: 03+2] "That was fast, thank you! I am going to deploy the patch and roll the train once done." 
[extensions/GrowthExperiments] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965217 (https://phabricator.wikimedia.org/T348719) (owner: 10Phuedx) [12:54:17] RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:43] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, 10serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10JMeybohm) [12:58:26] (03CR) 10Ayounsi: Tighter control of BGP IP export from management routers (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/965491 (https://phabricator.wikimedia.org/T348739) (owner: 10Cathal Mooney) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T1300) [13:00:05] kart_ and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:24] ohhh the backport window I forgot about it :/ [13:00:37] I have a hotfix in the pipe https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/965217/ [13:00:45] and I will run the train AFTER the backport window [13:03:58] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, 10serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10JMeybohm) 05In progress→03Resolved All wikifunctions.org traffic from the edge as well as from function-orchestrator is now served by th... 
[13:07:24] (03Merged) 10jenkins-bot: LinkRecommendationUpdater: Update $linkRecommendationTaskType declaration [extensions/GrowthExperiments] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965217 (https://phabricator.wikimedia.org/T348719) (owner: 10Phuedx) [13:07:25] hashar: sorry, I went off keyboard for 1 minute and it became 6 minutes ;) [13:07:58] hashar: let me know once hotfix is done [13:11:11] I am deploying it [13:11:18] well the code [13:11:34] so you can already +2 the changes you have [13:11:55] !log hashar@deploy2002 Started scap: Backport for [[gerrit:965217|LinkRecommendationUpdater: Update $linkRecommendationTaskType declaration (T348719)]] [13:11:58] the fix verification for the patch I have merged would need the group 2 wikis to be switched to 1.41.0-wmf.30, but I will do that after the backport window [13:12:02] T348719: Error: Typed property GrowthExperiments\NewcomerTasks\AddLink\LinkRecommendationUpdater::$linkRecommendationTaskType must not be accessed before initialization - https://phabricator.wikimedia.org/T348719 [13:13:15] !log hashar@deploy2002 phuedx and hashar: Backport for [[gerrit:965217|LinkRecommendationUpdater: Update $linkRecommendationTaskType declaration (T348719)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:13:16] !log hashar@deploy2002 phuedx and hashar: Continuing with sync [13:13:41] (03PS2) 10Cathal Mooney: Tighter control of BGP IP export from management routers [homer/public] - 10https://gerrit.wikimedia.org/r/965491 (https://phabricator.wikimedia.org/T348739) [13:15:22] (03CR) 10Cathal Mooney: Tighter control of BGP IP export from management routers (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/965491 (https://phabricator.wikimedia.org/T348739) (owner: 10Cathal Mooney) [13:15:42] hashar: mine is a config change, so that's fine along with when I start deployment. [13:16:04] (03CR) 10Ayounsi: [C: 03+1] "lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/965491 (https://phabricator.wikimedia.org/T348739) (owner: 10Cathal Mooney) [13:16:38] kart_: great, php fpm are being restarted [13:16:44] (03CR) 10Hashar: ci: add Gerrit ssh key to ssh_known_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: 10Hashar) [13:16:58] (03CR) 10Cathal Mooney: [C: 03+2] Tighter control of BGP IP export from management routers [homer/public] - 10https://gerrit.wikimedia.org/r/965491 (https://phabricator.wikimedia.org/T348739) (owner: 10Cathal Mooney) [13:17:33] (03PS5) 10Muehlenhoff: Add a prometheus check for whether nftables is running [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) [13:17:44] (03Merged) 10jenkins-bot: Tighter control of BGP IP export from management routers [homer/public] - 10https://gerrit.wikimedia.org/r/965491 (https://phabricator.wikimedia.org/T348739) (owner: 10Cathal Mooney) [13:18:47] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:965217|LinkRecommendationUpdater: Update $linkRecommendationTaskType declaration (T348719)]] (duration: 06m 51s) [13:18:55] T348719: Error: Typed property GrowthExperiments\NewcomerTasks\AddLink\LinkRecommendationUpdater::$linkRecommendationTaskType must not be accessed before initialization - https://phabricator.wikimedia.org/T348719 [13:19:11] kart_: all your! [13:19:23] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 40317 [13:19:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 40317 [13:20:04] (03CR) 10CI reject: [V: 04-1] Add a prometheus check for whether nftables is running [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [13:22:35] hashar: thanks! 
[13:23:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host archiva1002.wikimedia.org [13:23:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: 10Srishakatux) [13:23:22] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15133 [13:23:33] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:23:54] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [13:24:07] (03Merged) 10jenkins-bot: Add Akan language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: 10Srishakatux) [13:24:31] !log kartik@deploy2002 Started scap: Backport for [[gerrit:955007|Add Akan language (T333765)]] [13:24:45] (03CR) 10Bking: [V: 03+1] druid: Add druid druid10[09-11] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/962250 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [13:24:53] (03PS6) 10Muehlenhoff: Add a prometheus check for whether nftables is running [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) [13:25:08] T333765: Remove Akan support from MediaWiki, ULS, and Wikimedia servers - https://phabricator.wikimedia.org/T333765 [13:25:51] !log kartik@deploy2002 kartik and srishakatux: Backport for [[gerrit:955007|Add Akan language (T333765)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:26:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' 
for AS: 15133 [13:26:56] (03CR) 10Joal: [C: 03+1] "One note on if syntax, not mandatory at all :)" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:26:59] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host archiva1002.wikimedia.org [13:27:18] (03CR) 10CI reject: [V: 04-1] Add a prometheus check for whether nftables is running [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [13:27:53] (03CR) 10Joal: [C: 03+1] Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:28:46] !log kartik@deploy2002 kartik and srishakatux: Continuing with sync [13:29:01] (03CR) 10Joal: [C: 03+1] Deploy multiple spark shuffler services to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:29:12] (03PS1) 10Vivian Rook: Remove gerrit git from quarry [puppet] - 10https://gerrit.wikimedia.org/r/965514 (https://phabricator.wikimedia.org/T348748) [13:29:39] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:32:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 139901 [13:32:35] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:32:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 139901 [13:33:06] (03PS7) 10Muehlenhoff: Add a prometheus check for whether nftables is running [puppet] - 
10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) [13:33:39] (03PS35) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [13:33:41] (03PS16) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [13:33:43] (03PS46) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) [13:34:10] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:955007|Add Akan language (T333765)]] (duration: 09m 39s) [13:34:12] (03CR) 10Elukey: ml-services: update revscoring and enable articlequality mp (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [13:34:15] T333765: Remove Akan support from MediaWiki, ULS, and Wikimedia servers - https://phabricator.wikimedia.org/T333765 [13:34:36] I'm also done with my change, hashar [13:34:44] great [13:35:43] !log remove redundant 208.80.153.231/32 from /e/n/i on A:dns-rec and A:codfw (superseded by label lo:anycast): T348041 [13:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:47] T348041: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 [13:36:45] (03CR) 10CI reject: [V: 04-1] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:37:16] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 54994 [13:37:33] (03PS1) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) [13:37:38] (03CR) 10Muehlenhoff: Add a prometheus check for whether nftables is running (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [13:37:57] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:37:59] (03CR) 10CI reject: [V: 04-1] C:prometheus::ethtool_export Add ethtool exporter. [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [13:38:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 54994 [13:38:54] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 28): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44020/console" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:40:09] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:40:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/965484 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:41:23] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:42:39] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [13:42:47] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:43:05] PROBLEM - BGP status on 
cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:43:40] !log remove old ns2 IP 91.198.174.239/32 from /e/n/i on A:dns-rec: T329219 [13:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:44] T329219: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 [13:45:45] (03PS1) 10Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [13:46:18] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44021/console" [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:46:20] (03CR) 10CI reject: [V: 04-1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [13:46:51] (03CR) 10Slyngshede: Add a prometheus check for whether nftables is running (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [13:47:12] (03CR) 10Btullis: Deploy multiple spark shuffler services to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:48:22] (03PS2) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) [13:48:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) @ayounsi yes, I checked the VLAN config on the switch and confirmed that the interface is in the right VLAN. The reason you cannot ssh into the... [13:49:57] !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:50:02] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:50:16] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:50:23] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:52:05] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:52:20] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8 DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44022/console" [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [13:52:21] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:53:35] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:53:53] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:54:11] I am doing the train now [13:55:24] (03CR) 10Btullis: "I notice that you've submitted this in the same chain as the VIP addresses." [puppet] - 10https://gerrit.wikimedia.org/r/965499 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [13:55:45] (03PS3) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) [13:56:27] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965518 (https://phabricator.wikimedia.org/T347081) [13:56:29] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965518 (https://phabricator.wikimedia.org/T347081) (owner: 10TrainBranchBot) [13:56:30] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Aklapper) I updated https://www.mediawiki.org/wiki/Extension:Graph/Graphoid accordingly. [13:56:51] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44025/console" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [13:57:11] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965518 (https://phabricator.wikimedia.org/T347081) (owner: 10TrainBranchBot) [13:59:11] (03PS4) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) [14:00:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough [14:00:19] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44026/console" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [14:02:10] 10SRE-swift-storage, 10TimedMediaHandler, 10Wikimedia-production-error: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.wav.m3u8 - https://phabricator.wikimedia.org/T348753 (10hashar) I thought it was due to some bad input, but the code t... [14:03:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:03:32] (03PS5) 10Slyngshede: C:prometheus::ethtool_export Add ethtool exporter. 
[puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) [14:03:52] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.30 refs T347081 [14:04:10] T347081: 1.41.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T347081 [14:04:15] !log sudo cumin -b1 -s120 'A:dns-rec and not P{dns6002*}' 'systemctl restart pdns-recursor.service' [14:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:04] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44027/console" [puppet] - 10https://gerrit.wikimedia.org/r/965515 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [14:05:10] 10SRE, 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Eevans) >>! In T348142#9231437, @Papaul wrote: > @Eevans hello when do you think it will be the best day for us to coordinate with you on relocating this node so that we are not block by it during the codfw sw... 
[14:07:10] (03PS2) 10Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [14:07:12] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [14:07:19] (03CR) 10David Caro: Remove gerrit git from quarry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965514 (https://phabricator.wikimedia.org/T348748) (owner: 10Vivian Rook) [14:07:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [14:07:42] (03CR) 10Eevans: [C: 03+2] cassandra: add utility wrapper & instance symlinks for sstableutil [puppet] - 10https://gerrit.wikimedia.org/r/964072 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [14:07:45] (03CR) 10CI reject: [V: 04-1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [14:10:05] (03PS3) 10Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [14:10:38] (03CR) 10CI reject: [V: 04-1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [14:11:27] (03PS1) 10Ayounsi: esams: update v4 infra prefix from LVS prefix [puppet] - 10https://gerrit.wikimedia.org/r/965519 [14:11:44] !log mwmaint2002: stop previous instance of `refreshLinkRecommendations` maintenance job (T348719) [14:11:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:57] T348719: Error: Typed property GrowthExperiments\NewcomerTasks\AddLink\LinkRecommendationUpdater::$linkRecommendationTaskType must not be accessed before initialization - https://phabricator.wikimedia.org/T348719 [14:12:12] hashar: fyi ^^. i've stopped the pre-existing job at mwmaint, so at 14:27 UTC, the new code will be used (hopefully). [14:12:26] urbanecm: ah great thank you! [14:12:39] otherwise it'd notice it's running already, and probably end up not doing anything. [14:12:46] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bookworm [14:12:57] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be2003.codfw.wmnet with OS bookworm [14:12:57] I guess cause updates already have been processed? [14:13:09] anyway it is tracked so if the issue occurs again we know where to look [14:13:50] the job runs for more than an hour, and afaics the timer doesn't start the new one if the previous instance's still running [14:13:56] (03CR) 10Cathal Mooney: [C: 03+1] esams: update v4 infra prefix from LVS prefix [puppet] - 10https://gerrit.wikimedia.org/r/965519 (owner: 10Ayounsi) [14:14:07] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s2.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s3.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s5.service,mediawiki_job_growthexperiments-refreshLinkRecommendations-s6.service,mediawiki_job_growthexperiments-refreshLinkRecommendati [14:14:07] ervice https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:37] ^^ that's the stop. 
will resolve when it starts again.^^ [14:15:10] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: sync [14:15:42] (03PS1) 10Eevans: cassandra: fix incorrect path to sstable utilities [puppet] - 10https://gerrit.wikimedia.org/r/965521 (https://phabricator.wikimedia.org/T346803) [14:16:03] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: sync [14:16:16] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: sync [14:16:19] (03CR) 10Eevans: [C: 03+2] cassandra: fix incorrect path to sstable utilities [puppet] - 10https://gerrit.wikimedia.org/r/965521 (https://phabricator.wikimedia.org/T346803) (owner: 10Eevans) [14:16:34] woo that's exciting [14:17:01] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: sync [14:20:01] (03CR) 10Ayounsi: [C: 03+2] esams: update v4 infra prefix from LVS prefix [puppet] - 10https://gerrit.wikimedia.org/r/965519 (owner: 10Ayounsi) [14:20:33] (03CR) 10Vivian Rook: Remove gerrit git from quarry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965514 (https://phabricator.wikimedia.org/T348748) (owner: 10Vivian Rook) [14:22:17] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:23:28] (03CR) 10David Caro: Remove gerrit git from quarry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965514 (https://phabricator.wikimedia.org/T348748) (owner: 10Vivian Rook) [14:23:50] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [14:25:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:27:07] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [14:30:33] PROBLEM - Check systemd state on ms-be1049 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:45] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1049 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:30:53] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [14:31:24] (03PS4) 10Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [14:32:51] !log completed restarts of pdns-recursor in doh* and dns* [14:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:26] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: sync [14:33:53] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: sync [14:33:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6412 [14:34:00] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1007.mgmt.eqiad.wmnet with reboot policy FORCED [14:34:12] !log ayounsi@cumin1001 END (PASS) - Cookbook 
sre.network.peering (exit_code=0) with action 'email' for AS: 6412 [14:34:30] urbanecm: the GrowthExperiments error log for $linkRecommendationTaskType does not show up in the log so I am assuming it got fixed (or nothing is being run) [14:34:38] I am marking the task resolved [14:34:41] thanks! [14:35:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 46562 [14:35:44] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1002.eqiad.wmnet [14:35:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 [14:37:22] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1007.mgmt.eqiad.wmnet with reboot policy FORCED [14:37:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15435 [14:38:09] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1007.mgmt.eqiad.wmnet with reboot policy FORCED [14:38:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [14:38:23] PROBLEM - Check systemd state on kubernetes2017 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:33] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15435 [14:41:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:39] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1002.eqiad.wmnet [14:42:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 25542 [14:42:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1007.mgmt.eqiad.wmnet with reboot policy FORCED [14:43:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 25542 [14:44:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15703 [14:44:39] (03CR) 10CI reject: [V: 04-1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 
(https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [14:44:50] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:45:10] (ThanosQueryInstantLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [14:45:22] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15703 [14:45:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 30132 [14:45:32] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1003.eqiad.wmnet [14:45:37] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2017 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:45:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 30132 [14:46:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:39] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3267 [14:46:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3267 [14:46:56] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudelastic1007 - jclark@cumin1001" [14:47:16] (03CR) 10Majavah: [C: 03+2] "retrying" [alerts] - 
10https://gerrit.wikimedia.org/r/965154 (owner: 10Majavah) [14:47:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudelastic1007 - jclark@cumin1001" [14:47:50] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:48:27] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 398196 [14:48:33] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398196 [14:48:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 400474 [14:49:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 400474 [14:49:05] (03Merged) 10jenkins-bot: team-wmcs: ceph: cleanup summaries of existing alerts [alerts] - 10https://gerrit.wikimedia.org/r/965154 (owner: 10Majavah) [14:49:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 28458 [14:49:08] (03Merged) 10jenkins-bot: team-wmcs: ceph: add alert for slow ops [alerts] - 10https://gerrit.wikimedia.org/r/965155 (owner: 10Majavah) [14:49:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28458 [14:49:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12200 [14:50:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12200 [14:50:15] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1007.mgmt.eqiad.wmnet 
with reboot policy FORCED [14:50:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:51:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1007.mgmt.eqiad.wmnet with reboot policy FORCED [14:51:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35008 [14:52:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35008 [14:52:29] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1003.eqiad.wmnet [14:53:01] (03PS8) 10Muehlenhoff: Add a prometheus check for whether nftables is running [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) [14:53:18] (03CR) 10Muehlenhoff: Add a prometheus check for whether nftables is running (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [14:53:31] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2003.codfw.wmnet [14:55:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:56:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.wikimedia.org with OS bullseye [14:56:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install 
cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1007.... [14:56:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.wikimedia.org with OS bullseye [14:56:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1007.wiki... [14:57:02] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.wikimedia.org with OS bullseye [14:57:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1007.... [14:57:45] !log stopping gdnsd on dns2006 to simulate bird prefix withdrawal [14:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:17] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:44] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, 10serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10Jdforrester-WMF) Thank you! 
[14:59:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:59:35] (03Abandoned) 10Brion VIBBER: Disable older WebM VP8 transcodes except 360p [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802665 (https://phabricator.wikimedia.org/T309823) (owner: 10Brion VIBBER) [15:00:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16591 [15:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965141 (owner: 10TrainBranchBot) [15:00:11] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2003.codfw.wmnet [15:01:13] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:43] PROBLEM - AuthDNS-over-TLS Works on dns2006 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [15:04:13] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bookworm [15:04:28] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be2003.codfw.wmnet with OS bookworm executed with... 
[15:04:51] RECOVERY - AuthDNS-over-TLS Works on dns2006 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [15:07:06] (03PS5) 10Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [15:07:35] (03CR) 10CI reject: [V: 04-1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [15:07:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965144 [15:07:59] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/965144 (owner: 10TrainBranchBot) [15:08:54] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bullseye [15:08:56] 10SRE-swift-storage, 10TimedMediaHandler, 10Regression, 10Wikimedia-production-error: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.wav.m3u8 - https://phabricator.wikimedia.org/T348753 (10Aklapper) [15:09:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro... [15:10:10] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:11:34] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1007.wikimedia.org with reason: host reimage [15:11:40] 10SRE-swift-storage, 10TimedMediaHandler, 10Regression, 10Wikimedia-production-error: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.wav.m3u8 - https://phabricator.wikimedia.org/T348753 (10brion) Note there shouldn't be any streaming t... [15:13:48] 10SRE-swift-storage, 10TimedMediaHandler, 10Regression, 10Wikimedia-production-error: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.wav.m3u8 - https://phabricator.wikimedia.org/T348753 (10brion) Oh I see it's saving empty playlists an... [15:14:38] 10SRE-swift-storage, 10TimedMediaHandler, 10Regression, 10Wikimedia-production-error: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.wav.m3u8 - https://phabricator.wikimedia.org/T348753 (10brion) I could make these non-fatal errors I g... [15:14:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1007.wikimedia.org with reason: host reimage [15:14:58] 10SRE, 10Infrastructure-Foundations: JSON Schema have depreciated jsonschema.RefResolver - https://phabricator.wikimedia.org/T348764 (10cmooney) p:05Triage→03Low [15:15:16] jouncebot: now [15:15:16] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [15:15:22] is anyone deploying at the moment? 
[15:15:27] (03PS6) 10Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [15:15:29] I’d be interested in doing a small backport [15:15:37] (I’ll proceed in a few minutes if nobody objects) [15:15:42] Lucas_WMDE: Go for it. [15:16:03] (03CR) 10CI reject: [V: 04-1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [15:16:36] so many UploadChunkFile errors in logspam-watch D: [15:16:55] (03PS2) 10Lucas Werkmeister (WMDE): specials: Use correct title in NewPagesPager [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965211 (https://phabricator.wikimedia.org/T348665) (owner: 10Jforrester) [15:16:56] !log lucaswerkmeister-wmde@deploy2002 Backport cancelled. [15:17:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965211 (https://phabricator.wikimedia.org/T348665) (owner: 10Jforrester) [15:17:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16591 [15:17:27] (03CR) 10Majavah: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/965098 (owner: 10FNegri) [15:18:48] (03PS7) 10Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [15:19:22] (03CR) 10CI reject: [V: 04-1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [15:19:50] 10SRE, 10Infrastructure-Foundations: JSON Schema have depreciated jsonschema.RefResolver - 
https://phabricator.wikimedia.org/T348764 (cmooney) Adjusted tox.ini for the homer public repo for now to use jsonschema <4.18: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/965516 [15:24:46] (PS8) Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [15:25:19] (CR) CI reject: [V: -1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney) [15:26:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965144 (owner: TrainBranchBot) [15:26:41] SRE-swift-storage, TimedMediaHandler, Regression, Wikimedia-production-error: [026f63a8-bebd-49dd-a536-746796d71575] /w/api.php Exception: Errors saving HLS playlist LL-Q8097_(tel)-V_Bhavya-క్రొ.wav.m3u8 - https://phabricator.wikimedia.org/T348753 (hashar) Ah great, thank you @brion. I guess yo...
[15:27:45] (03PS4) 10Ilias Sarantopoulos: ml-services: update revscoring and enable articlequality mp [deployment-charts] - 10https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) [15:28:18] (03CR) 10Ilias Sarantopoulos: ml-services: update revscoring and enable articlequality mp (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [15:30:10] (03CR) 10FNegri: [C: 03+2] wmcs::cloudlb: add cloud_production profile [puppet] - 10https://gerrit.wikimedia.org/r/965098 (owner: 10FNegri) [15:30:12] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:31:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [15:31:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1007.wikimedia.org with OS bullseye [15:31:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1007.wiki... 
[15:31:45] (03PS1) 10Ayounsi: Permit anycast NTP from cloud-hosts [homer/public] - 10https://gerrit.wikimedia.org/r/965530 [15:32:00] (03PS1) 10MVernon: netboot: treat moss-be servers like new-style ms-be ones [puppet] - 10https://gerrit.wikimedia.org/r/965531 (https://phabricator.wikimedia.org/T342674) [15:32:19] (03CR) 10Elukey: [C: 03+1] "Left a nit, then please proceed :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) (owner: 10Ilias Sarantopoulos) [15:32:56] (03Merged) 10jenkins-bot: specials: Use correct title in NewPagesPager [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965211 (https://phabricator.wikimedia.org/T348665) (owner: 10Jforrester) [15:33:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) @bking @Papaul I was able to change netbox to Public Vlan redoing most of the steps for setting up... 
[15:33:26] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:965211|specials: Use correct title in NewPagesPager (T348665)]] [15:33:30] T348665: Translatable edit summary comments not displaying on Special:NewPages - https://phabricator.wikimedia.org/T348665 [15:33:45] (03CR) 10Ssingh: Permit anycast NTP from cloud-hosts (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/965530 (owner: 10Ayounsi) [15:34:28] (03PS9) 10Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [15:34:41] !log lucaswerkmeister-wmde@deploy2002 jforrester and lucaswerkmeister-wmde: Backport for [[gerrit:965211|specials: Use correct title in NewPagesPager (T348665)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:34:57] testing [15:35:01] (03CR) 10CI reject: [V: 04-1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - 10https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [15:35:18] seems to fix the issue [15:35:19] !log lucaswerkmeister-wmde@deploy2002 jforrester and lucaswerkmeister-wmde: Continuing with sync [15:35:49] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, and 2 others: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10MatthewVernon) @Papaul thanks very much! I think I understand the grub-install problem (see the CR I just opened). [15:37:51] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, and 2 others: Q1:rack/setup/install moss-be1003 - https://phabricator.wikimedia.org/T342675 (10MatthewVernon) @Jclark-ctr are you OK to JBOD the disks on this system, please? i.e. make them all non-RAID devices rather than single-drive RAID-0 arrays. [the... 
[15:39:10] (03PS1) 10AikoChou: ml-services: deploy a revertrisk-la that uses kserve 0.10 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/965532 (https://phabricator.wikimedia.org/T347550) [15:39:16] (03PS1) 10Muehlenhoff: Add a define to run periodic metric checks [puppet] - 10https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) [15:39:31] (03PS2) 10Muehlenhoff: Add a define to run periodic metric checks [puppet] - 10https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) [15:39:57] (03CR) 10CI reject: [V: 04-1] Add a define to run periodic metric checks [puppet] - 10https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [15:40:30] (03CR) 10Ayounsi: Permit anycast NTP from cloud-hosts (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/965530 (owner: 10Ayounsi) [15:40:47] (03CR) 10Elukey: [C: 03+1] ml-services: deploy a revertrisk-la that uses kserve 0.10 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/965532 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou) [15:41:18] (03CR) 10Ssingh: [C: 03+1] Permit anycast NTP from cloud-hosts [homer/public] - 10https://gerrit.wikimedia.org/r/965530 (owner: 10Ayounsi) [15:41:34] hmmm [15:41:41] (03PS3) 10Muehlenhoff: Add a define to run periodic metric checks [puppet] - 10https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) [15:41:43] scap failed with socket.timeout: timed out [15:41:45] but still exited zero [15:41:54] (03CR) 10Ayounsi: [C: 03+2] Permit anycast NTP from cloud-hosts [homer/public] - 10https://gerrit.wikimedia.org/r/965530 (owner: 10Ayounsi) [15:42:07] (03CR) 10Ayounsi: [C: 03+2] Permit anycast NTP from cloud-hosts (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/965530 (owner: 10Ayounsi) [15:42:17] (03CR) 10David Caro: metricsinfra.alertmanager: add victorops and paging route (031 comment) [puppet] - 
https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T320973) (owner: David Caro) [15:42:28] (Merged) jenkins-bot: Permit anycast NTP from cloud-hosts [homer/public] - https://gerrit.wikimedia.org/r/965530 (owner: Ayounsi) [15:42:33] !log (mostly?) Finished scap: Backport for [[gerrit:965211|specials: Use correct title in NewPagesPager (T348665)]] (duration: 07m 13s) – scap failed in the purgeMessageBlobStore step (php-fpm-restarts finished) [15:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:42] not entirely sure what to do [15:42:45] T348665: Translatable edit summary comments not displaying on Special:NewPages - https://phabricator.wikimedia.org/T348665 [15:42:49] * Lucas_WMDE looks up that script [15:42:58] (PS4) Muehlenhoff: Add a define to run periodic metric checks [puppet] - https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) [15:43:37] “purge the MessageBlobStore cache” [15:43:43] I doubt that’s necessary, this patch didn’t touch i18n [15:44:12] !log installing libxpm security updates [15:44:13] oh wait, did it crash *during* that step… or did it crash *after* it, while trying to log to logmsgbot?
🤔 [15:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:18] (CR) AikoChou: [C: +2] ml-services: deploy a revertrisk-la that uses kserve 0.10 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/965532 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou) [15:44:53] here’s the error that happened: https://phabricator.wikimedia.org/P52918 [15:45:09] (Merged) jenkins-bot: ml-services: deploy a revertrisk-la that uses kserve 0.10 in staging [deployment-charts] - https://gerrit.wikimedia.org/r/965532 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou) [15:45:10] I think the “socket timeout” refers to the connection to logmsgbot [15:45:16] so it’s fine, I guess [15:46:03] !log restart FPM on mediawiki canaries to pick up new libxpm [15:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:10] (CR) Filippo Giunchedi: metricsinfra.alertmanager: add victorops and paging route (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T320973) (owner: David Caro) [15:47:31] (PS10) Cathal Mooney: Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) [15:48:01] (CR) CI reject: [V: -1] Streamline BGP neighbor definition in YAML and inclusion in templates [homer/public] - https://gerrit.wikimedia.org/r/965516 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney) [15:48:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2002.codfw.wmnet [15:48:52] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1102.mgmt.eqiad.wmnet with reboot policy FORCED [15:54:58] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore2002.codfw.wmnet [15:55:36] (CR)
10Arnaudb: [C: 03+1] "looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/965531 (https://phabricator.wikimedia.org/T342674) (owner: 10MVernon) [15:56:14] (03CR) 10MVernon: [C: 03+2] netboot: treat moss-be servers like new-style ms-be ones [puppet] - 10https://gerrit.wikimedia.org/r/965531 (https://phabricator.wikimedia.org/T342674) (owner: 10MVernon) [15:56:39] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore2001.codfw.wmnet [15:56:56] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye [15:57:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye [15:57:12] (03CR) 10David Caro: metricsinfra.alertmanager: add victorops and paging route (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T320973) (owner: 10David Caro) [15:57:13] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [16:00:06] jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T1600) [16:00:06] No Gerrit patches in the queue for this window AFAICS. 
[16:00:19] (03PS1) 10JHathaway: dev env: PS1 function for to show the puppet env [puppet] - 10https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970) [16:00:24] (03PS3) 10David Caro: metricsinfra.alertmanager: add victorops and paging route [puppet] - 10https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T323713) [16:00:28] (03CR) 10David Caro: metricsinfra.alertmanager: add victorops and paging route (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T323713) (owner: 10David Caro) [16:00:49] (03CR) 10CI reject: [V: 04-1] dev env: PS1 function for to show the puppet env [puppet] - 10https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [16:00:51] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [16:01:28] (03PS36) 10Btullis: Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [16:01:30] (03PS17) 10Btullis: Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) [16:01:32] (03PS47) 10Btullis: Deploy multiple spark shuffler services to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/963304 (https://phabricator.wikimedia.org/T344910) [16:02:03] (03CR) 10CI reject: [V: 04-1] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:02:59] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:03:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
sessionstore2001.codfw.wmnet [16:03:49] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on ldap-rw[1001,2001].wikimedia.org with reason: setup in progress [16:04:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on ldap-rw[1001,2001].wikimedia.org with reason: setup in progress [16:04:36] (03PS2) 10JHathaway: dev env: PS1 function for to show the puppet env [puppet] - 10https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970) [16:05:03] (03CR) 10Majavah: [C: 03+1] metricsinfra.alertmanager: add victorops and paging route [puppet] - 10https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T323713) (owner: 10David Caro) [16:06:43] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 28): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44028/console" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:06:53] (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:08:35] (03CR) 10Btullis: Support multiple spark yarn shufflers in parallel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:09:18] !log installing batik security updates [16:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:42] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [16:11:12] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [16:11:17] (03CR) 10Btullis: [C: 03+2] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:12:50] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [16:13:15] (03CR) 10Btullis: [C: 03+2] Support configuring the spark3 defaults with the default shuffler [puppet] - 10https://gerrit.wikimedia.org/r/963989 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:13:31] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cp1101 - jclark@cumin1001" [16:14:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cp1101 - jclark@cumin1001" [16:14:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10cmooney) > Networking Setup: 2 connections, 10G. public1-*-eqiad This is incorrect. All these hosts should have a single con... [16:14:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1009.wikimedia.org with OS bullseye [16:14:49] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye [16:14:52] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1008.wikimedia.org with OS bullseye [16:14:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.... 
[16:15:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.... [16:15:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.... [16:15:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10cmooney) [16:16:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) [16:17:27] !log disable puppet on A:dns-rec to roll out CR: 965187 T348041 [16:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:31] T348041: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 [16:18:32] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: announce ns0 IP from bird (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/965187 (https://phabricator.wikimedia.org/T348041) (owner: 10Ssingh) [16:19:25] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1103.mgmt.eqiad.wmnet with reboot policy FORCED [16:19:25] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [16:19:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [16:22:26] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1103'] [16:24:14] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1104.mgmt.eqiad.wmnet with reboot policy FORCED [16:25:07] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1104.mgmt.eqiad.wmnet with reboot policy FORCED [16:26:13] !log enable puppet on A:dns-rec and force agent run: T348041 [16:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:18] T348041: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 [16:26:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [16:27:32] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1104'] [16:27:49] !log pt1979@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1001" [16:28:19] (03CR) 10Stevemunene: Switch druid1004 zookeeper node with druid1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965499 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [16:28:52] !log pt1979@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1001" [16:28:57] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1063.eqiad.wmnet with OS bullseye [16:29:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by pt1979@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye completed: - cloud... [16:31:57] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [16:33:42] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1103'] [16:34:16] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, 10serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10Jdforrester-WMF) [16:34:24] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 4 others: Migrate functions-orchestrator service to mw-api-int - https://phabricator.wikimedia.org/T347397 (10Jdforrester-WMF) [16:34:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [16:35:12] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [16:35:55] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) [16:36:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) [16:36:45] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) The static routes have been removed and `ns[01]` are now announced via `bird`. Thanks to @ayounsi for his help with this! 
[16:36:51] 10SRE, 10Traffic: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) [16:37:24] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) 05Open→03Resolved a:03ssingh [16:40:50] (03PS9) 10Ryan Kemper: airflow-wmde: Create scap deployment source for wmde [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [16:41:01] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye [16:41:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [16:50:27] (03PS1) 10Sbailey: Set UseParserMigration true in wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965542 (https://phabricator.wikimedia.org/T333179) [16:50:55] !log pt1979@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1001" [16:53:27] (03CR) 10Sbailey: "Sets config value and removes unused parsoid config variable as well." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965542 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [16:53:52] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage [16:54:34] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 751 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:55:14] !log pt1979@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1001" [16:55:20] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1064.eqiad.wmnet with OS bullseye [16:55:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye completed: - cloud... [16:55:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) [16:57:10] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T1700) [17:03:52] (03PS1) 10FNegri: [openstack] remove hiera override for 2 hosts [puppet] - 10https://gerrit.wikimedia.org/r/965546 (https://phabricator.wikimedia.org/T341285) [17:06:16] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44029/console" [puppet] - 10https://gerrit.wikimedia.org/r/965546 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [17:07:37] (03CR) 10C. 
Scott Ananian: Set UseParserMigration true in wmf-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965542 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [17:11:06] (03PS2) 10Sbailey: Set UseParserMigration true in wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965542 (https://phabricator.wikimedia.org/T333179) [17:11:52] (03CR) 10Sbailey: Set UseParserMigration true in wmf-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965542 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [17:12:23] !log pt1979@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1001" [17:12:26] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1104'] [17:12:57] (03CR) 10Andrew Bogott: [C: 03+1] "This probably just a leftover from an earlier incremental designate upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/965546 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [17:13:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [17:13:23] !log pt1979@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1001" [17:13:29] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1067.eqiad.wmnet with OS bullseye [17:13:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye completed: - cloud... 
[17:15:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) [17:15:23] (03CR) 10FNegri: [V: 03+1 C: 03+2] [openstack] remove hiera override for 2 hosts [puppet] - 10https://gerrit.wikimedia.org/r/965546 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [17:16:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10Papaul) 05Open→03Resolved a:03Papaul @Jclark-ctr @Andrew this now complete. I update the switch ports as recommended @ https://wikitech.wikimedia.org... [17:19:59] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [17:20:38] (03PS1) 10Andrew Bogott: ceph radosgw: don't allow the 'reader' role to create/delete objects [puppet] - 10https://gerrit.wikimedia.org/r/965549 (https://phabricator.wikimedia.org/T276961) [17:21:34] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1105 [17:22:38] (03CR) 10Andrew Bogott: [C: 03+2] ceph radosgw: don't allow the 'reader' role to create/delete objects [puppet] - 10https://gerrit.wikimedia.org/r/965549 (https://phabricator.wikimedia.org/T276961) (owner: 10Andrew Bogott) [17:22:57] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1105 [17:23:54] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1105.mgmt.eqiad.wmnet with reboot policy FORCED [17:25:28] jouncebot: nowandnext [17:25:28] For the next 0 hour(s) and 34 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T1700) [17:25:28] In 0 hour(s) and 34 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T1800) [17:26:06] !log 
vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1106 [17:26:08] (03PS1) 10Urbanecm: Revert "Growth: Enable Welcome survey user research for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965219 (https://phabricator.wikimedia.org/T342353) [17:27:26] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1106 [17:31:03] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1008 [17:32:22] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:32:28] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:33:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1008 [17:34:05] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1009 [17:34:38] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:00] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1009.wikimedia.org with OS bullseye [17:35:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1010.wikimedia.org with OS bullseye [17:35:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.wikimedia.org with OS bullseye [17:35:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki... 
[17:35:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1009 [17:35:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki... [17:35:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki... [17:36:27] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1010 [17:37:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1010 [17:43:00] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye [17:43:01] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1009.wikimedia.org with OS bullseye [17:43:08] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1008.wikimedia.org with OS bullseye [17:43:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.... 
[17:43:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.... [17:43:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.... [17:53:32] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102'] [17:55:26] (03CR) 10C. Scott Ananian: [C: 03+1] Set UseParserMigration true in wmf-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965542 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [17:57:36] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1106.mgmt.eqiad.wmnet with reboot policy FORCED [17:59:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [18:00:06] hashar and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T1800) [18:00:43] (03PS2) 10Jdlrobson: Beta cluster: mobile web click tracking schema at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965246 (https://phabricator.wikimedia.org/T346106) [18:25:18] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [18:31:30] (03PS2) 10Urbanecm: Revert "Growth: Enable Welcome survey user research for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965219 (https://phabricator.wikimedia.org/T342353) [18:41:23] 10SRE, 10ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (10Papaul) @Eevans thanks. what about next week Monday the 16th at 10:00am CT [18:41:49] (03PS16) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [18:47:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:49:22] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:49:53] (03CR) 10Gehel: wdqs: Set up graph_split hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) 
(owner: 10Bking) [18:50:53] (03PS1) 10Jforrester: Don't try to lock to serialize m3u8 file writes [extensions/TimedMediaHandler] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965220 (https://phabricator.wikimedia.org/T348689) [18:52:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:13] (03PS17) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [18:58:08] (03PS18) 10Ryan Kemper: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [18:58:10] (03PS4) 10Ryan Kemper: wdqs: bring graph split hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/963777 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [18:59:59] (03PS19) 10Ryan Kemper: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [19:00:01] (03PS5) 10Ryan Kemper: wdqs: bring graph split hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/963777 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [19:00:34] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [19:01:47] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: new graph split hosts T347505 [19:01:59] T347505: Prepare new WDQS hosts for graph splitting - https://phabricator.wikimedia.org/T347505 [19:02:02] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 
wdqs[1022-1024].eqiad.wmnet with reason: new graph split hosts T347505 [19:02:19] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [19:02:29] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/963777 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [19:03:07] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: bring graph split hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/963777 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [19:03:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1010.wikimedia.org with OS bullseye [19:03:16] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1009.wikimedia.org with OS bullseye [19:03:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.wikimedia.org with OS bullseye [19:03:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki... [19:03:28] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [19:03:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki... 
[19:03:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki... [19:14:39] (03PS1) 10Ryan Kemper: wdqs: fix graph split cyclical dependency [puppet] - 10https://gerrit.wikimedia.org/r/965560 (https://phabricator.wikimedia.org/T347505) [19:15:01] (03CR) 10Gehel: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/965560 (https://phabricator.wikimedia.org/T347505) (owner: 10Ryan Kemper) [19:15:11] (03PS2) 10Ryan Kemper: wdqs: fix graph split cyclical dependency [puppet] - 10https://gerrit.wikimedia.org/r/965560 (https://phabricator.wikimedia.org/T347505) [19:15:31] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: fix graph split cyclical dependency [puppet] - 10https://gerrit.wikimedia.org/r/965560 (https://phabricator.wikimedia.org/T347505) (owner: 10Ryan Kemper) [19:15:33] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: fix graph split cyclical dependency [puppet] - 10https://gerrit.wikimedia.org/r/965560 (https://phabricator.wikimedia.org/T347505) (owner: 10Ryan Kemper) [19:16:19] (03PS1) 10Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) [19:16:51] 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q2), 10Sustainability (Incident Followup): Alert when no data is received from Prometheus in a certain amount of time - https://phabricator.wikimedia.org/T336448 (10andrea.denisse) 05Open→03In progress [19:26:22] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/965561/44030/" [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [19:31:48] !log jclark@cumin1001 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1010'] [19:33:41] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009'] [19:34:37] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:34:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [19:35:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:35:16] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [19:35:28] (03PS1) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on s7 group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965562 (https://phabricator.wikimedia.org/T315353) [19:36:16] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:36:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [19:37:45] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:37:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [19:38:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1010'] [19:40:03] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:40:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1009'] [19:40:10] !log jclark@cumin1001 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [19:40:16] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:40:24] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [19:40:44] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:40:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [19:41:42] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:41:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [19:41:52] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: new graph split hosts T347505 [19:41:57] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: new graph split hosts T347505 [19:43:04] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:43:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [19:45:27] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1009.wikimedia.org with OS bullseye [19:45:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.... 
[19:45:41] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye [19:45:49] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:45:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.... [19:45:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [19:46:21] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1008.mgmt.eqiad.wmnet with reboot policy FORCED [19:47:44] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1008 [19:47:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1008 [19:49:16] (03PS5) 10SBassett: Allow FundraiseUp scripts in Donatewiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: 10Ejegg) [19:54:05] jouncebot: next [19:54:05] In 0 hour(s) and 5 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T2000) [19:55:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1008.mgmt.eqiad.wmnet with reboot policy FORCED [19:55:47] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:55:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [19:56:29] !log jclark@cumin1001 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:56:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [19:57:01] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [19:58:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:58:52] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [19:58:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [19:59:33] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1009.wikimedia.org with reason: host reimage [20:00:05] brennen, TheresNoTime, and dr0ptp4kt: Time to snap out of that daydream and deploy UTC late backport and config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231012T2000). [20:00:06] greg-g, kimberly_sarabia, Urbanecm, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1010.wikimedia.org with reason: host reimage [20:00:20] hi [20:00:27] hello [20:00:30] Hi [20:00:45] I can deploy if needed, but dunno if I should :) [20:00:55] I am here [20:01:12] urbanecm: we do have a trainee today :) [20:01:13] ahoy hoy [20:01:13] hey urbanecm - we're doing an actual training this time 'round. [20:01:49] that trainee has been very good about pre-reviewing the patches :) [20:01:52] brennen: sounds good to me. 
i'll wait for someone to deploy for me then :)) [20:02:40] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1009.wikimedia.org with reason: host reimage [20:04:28] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:04:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [20:05:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1010.wikimedia.org with reason: host reimage [20:05:44] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:05:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [20:06:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:06:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [20:06:30] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:06:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [20:06:40] good time for cumin spam :) [20:07:24] if i can help with anything, do feel free to let me know :). 
[20:07:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:07:45] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [20:07:59] urbanecm: will do, we're getting set up with all the right windows :) [20:08:01] * greg-g assumes there's video chatting going on with the trainers + trainee [20:08:05] ^ [20:09:45] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:09:47] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [20:09:56] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:10:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [20:10:42] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:10:44] greg-g: ready? [20:10:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [20:10:47] yup! 
[20:10:55] okay, getting ready to go [20:13:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dr0ptp4kt@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: 10Ejegg) [20:14:58] (03Merged) 10jenkins-bot: Allow FundraiseUp scripts in Donatewiki CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: 10Ejegg) [20:15:12] !log dr0ptp4kt@deploy2002 Started scap: Backport for [[gerrit:957983|Allow FundraiseUp scripts in Donatewiki CSP (T345379)]] [20:15:17] T345379: Add CSP for Fundraiseup on DonateWiki - https://phabricator.wikimedia.org/T345379 [20:16:08] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:16:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [20:16:27] !log dr0ptp4kt@deploy2002 dr0ptp4kt and ejegg: Backport for [[gerrit:957983|Allow FundraiseUp scripts in Donatewiki CSP (T345379)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:16:31] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:16:33] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1008'] [20:17:08] I see the updated CSP when using mwdebug [20:17:10] greg-g: would you please test on mwdebug and confirm back ? 
[20:17:14] you're too quick [20:17:15] beat ya [20:17:17] :P [20:17:20] i will continue with sync [20:17:25] !log dr0ptp4kt@deploy2002 dr0ptp4kt and ejegg: Continuing with sync [20:17:59] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:21:10] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:22:52] I just got a DBQueryTimeoutError on loginwiki looking at 1000 new users (https://login.wikimedia.org/w/index.php?title=Special:Log/newusers&type=newusers&user=&offset=20231011163650%7C40075009) but previous groups of 1000 didn't have any problems a few minutes ago, did something just change? (ac520575-a3cb-4958-8fe3-326a5cfac8ca) [20:22:53] !log dr0ptp4kt@deploy2002 Finished scap: Backport for [[gerrit:957983|Allow FundraiseUp scripts in Donatewiki CSP (T345379)]] (duration: 07m 40s) [20:22:57] T345379: Add CSP for Fundraiseup on DonateWiki - https://phabricator.wikimedia.org/T345379 [20:23:54] greg-g: it's live. holler if a problem [20:23:57] DannyS712: seems to be slower than usual. same on meta. train related? [20:24:01] dr0ptp4kt: first deploy! [20:24:03] (03PS1) 10Jdrewniak: [Prototype] Make prototype skin specific & minor fixes [skins/Vector] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965223 [20:24:39] dr0ptp4kt: verified [20:24:47] thx [20:25:06] * jan_drewniak hey everyone, I might have a last minute addition to the deployment window. 
https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/965223 I know syncs take a long time, so I can deploy it myself after everyone is done [20:25:47] * jan_drewniak (...I keep pressing shift+enter after I changed that option on Slack) :P [20:25:53] urbanecm just reloaded and it worked fine, hopefully it was just a one-off (since it was working for previous batches of 1000 right before) - will note if it happens again [20:25:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:26:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1010.wikimedia.org with OS bullseye [20:26:06] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:26:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1009.wikimedia.org with OS bullseye [20:26:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki... [20:26:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [20:26:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki... 
[20:26:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [20:27:57] kimberly_sarabia: I have no idea how to deploy schemas/event, actually :) [20:28:09] thcipriani: they're deployed automatically, no action needed. [20:28:13] +2 is enough [20:28:20] which seems to have happened [20:28:21] ah, ok, looks like it was already +2d [20:29:02] it might need to be enabled in config though (https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#Registering_the_stream_with_EventLogging) [20:29:21] (no, it's not a new schema) [20:29:36] kimberly_sarabia: second question, on https://gerrit.wikimedia.org/r/965246/ does enwiki need to be explicitly set to 1? Since the default is 1? [20:30:41] For the second question. Yes. [20:31:32] okie doke, wanted to verify, we can merge that one :) [20:32:19] For the first issue, let me double-check. I was told to backport it, but I might be misreading. I was trying to remember how I did it in the past but let me ask. 
[20:33:27] (03PS3) 10Dr0ptp4kt: Beta cluster: mobile web click tracking schema at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965246 (https://phabricator.wikimedia.org/T346106) (owner: 10Jdlrobson) [20:33:47] (03CR) 10Dr0ptp4kt: [C: 03+2] Beta cluster: mobile web click tracking schema at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965246 (https://phabricator.wikimedia.org/T346106) (owner: 10Jdlrobson) [20:34:18] kimberly_sarabia: this should be deployed on beta cluster in about 10 minutes [20:34:30] (03Merged) 10jenkins-bot: Beta cluster: mobile web click tracking schema at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965246 (https://phabricator.wikimedia.org/T346106) (owner: 10Jdlrobson) [20:34:51] kimberly_sarabia: my knowledge of schemas isn't perfect, but...i think in https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/965258/, the version in ID should've been bumped to a higher version, since new fields are added? [20:34:58] dr0ptp4kt: gotcha [20:35:10] this will be tracked at https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ [20:35:19] ^ next run there should make your patch live :) [20:35:24] (in beta) [20:37:01] urbanecm: technically, i didn't add a new field, i just moved it down the tree [20:37:06] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1008 [20:37:08] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1008 [20:37:12] But one moment, let me double check [20:37:30] thanks kimberly_sarabia and urbanecm [20:37:36] in the interim we'll get you going urbanecm [20:37:37] urbanecm: about to start on yours [20:37:51] ack! 
[20:38:02] (03CR) 10CI reject: [V: 04-1] [Prototype] Make prototype skin specific & minor fixes [skins/Vector] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965223 (owner: 10Jdrewniak) [20:38:27] (03PS3) 10Dr0ptp4kt: Revert "Growth: Enable Welcome survey user research for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965219 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [20:38:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1008.wikimedia.org with OS bullseye [20:38:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dr0ptp4kt@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965219 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [20:38:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.... 
[20:39:18] (Abandoned) Andrea Denisse: alert: Add the alert (icinga + alertmanager) hosts Bookworm node definitions [puppet] - https://gerrit.wikimedia.org/r/934245 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse) [20:40:07] (Merged) jenkins-bot: Revert "Growth: Enable Welcome survey user research for enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/965219 (https://phabricator.wikimedia.org/T342353) (owner: Urbanecm) [20:41:50] !log dr0ptp4kt@deploy2002 Started scap: Backport for [[gerrit:965219|Revert "Growth: Enable Welcome survey user research for enwiki" (T342353)]] [20:42:00] T342353: enable opt-in checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T342353 [20:43:06] !log dr0ptp4kt@deploy2002 dr0ptp4kt and urbanecm: Backport for [[gerrit:965219|Revert "Growth: Enable Welcome survey user research for enwiki" (T342353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:43:26] urbanecm: would you please check mwdebug servers?
[20:43:29] sure [20:43:59] looks perfect [20:45:01] thx, about to continue with sync [20:45:03] !log dr0ptp4kt@deploy2002 dr0ptp4kt and urbanecm: Continuing with sync [20:45:05] ty [20:47:45] (PS2) Thcipriani: Enable wgDiscussionToolsEnablePermalinksBackend on s7 group2 [mediawiki-config] - https://gerrit.wikimedia.org/r/965562 (https://phabricator.wikimedia.org/T315353) (owner: Bartosz Dziewoński) [20:50:22] !log dr0ptp4kt@deploy2002 Finished scap: Backport for [[gerrit:965219|Revert "Growth: Enable Welcome survey user research for enwiki" (T342353)]] (duration: 08m 32s) [20:50:32] T342353: enable opt-in checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T342353 [20:51:22] urbanecm: it's deployed [20:51:27] thanks [20:51:34] PROBLEM - Check systemd state on mw1489 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:35] alrighty, I'm going to jump in to close us out, nice work dr0ptp4kt! No t-shirt yet :D [20:51:46] * dr0ptp4kt wipes sweat from brow [20:52:44] MatmaRex: still with us? [20:52:51] yeah [20:52:55] thcipriani: i think we cancelled the shirts? :D [20:53:11] are we out of time? i guess my things aren't too urgent [20:53:28] eh, I can do the last two if you've got time (and sbailey has time) [20:53:45] I can hang as long as needed [20:54:02] alrighty [20:54:48] MatmaRex: so I'm going to deploy yours, then do you need me to run these maintenance scripts as well? Looks like they'll take a bit, is that right?
[20:56:03] thcipriani: yes. run them in a screen or something, and redirect the output somewhere [20:56:08] they might take a few hours to a few days [20:56:24] thanks [20:56:25] i'm hanging around for the sbailey patch too [20:56:36] (Abandoned) Jdrewniak: [Prototype] Make prototype skin specific & minor fixes [skins/Vector] (wmf/1.41.0-wmf.30) - https://gerrit.wikimedia.org/r/965223 (owner: Jdrewniak) [20:56:51] Maybe do the sbailey patch first, should be quick and scott and I are both on it [20:57:17] yeah, i can wait [20:57:19] (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/965562 (https://phabricator.wikimedia.org/T315353) (owner: Bartosz Dziewoński) [20:57:58] oh whoops, just saw that after I hit enter [20:58:08] (Merged) jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on s7 group2 [mediawiki-config] - https://gerrit.wikimedia.org/r/965562 (https://phabricator.wikimedia.org/T315353) (owner: Bartosz Dziewoński) [20:58:09] no prob, we'll wait [20:58:14] sorry about that :( [20:58:22] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:965562|Enable wgDiscussionToolsEnablePermalinksBackend on s7 group2 (T315353)]] [20:58:26] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [20:59:36] !log thcipriani@deploy2002 thcipriani and matmarex: Backport for [[gerrit:965562|Enable wgDiscussionToolsEnablePermalinksBackend on s7 group2 (T315353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:59:57] ^ anything to check there?
[21:00:24] nope [21:00:43] k [21:00:51] !log thcipriani@deploy2002 thcipriani and matmarex: Continuing with sync [21:01:42] actually, i can erify it - this page works with mwdebug, doesn't work before the change: https://vi.wikipedia.org/wiki/Đặc_biệt:FindComment?idorname=c-DHN-20230805180800-Giao_diện_di_động [21:01:49] verify* [21:02:32] (PS3) C. Scott Ananian: Set UseParserMigration true in wmf-config [mediawiki-config] - https://gerrit.wikimedia.org/r/965542 (https://phabricator.wikimedia.org/T333179) (owner: Sbailey) [21:02:34] nice :) [21:05:39] thcipriani: We can just skip `Refactor Schema Structure` [21:05:47] Will remove it from the list [21:06:17] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:965562|Enable wgDiscussionToolsEnablePermalinksBackend on s7 group2 (T315353)]] (duration: 07m 55s) [21:06:26] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [21:06:26] kimberly_sarabia: sure, thanks for cleanup! [21:06:29] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [21:06:40] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [21:06:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T343198)', diff saved to https://phabricator.wikimedia.org/P52920 and previous config saved to /var/cache/conftool/dbconfig/20231012-210646-arnaudb.json [21:06:51] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [21:07:20] MatmaRex: alright, I'll tee both of those maintenance scripts out to tmp /tmp/persistentRevisionThreadItems-s7.log and /tmp/persistentRevisionThreadItems-s6.log running in tmux fine to run both at the same time?
[21:08:10] thanks. yes, should be fine at the same time [21:09:20] !log mwmaint2002:foreachwikiindblist 'group2 & s7' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230613000000 | tee /tmp/persistentRevisionThreadItems-s7.log [21:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:30] !log mwmaint2002:foreachwikiindblist 'group2 & s6' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230613000000 | tee /tmp/persistentRevisionThreadItems-s6.log [21:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:56] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [21:10:12] sbailey: you're up, sorry for delay [21:10:22] sounds good [21:10:25] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [21:10:25] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [21:10:40] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [21:10:58] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [21:11:09] (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/965542 (https://phabricator.wikimedia.org/T333179) (owner: Sbailey) [21:11:48] (Merged) jenkins-bot: Set UseParserMigration true in wmf-config [mediawiki-config] - https://gerrit.wikimedia.org/r/965542 (https://phabricator.wikimedia.org/T333179) (owner: Sbailey) [21:12:02] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:965542|Set UseParserMigration true in wmf-config (T333179)]] [21:12:16] T333179: (Re)deploy ParserMigration extension to production - https://phabricator.wikimedia.org/T333179 [21:13:20] !log thcipriani@deploy2002 sbailey and thcipriani: Backport for [[gerrit:965542|Set UseParserMigration true in wmf-config (T333179)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:13:43] ^ sbailey should be live on mwdebug, check please! [21:13:51] checking [21:16:06] mwdebug2001, right? [21:16:30] cscott: live on all of the mwdebug machines, mwdebug2001 should work [21:19:43] sbailey: i'm seeing the parser migration extension installed on enwiki and (eg) https://en.wikipedia.org/w/index.php?title=4-Hydroxybenzoate_geranyltransferase&action=parsermigration-edit works.  The ?useparsoid=1 hack seems to work as well, so everything looks good to me.  How about you? [21:20:31] Having trouble with access permissions, so going with your results [21:21:40] Good to sync [21:21:53] +1 [21:21:57] ack, thanks for checking all, going live! [21:22:06] !log thcipriani@deploy2002 sbailey and thcipriani: Continuing with sync [21:27:22] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:965542|Set UseParserMigration true in wmf-config (T333179)]] (duration: 15m 20s) [21:27:28] T333179: (Re)deploy ParserMigration extension to production - https://phabricator.wikimedia.org/T333179 [21:27:39] ^ sbailey cscott all done! Thanks for hanging out a little late all! [21:27:44] thank you! [21:27:53] thanks thcipriani [21:27:59] Thank you [21:28:04] <3 thanks all [21:30:41] SRE-swift-storage, Commons, MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (Don-vip) Three more errors today for me, including from other sites than Flickr: - https://common... [21:59:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.wikimedia.org with OS bullseye [21:59:10] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki...
[22:00:11] urbanecm: if you're around, can you check the progress of the enwiki/s1 script at https://phabricator.wikimedia.org/T315510 for me? thanks [22:10:33] (CR) Subramanya Sastry: [C: +1] "We can get this in along with a puppet patch to change mariadb my.cnf file since we need to reduce the innodb_buffer_pool_size to 4600M (f" [dns] - https://gerrit.wikimedia.org/r/965163 (https://phabricator.wikimedia.org/T345220) (owner: Muehlenhoff) [22:25:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [22:52:00] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:53:34] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:56:48] SRE, ops-codfw: codfw: Move sessionstore2001 to B8 - https://phabricator.wikimedia.org/T348142 (Eevans) >>! In T348142#9247512, @Papaul wrote: > @Eevans thanks. what about next week Monday the 16th at 10:00am CT Perfect; Let's do it.
[23:30:54] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:38] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:22] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook