[00:38:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/956015 [00:38:12] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/956015 (owner: 10TrainBranchBot) [00:42:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:43:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343198)', diff saved to https://phabricator.wikimedia.org/P52377 and previous config saved to /var/cache/conftool/dbconfig/20230911-004331-arnaudb.json [00:43:35] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [00:44:33] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:47:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:51:52] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/956015 (owner: 10TrainBranchBot) [00:58:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2109', diff saved to https://phabricator.wikimedia.org/P52378 and previous config saved to /var/cache/conftool/dbconfig/20230911-005837-arnaudb.json [01:04:33] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [01:13:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P52379 and previous config saved to /var/cache/conftool/dbconfig/20230911-011343-arnaudb.json [01:15:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:15:56] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [01:21:27] (03PS5) 10Andrew Bogott: wmf_sink: don't assume project_name == project_id [puppet] - 10https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158) [01:21:29] (03PS1) 10Andrew Bogott: nova_fullstack_test: don't assume project_name == project_id [puppet] - 10https://gerrit.wikimedia.org/r/956088 (https://phabricator.wikimedia.org/T343158) [01:21:55] (03CR) 10Andrew Bogott: "tested in codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/956088 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [01:28:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343198)', diff saved to https://phabricator.wikimedia.org/P52380 and previous config saved to /var/cache/conftool/dbconfig/20230911-012850-arnaudb.json [01:28:52] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [01:28:54] T343198: Add pl_target_id column to pagelinks in production - 
https://phabricator.wikimedia.org/T343198 [01:29:05] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [01:29:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52381 and previous config saved to /var/cache/conftool/dbconfig/20230911-012911-arnaudb.json [02:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:43:05] PROBLEM - Disk space on dbprov1004 is CRITICAL: DISK CRITICAL - free space: /srv 546743 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov1004&var-datasource=eqiad+prometheus/ops [04:15:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:20:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:44:23] RECOVERY - Disk space on dbprov1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov1004&var-datasource=eqiad+prometheus/ops [04:49:15] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2: Marostegui https://phabricator.wikimedia.org/T346012 https://wikitech.wikimedia.org/wiki/HAProxy [04:49:33] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [04:59:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134', diff saved to https://phabricator.wikimedia.org/P52382 and previous config saved to /var/cache/conftool/dbconfig/20230911-045907-root.json [05:00:49] !log marostegui@cumin1001 START - Cookbook sre.mysql.clone of db1134.eqiad.wmnet onto db1128.eqiad.wmnet [05:09:33] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [05:14:33] (03CR) 10Muehlenhoff: [C: 03+2] Fix cloudbackup alias [puppet] - 10https://gerrit.wikimedia.org/r/955923 (owner: 10Muehlenhoff) [05:15:56] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [05:31:57] (03CR) 10Muehlenhoff: "A few additional comments" [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede) [05:38:27] (03PS1) 10Marostegui: instances.yaml: Add db1119 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/956100 
(https://phabricator.wikimedia.org/T339185) [05:38:56] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1119 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/956100 (https://phabricator.wikimedia.org/T339185) (owner: 10Marostegui) [05:40:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1119 back to s1 depooled T339185', diff saved to https://phabricator.wikimedia.org/P52383 and previous config saved to /var/cache/conftool/dbconfig/20230911-054057-marostegui.json [05:41:01] T339185: Test MariaDB + Debian bookworm on databases - https://phabricator.wikimedia.org/T339185 [06:11:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 136065 [06:12:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 136065 [06:26:33] (03CR) 10KartikMistry: [C: 03+1] Enable MinT translation service in more wikis - rollout #3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956051 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [06:43:31] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10ayounsi) Unfortunately the errors are back, even though not much it's still better to fix the issue. 
[06:50:59] (03CR) 10Muehlenhoff: [C: 03+2] Use a single ensure for managing the nftables state [puppet] - 10https://gerrit.wikimedia.org/r/955779 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [06:57:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1134.eqiad.wmnet onto db1128.eqiad.wmnet [06:57:17] (03CR) 10Muehlenhoff: [C: 03+2] Pass down the ensure to the requestctl settings [puppet] - 10https://gerrit.wikimedia.org/r/955865 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [06:58:48] (03PS1) 10Marostegui: Revert "db1128: Host crashed" [puppet] - 10https://gerrit.wikimedia.org/r/956054 [06:59:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:59:19] (03PS2) 10Kosta Harlan: Add ReportIncident extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953998 (https://phabricator.wikimedia.org/T339275) [06:59:24] (03PS2) 10Kosta Harlan: ReportIncident: Default deployment to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953999 (https://phabricator.wikimedia.org/T339275) [06:59:28] (03PS2) 10Kosta Harlan: [beta] ReportIncident: Enable on kowiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955732 (https://phabricator.wikimedia.org/T339275) [06:59:33] (03PS2) 10Kosta Harlan: [beta] Enable ReportIncident for configured beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955735 (https://phabricator.wikimedia.org/T339275) [06:59:38] (03PS2) 10Kosta Harlan: ReportIncident: Set default help page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955821 (https://phabricator.wikimedia.org/T343382) [07:00:06] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC 
morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T0700) [07:00:06] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:19] (03CR) 10Marostegui: [C: 03+2] Revert "db1128: Host crashed" [puppet] - 10https://gerrit.wikimedia.org/r/956054 (owner: 10Marostegui) [07:00:38] good morning [07:01:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 1%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52384 and previous config saved to /var/cache/conftool/dbconfig/20230911-070114-root.json [07:01:18] T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509 [07:02:50] I've not deployed a change to wmf-config/extension-list before. Do I use `scap backport` for this? [07:02:56] morning [07:04:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:04:33] yes, `scap backport` is fine [07:06:06] just make sure to do that separately to the patch enabling the extension [07:06:06] ok [07:06:29] taavi: does this stack of patches look OK to you? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/953998/ [07:06:29] kostajh: do you need someone to deploy it for you or will you self-deploy? 
(sorry, I don't remember if you have the rights or not) [07:06:41] I can self-deploy if the patches look ok [07:06:52] the intended outcome is: extension disabled in production, and enabled in kowiki on betalabs [07:08:18] seems fine on a quick glance [07:08:25] alright [07:08:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953998 (https://phabricator.wikimedia.org/T339275) (owner: 10Kosta Harlan) [07:10:46] (03Merged) 10jenkins-bot: Add ReportIncident extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953998 (https://phabricator.wikimedia.org/T339275) (owner: 10Kosta Harlan) [07:11:17] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:953998|Add ReportIncident extension (T339275)]] [07:11:21] T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275 [07:16:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 3%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52385 and previous config saved to /var/cache/conftool/dbconfig/20230911-071619-root.json [07:16:23] T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509 [07:17:32] (03PS7) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [07:17:41] (03CR) 10Slyngshede: Allow packing as a .deb (0320 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede) [07:22:42] (waiting for k8s image build/push to do its thing) [07:23:57] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:953998|Add ReportIncident extension (T339275)]] [07:23:59] trying again with `tmux`, as the connection hung up :\ [07:23:59] T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275 [07:27:23] (03CR) 10Muehlenhoff: Allow packing as a .deb (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 
10Slyngshede) [07:31:05] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/956086 (owner: 10Majavah) [07:31:21] (03CR) 10Majavah: [C: 03+2] P:wmcs::metricsinfra: add missing trailing slash to url [puppet] - 10https://gerrit.wikimedia.org/r/956086 (owner: 10Majavah) [07:31:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 5%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52386 and previous config saved to /var/cache/conftool/dbconfig/20230911-073124-root.json [07:31:29] T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509 [07:31:38] (03PS8) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [07:32:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3005.esams.wmnet [07:33:36] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:953998|Add ReportIncident extension (T339275)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:33:38] T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275 [07:35:27] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [07:35:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3005.esams.wmnet [07:36:00] (03CR) 10Filippo Giunchedi: [C: 03+1] "Not sure if you are waiting for me on merging this, at any rate I'll go ahead and merge! 
HTH" [puppet] - 10https://gerrit.wikimedia.org/r/955924 (owner: 10Brouberol) [07:36:05] (03CR) 10Filippo Giunchedi: [C: 03+2] Grant permissions on icinga to user Brouberol [puppet] - 10https://gerrit.wikimedia.org/r/955924 (owner: 10Brouberol) [07:36:24] !log kharlan@deploy1002 kharlan: Continuing with sync [07:41:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 1%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52387 and previous config saved to /var/cache/conftool/dbconfig/20230911-074116-root.json [07:42:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "Patch LGTM, thank you!. I've cc'ed Ben for an heads-up: this change won't impact existing statsd metrics, and will make graphite failover " [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: 10Andrea Denisse) [07:43:20] (03PS9) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [07:43:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:43:35] (03CR) 10Slyngshede: Allow packing as a .deb (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede) [07:45:30] 10SRE, 10Data-Platform-SRE, 10Observability-Metrics, 10superset.wikimedia.org: statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761 (10fgiunchedi) The statsd-exporter part of this work is happening in {T345790} because we need to make graphite failovers simpler. Technically... 
[07:46:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 10%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52388 and previous config saved to /var/cache/conftool/dbconfig/20230911-074629-root.json [07:46:33] T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509 [07:46:41] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:953998|Add ReportIncident extension (T339275)]] (duration: 22m 44s) [07:46:44] T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275 [07:48:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953999 (https://phabricator.wikimedia.org/T339275) (owner: 10Kosta Harlan) [07:48:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:48:36] taavi: `scap backport` is not really useful for beta cluster patches, is that correct? I can just +2 those myself via the gerrit UI? [07:48:49] (03Merged) 10jenkins-bot: ReportIncident: Default deployment to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953999 (https://phabricator.wikimedia.org/T339275) (owner: 10Kosta Harlan) [07:49:08] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:953999|ReportIncident: Default deployment to false (T339275)]] [07:49:27] (03CR) 10Filippo Giunchedi: [C: 03+1] "Very nice!" [puppet] - 10https://gerrit.wikimedia.org/r/956071 (https://phabricator.wikimedia.org/T313030) (owner: 10Majavah) [07:49:39] kostajh: `scap backport` merges the patch and pulls it to the deployment server so the next deployer won't have an unexpected git state. 
you can do that manually too, yes [07:49:44] (03PS10) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [07:49:51] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: drop toolforge.org cert monitor [puppet] - 10https://gerrit.wikimedia.org/r/956029 (owner: 10Majavah) [07:50:05] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: drop tools.wmflabs.org monitoring [puppet] - 10https://gerrit.wikimedia.org/r/956072 (owner: 10Majavah) [07:50:43] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:953999|ReportIncident: Default deployment to false (T339275)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:52:06] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:toolforge::checker: remove ToolsDB R/W check [puppet] - 10https://gerrit.wikimedia.org/r/956071 (https://phabricator.wikimedia.org/T313030) (owner: 10Majavah) [07:52:19] (03CR) 10Majavah: [C: 03+2] icinga: drop toolforge.org cert monitor [puppet] - 10https://gerrit.wikimedia.org/r/956029 (owner: 10Majavah) [07:52:35] (03CR) 10Majavah: [C: 03+2] icinga: drop tools.wmflabs.org monitoring [puppet] - 10https://gerrit.wikimedia.org/r/956072 (owner: 10Majavah) [07:52:45] (03PS2) 10Majavah: icinga: drop tools.wmflabs.org monitoring [puppet] - 10https://gerrit.wikimedia.org/r/956072 [07:53:58] !log kharlan@deploy1002 kharlan: Continuing with sync [07:54:17] taavi: ack, thanks [07:56:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 3%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52389 and previous config saved to /var/cache/conftool/dbconfig/20230911-075621-root.json [07:58:12] (03CR) 10Filippo Giunchedi: [C: 03+2] citoid: update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/955894 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo 
Giunchedi) [07:58:19] (03CR) 10Filippo Giunchedi: [C: 03+2] citoid: enable mesh tracing in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/955895 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [07:59:12] !log filippo@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [07:59:43] !log filippo@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [08:00:10] !log filippo@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [08:00:13] !log filippo@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [08:00:24] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:953999|ReportIncident: Default deployment to false (T339275)]] (duration: 11m 15s) [08:00:32] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [08:00:33] T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275 [08:00:45] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [08:01:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 25%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52390 and previous config saved to /var/cache/conftool/dbconfig/20230911-080133-root.json [08:01:37] T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509 [08:01:41] !log filippo@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [08:01:44] (03CR) 10Muehlenhoff: "Two more comments inline, which I had missed before" [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede) [08:02:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955732 (https://phabricator.wikimedia.org/T339275) (owner: 10Kosta Harlan) [08:02:07] !log filippo@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [08:02:08] (03CR) 
10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955735 (https://phabricator.wikimedia.org/T339275) (owner: 10Kosta Harlan) [08:02:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955821 (https://phabricator.wikimedia.org/T343382) (owner: 10Kosta Harlan) [08:02:26] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [08:02:53] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [08:02:59] (03Merged) 10jenkins-bot: [beta] ReportIncident: Enable on kowiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955732 (https://phabricator.wikimedia.org/T339275) (owner: 10Kosta Harlan) [08:03:25] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [08:03:28] (03Merged) 10jenkins-bot: [beta] Enable ReportIncident for configured beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955735 (https://phabricator.wikimedia.org/T339275) (owner: 10Kosta Harlan) [08:03:31] (03Merged) 10jenkins-bot: ReportIncident: Set default help page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955821 (https://phabricator.wikimedia.org/T343382) (owner: 10Kosta Harlan) [08:03:46] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:955732|[beta] ReportIncident: Enable on kowiki beta (T339275)]], [[gerrit:955735|[beta] Enable ReportIncident for configured beta wikis (T339275)]], [[gerrit:955821|ReportIncident: Set default help page (T343382)]] [08:03:47] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [08:03:51] T343382: Make link to code of conduct and wiki administrators page configurable per wiki - https://phabricator.wikimedia.org/T343382 [08:05:15] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:955732|[beta] ReportIncident: Enable on kowiki beta 
(T339275)]], [[gerrit:955735|[beta] Enable ReportIncident for configured beta wikis (T339275)]], [[gerrit:955821|ReportIncident: Set default help page (T343382)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:05:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet [08:06:48] 10Puppet, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10fgiunchedi) Looks like the alert is working as expected: https://alerts.wikimedia.org/?q=%40sta... [08:07:37] !log kharlan@deploy1002 kharlan: Continuing with sync [08:08:19] (03PS1) 10Tim Starling: Remove PHP 7.2 fallback for array_key_first() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956364 [08:08:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3007.esams.wmnet [08:11:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52391 and previous config saved to /var/cache/conftool/dbconfig/20230911-081126-root.json [08:13:31] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:955732|[beta] ReportIncident: Enable on kowiki beta (T339275)]], [[gerrit:955735|[beta] Enable ReportIncident for configured beta wikis (T339275)]], [[gerrit:955821|ReportIncident: Set default help page (T343382)]] (duration: 09m 44s) [08:13:35] T343382: Make link to code of conduct and wiki administrators page configurable per wiki - https://phabricator.wikimedia.org/T343382 [08:13:35] T339275: Deploy to beta cluster - https://phabricator.wikimedia.org/T339275 [08:13:45] !log UTC morning 
deploys done [08:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 50%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52392 and previous config saved to /var/cache/conftool/dbconfig/20230911-081638-root.json [08:16:42] T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509 [08:17:32] (JobUnavailable) firing: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:20:26] !log rebooting mwdebug1002.eqiad.wmnet [08:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:38] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwdebug1002.eqiad.wmnet [08:20:56] (03PS11) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [08:21:39] (03CR) 10Slyngshede: Allow packing as a .deb (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede) [08:22:22] 10Puppet: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (10fgiunchedi) [08:22:48] 10Puppet: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (10fgiunchedi) [08:24:17] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: Validate SA tokens with the certs of all masters [puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:25:03] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: Validate SA tokens with the certs of all masters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:25:04] !log cgoubert@cumin1001 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1002.eqiad.wmnet [08:26:15] !log rebooting mwdebug1001.eqiad.wmnet [08:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:23] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwdebug1001.eqiad.wmnet [08:26:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52393 and previous config saved to /var/cache/conftool/dbconfig/20230911-082631-root.json [08:28:04] (03PS12) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [08:28:15] PROBLEM - HTTPS Ganeti RAPI esams on ganeti3007 is CRITICAL: connect to address ganeti01.svc.esams.wmnet and port 5080: No route to host https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [08:31:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 75%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52394 and previous config saved to /var/cache/conftool/dbconfig/20230911-083143-root.json [08:31:47] T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509 [08:32:36] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede) [08:32:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52395 and previous config saved to /var/cache/conftool/dbconfig/20230911-083258-arnaudb.json [08:33:02] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [08:33:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1119 with Debian Bookworm in s1 with just 1% T339185', diff saved to https://phabricator.wikimedia.org/P52396 and previous config saved to 
/var/cache/conftool/dbconfig/20230911-083346-marostegui.json [08:33:48] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1001.eqiad.wmnet [08:33:51] T339185: Test MariaDB + Debian bookworm on databases - https://phabricator.wikimedia.org/T339185 [08:34:33] (03PS1) 10Marostegui: db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/956366 (https://phabricator.wikimedia.org/T339185) [08:37:13] !log rebooting mwmaint2002.codfw.wmnet [08:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:19] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwmaint2002.codfw.wmnet [08:40:28] jouncebot: nowandnext [08:40:28] No deployments scheduled for the next 1 hour(s) and 19 minute(s) [08:40:28] In 1 hour(s) and 19 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1000) [08:40:59] (03PS3) 10Urbanecm: Revert "Growth: Disable Add an image on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955049 (https://phabricator.wikimedia.org/T345188) [08:41:04] (03CR) 10Urbanecm: [C: 03+2] Revert "Growth: Disable Add an image on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955049 (https://phabricator.wikimedia.org/T345188) (owner: 10Urbanecm) [08:41:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:41:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52397 and previous config saved to /var/cache/conftool/dbconfig/20230911-084135-root.json [08:41:43] (03Merged) 10jenkins-bot: Revert "Growth: 
Disable Add an image on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955049 (https://phabricator.wikimedia.org/T345188) (owner: 10Urbanecm) [08:42:16] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:955049|Revert "Growth: Disable Add an image on all wikis" (T345188)]] [08:42:18] T345188: Add Image: all wikis ran out of image recommendations - https://phabricator.wikimedia.org/T345188 [08:42:35] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams01_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:13] PROBLEM - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:44:32] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:955049|Revert "Growth: Disable Add an image on all wikis" (T345188)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:45:27] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwmaint2002.codfw.wmnet [08:45:40] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:46:17] !log urbanecm@deploy1002 urbanecm: Continuing with sync [08:46:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:46:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 
100%: Repooling after being recloned T345509', diff saved to https://phabricator.wikimedia.org/P52398 and previous config saved to /var/cache/conftool/dbconfig/20230911-084647-root.json [08:46:51] T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509 [08:48:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P52399 and previous config saved to /var/cache/conftool/dbconfig/20230911-084804-arnaudb.json [08:48:17] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/955916 (owner: 10Muehlenhoff) [08:51:00] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:51:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [08:51:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52400 and previous config saved to /var/cache/conftool/dbconfig/20230911-085129-arnaudb.json [08:51:33] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [08:51:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) In terms of the LVS connections from rows C and D, when we move from old switches to new ones we need to land those on the Spines rather t... 
[08:52:32] (JobUnavailable) resolved: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:52:43] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:955049|Revert "Growth: Disable Add an image on all wikis" (T345188)]] (duration: 10m 27s) [08:52:47] T345188: Add Image: all wikis ran out of image recommendations - https://phabricator.wikimedia.org/T345188 [08:52:50] * urbanecm done [08:54:33] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [08:54:48] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:56:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52401 and previous config saved to /var/cache/conftool/dbconfig/20230911-085640-root.json [08:59:48] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:03:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P52402 and previous config saved to /var/cache/conftool/dbconfig/20230911-090310-arnaudb.json [09:05:13] (03PS1) 10Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/956367 [09:08:44] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[09:10:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956367 (owner: 10Muehlenhoff) [09:11:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52403 and previous config saved to /var/cache/conftool/dbconfig/20230911-091145-root.json [09:11:53] (03PS1) 10Jbond: firewall: move requestctl logic outside of the ferm block [puppet] - 10https://gerrit.wikimedia.org/r/956368 [09:14:33] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [09:18:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52404 and previous config saved to /var/cache/conftool/dbconfig/20230911-091817-arnaudb.json [09:18:19] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [09:18:20] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [09:18:32] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [09:18:32] jouncebot: nowandnext [09:18:32] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [09:18:32] In 0 hour(s) and 41 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1000) [09:18:41] !log rebooting deploy2002.codfw.wmnet [09:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:48] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host deploy2002.codfw.wmnet [09:19:07] (03CR) 10Marostegui: [C: 03+2] db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/956366 
(https://phabricator.wikimedia.org/T339185) (owner: 10Marostegui) [09:22:04] 10SRE, 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10jijiki) [09:22:09] 10SRE, 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10jijiki) @Jhancock.wm I am afraid the server is dead again :( [09:24:20] (03PS2) 10Jbond: firewall: move requestctl logic outside of the ferm block [puppet] - 10https://gerrit.wikimedia.org/r/956368 [09:24:21] !log gehel@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: T342361 - testing blazegraph startup script refactor [09:24:24] T342361: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 [09:24:34] !log gehel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: T342361 - testing blazegraph startup script refactor [09:25:01] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[09:26:45] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2002.codfw.wmnet [09:26:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52405 and previous config saved to /var/cache/conftool/dbconfig/20230911-092650-root.json [09:29:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/956368 (owner: 10Jbond) [09:29:39] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:30:27] ^That's my bad [09:32:37] !log rearmed keyholder on deploy2002.codfw.wmnet [09:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:31] (03PS1) 10Hnowlan: jobqueue: limit thumbnailrender job concurrency further [deployment-charts] - 10https://gerrit.wikimedia.org/r/956370 (https://phabricator.wikimedia.org/T337649) [09:33:58] (03CR) 10Jbond: [C: 03+2] firewall: move requestctl logic outside of the ferm block [puppet] - 10https://gerrit.wikimedia.org/r/956368 (owner: 10Jbond) [09:34:39] (KeyholderUnarmed) resolved: 18 unarmed Keyholder key(s) on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:35:54] (03PS2) 10Hnowlan: jobqueue, thumbor: attempt to limit impact of thumbnailrender job [deployment-charts] - 10https://gerrit.wikimedia.org/r/956370 (https://phabricator.wikimedia.org/T337649) [09:38:24] (03PS1) 10Jbond: firewall: only create stub file in the present changes [puppet] - 10https://gerrit.wikimedia.org/r/956371 [09:38:41] (03PS1) 10Ilias Sarantopoulos: ores-extension: enable lw in enwiki and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956372 (https://phabricator.wikimedia.org/T342115) 
[09:40:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/956371 (owner: 10Jbond) [09:40:49] (03CR) 10CI reject: [V: 04-1] firewall: only create stub file in the present changes [puppet] - 10https://gerrit.wikimedia.org/r/956371 (owner: 10Jbond) [09:42:58] (03PS2) 10Jbond: firewall: only create stub file in the present changes [puppet] - 10https://gerrit.wikimedia.org/r/956371 [09:43:04] (03CR) 10Jbond: [C: 03+2] firewall: only create stub file in the present changes [puppet] - 10https://gerrit.wikimedia.org/r/956371 (owner: 10Jbond) [09:43:39] (03CR) 10Elukey: [C: 03+1] "wow!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956372 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [09:43:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1004.wikimedia.org [09:48:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1004.wikimedia.org [09:48:48] (03PS1) 10Elukey: profile::service_proxy::envoy: set use_ingress for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/956373 (https://phabricator.wikimedia.org/T339890) [09:50:06] (03CR) 10Elukey: [C: 03+2] profile::service_proxy::envoy: set use_ingress for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/956373 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey) [09:51:59] (03PS1) 10Btullis: Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 [09:52:35] (03PS1) 10Jbond: ferm: Add force true to force dir removal [puppet] - 10https://gerrit.wikimedia.org/r/956375 [09:53:08] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[09:53:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2003.wikimedia.org [09:55:52] (03Abandoned) 10Jbond: puppetdb: migrate check to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955928 (https://phabricator.wikimedia.org/T345909) (owner: 10Jbond) [09:56:03] (03Abandoned) 10Jbond: check_puppet_run_changes: update to run on puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/955939 (owner: 10Jbond) [09:56:30] (03PS2) 10Btullis: Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 [09:57:54] (03PS3) 10Btullis: Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) [09:57:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2003.wikimedia.org [09:59:49] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10brouberol) Actually, after having experimented with supporting both OpenSearch and Elasticsearch in spicerack with local experiments, we've decided to put a pin... 
[10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1000) [10:03:40] !log jelto@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab1003.wikimedia.org with OS bullseye [10:07:15] (03PS1) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T339890) [10:08:12] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:32] (03PS1) 10Elukey: ml-services: update Docker image for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/956380 (https://phabricator.wikimedia.org/T339890) [10:09:47] jouncebot: nowandnext [10:09:47] For the next 0 hour(s) and 50 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1000) [10:09:47] In 2 hour(s) and 50 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1300) [10:10:33] (JobUnavailable) firing: (3) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:11:20] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/956380 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey) [10:11:31] (03CR) 10Elukey: [C: 03+2] ml-services: update Docker image for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/956380 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey) [10:11:36] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams01_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:05] (03PS1) 10AikoChou: ml-services: update readability model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/956381 [10:14:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [10:15:32] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:16:02] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage [10:18:29] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage [10:21:12] (03CR) 10AikoChou: [C: 03+2] ml-services: update readability model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/956381 (owner: 10AikoChou) [10:21:54] (03Merged) 10jenkins-bot: ml-services: update readability model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/956381 (owner: 10AikoChou) [10:22:15] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10brouberol) 05Open→03Declined [10:22:42] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: 
Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10Volans) @brouberol thanks for the summary and update! Curator is the dependency that mostly creates issues and I think it would be great if we will plan for a pa... [10:24:23] (03PS1) 10Btullis: Retain python2 on the test hadoop standby role [puppet] - 10https://gerrit.wikimedia.org/r/956383 (https://phabricator.wikimedia.org/T329363) [10:25:14] (03CR) 10Nikerabbit: [C: 03+1] Enable MinT translation service in more wikis - rollout #3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956051 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [10:25:37] (03CR) 10Nikerabbit: [C: 03+1] Disable Special:Contribute on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956078 (https://phabricator.wikimedia.org/T345772) (owner: 10KartikMistry) [10:26:10] (03CR) 10Nikerabbit: [C: 03+1] Add Akan language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: 10Srishakatux) [10:26:19] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43194/console" [puppet] - 10https://gerrit.wikimedia.org/r/956383 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:27:52] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:30:32] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:48] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. 
ATS) - https://phabricator.wikimedia.org/T345334 (10MatthewVernon) Thanks for this; I agree that we should probably (virtually) sit down and talk about this; I wanted to try and make sure we had most of the o... [10:38:18] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:09] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1003.wikimedia.org with OS bullseye [10:42:34] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams01_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:48] (03PS1) 10Volans: CHANGELOG: add changelogs for release v7.2.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/956384 [10:42:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/956375 (owner: 10Jbond) [10:43:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/956383 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:43:49] (03CR) 10Btullis: [V: 03+1 C: 03+2] Retain python2 on the test hadoop standby role [puppet] - 10https://gerrit.wikimedia.org/r/956383 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:44:46] (03PS1) 10Elukey: ml-services: add REQUESTS_CA_BUNDLE env var to rec-api-ng's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/956385 (https://phabricator.wikimedia.org/T339890) [10:46:54] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v7.2.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/956384 (owner: 10Volans) [10:48:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:49:01] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/956385 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey) [10:50:27] (03CR) 10Volans: [C: 03+1] "LGTM, I think though that this will need a manual cleanup of the existing checkout." [puppet] - 10https://gerrit.wikimedia.org/r/955937 (https://phabricator.wikimedia.org/T343894) (owner: 10FNegri) [10:50:54] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v7.2.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/956384 (owner: 10Volans) [10:51:41] (03CR) 10Elukey: [C: 03+2] ml-services: add REQUESTS_CA_BUNDLE env var to rec-api-ng's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/956385 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey) [10:54:26] (03CR) 10Volans: "reply inline" [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [10:55:21] (03PS1) 10Volans: Upstream release v7.2.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/956386 [10:55:30] (03CR) 10Volans: [V: 03+2 C: 03+2] Upstream release v7.2.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/956386 (owner: 10Volans) [10:56:35] (03CR) 10FNegri: [V: 03+1 C: 03+2] [cluster::cloud_management] Don't install prod cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/955937 (https://phabricator.wikimedia.org/T343894) (owner: 10FNegri) [10:57:14] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/948203 (owner: 10Amire80) [10:59:13] !log uploaded spicerack_7.2.2 to apt.wikimedia.org bullseye-wikimedia [10:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:00] (03PS1) 10Clément Goubert: mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 
(https://phabricator.wikimedia.org/T341780) [11:03:25] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10fnegri) 05Open→03Declined I merged the patch above and cleaned up the SRE cookbooks from cloudcumin[1-2]... [11:05:00] (03PS1) 10Clément Goubert: mw-on-k8s: Raise traffic to 5% [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) [11:05:29] (03PS2) 10Clément Goubert: mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) [11:05:37] (03CR) 10CI reject: [V: 04-1] mw-on-k8s: Raise traffic to 5% [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [11:06:14] !log installed spicerack v7.2.2 on both cumin hosts [11:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:17] XioNoX: ^^^ [11:06:22] all yours [11:06:27] thanks! [11:06:34] I'll give it a try later on [11:06:40] (03CR) 10Clément Goubert: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [11:06:55] Heads up! I'm going to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/956372/ with Amir1: [11:07:51] gl! 
[11:07:58] RECOVERY - HTTPS Ganeti RAPI esams on ganeti3007 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.015 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [11:08:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by isaranto@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956372 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [11:08:18] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:34] RECOVERY - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:50] (03Merged) 10jenkins-bot: ores-extension: enable lw in enwiki and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956372 (https://phabricator.wikimedia.org/T342115) (owner: 10Ilias Sarantopoulos) [11:09:06] !log isaranto@deploy1002 Started scap: Backport for [[gerrit:956372|ores-extension: enable lw in enwiki and wikidata (T342115)]] [11:09:09] T342115: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 [11:10:39] !log isaranto@deploy1002 isaranto: Backport for [[gerrit:956372|ores-extension: enable lw in enwiki and wikidata (T342115)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [11:13:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:13:47] (03PS4) 10Winston Sung: Add Akan 
language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) (owner: 10Srishakatux) [11:19:29] (03PS3) 10Muehlenhoff: Remove debian::codename::require::min() checks for Buster [puppet] - 10https://gerrit.wikimedia.org/r/955909 [11:20:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org [11:23:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org [11:26:36] !log Rebooting poolcounter2004.codfw.wmnet [11:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:40] !log isaranto@deploy1002 isaranto: Continuing with sync [11:26:41] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host poolcounter2004.codfw.wmnet [11:28:06] (03CR) 10JMeybohm: [C: 03+1] mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [11:28:22] (03CR) 10JMeybohm: [C: 03+1] mw-on-k8s: Raise traffic to 5% [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [11:30:57] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2004.codfw.wmnet [11:31:49] 10Puppet, 10Infrastructure-Foundations, 10Puppet-Infrastructure: "Ensure hosts are not performing a change on every puppet run" alert is failing - https://phabricator.wikimedia.org/T345909 (10jbond) 05In progress→03Resolved a:03jbond >>! In T345909#9155302, @fgiunchedi wrote: > Looks like the alert is... 
[11:31:59] (03CR) 10Jbond: [C: 03+2] ferm: Add force true to force dir removal [puppet] - 10https://gerrit.wikimedia.org/r/956375 (owner: 10Jbond) [11:32:53] !log isaranto@deploy1002 Finished scap: Backport for [[gerrit:956372|ores-extension: enable lw in enwiki and wikidata (T342115)]] (duration: 23m 46s) [11:32:56] T342115: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 [11:35:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet [11:37:54] (03CR) 10JMeybohm: [C: 03+1] "LGTM, thanks for fixing kafka as well!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 (owner: 10Ebernhardson) [11:39:35] (03CR) 10JMeybohm: [C: 03+1] flink-app: Allow declaring zookeeper clusters by name [deployment-charts] - 10https://gerrit.wikimedia.org/r/955033 (owner: 10Ebernhardson) [11:41:58] !log Rebooting poolcounter2003.codfw.wmnet [11:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:05] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host poolcounter2003.codfw.wmnet [11:42:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet [11:42:45] !log setting binlog format to STATEMENT in x1 eqiad and codfw masters (T337310) [11:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:49] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [11:43:48] (03PS1) 10Muehlenhoff: ferm: Move more files under the service check conditional [puppet] - 10https://gerrit.wikimedia.org/r/956410 (https://phabricator.wikimedia.org/T336497) [11:45:19] (03CR) 10Urbanecm: [C: 03+1] Enable PageNotice on enwiktionary beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO) [11:45:54] !log 
cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2003.codfw.wmnet [11:46:11] (PS1) Ayounsi: Routinator: tmpfs, bump the maximum number of inodes [puppet] - https://gerrit.wikimedia.org/r/956411 (https://phabricator.wikimedia.org/T300955) [11:51:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2014.codfw.wmnet to cluster codfw and group C [11:51:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2014.codfw.wmnet to cluster codfw and group C [11:51:28] (CR) Alexandros Kosiaris: [C: +1] "While I am not the best person to weigh on the spark front, the version split approach seems fine. However, there is the caveat, that you " [docker-images/production-images] - https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [11:52:53] SRE, ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (MoritzMuehlenhoff) Resolved→Open I wanted to re-add the node to the ganeti cluster, but it seems after the mainboard replacement virtualisation is no longer enabled in BIOS, can you please enable that?
[11:53:15] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/956410 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [11:54:24] (PS1) Ladsgroup: Add drop_notification_seen_T337310.py [software/schema-changes] - https://gerrit.wikimedia.org/r/956412 (https://phabricator.wikimedia.org/T337310) [11:59:49] (CR) Btullis: Refactor spark support to build multiple minor versions (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [12:00:15] (CR) Marostegui: [C: +1] Add drop_notification_seen_T337310.py [software/schema-changes] - https://gerrit.wikimedia.org/r/956412 (https://phabricator.wikimedia.org/T337310) (owner: Ladsgroup) [12:03:05] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (Aklapper) Hi (and welcome)! The Phabricator account @Ahoelzl is currently connected to a [personal MediaWiki account](https://phabricator.wikimedia.org/p/Ahoelzl/) and not t...
[12:04:31] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/956411 (https://phabricator.wikimedia.org/T300955) (owner: Ayounsi) [12:06:22] (PS1) JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - https://gerrit.wikimedia.org/r/956414 (https://phabricator.wikimedia.org/T329826) [12:07:38] (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43195/console" [puppet] - https://gerrit.wikimedia.org/r/956414 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm) [12:08:00] (PS1) Arturo Borrero Gonzalez: wmcs: drop cloudservices1004 addresses [dns] - https://gerrit.wikimedia.org/r/956415 (https://phabricator.wikimedia.org/T342621) [12:08:07] (CR) Jelto: [C: +1] "let me know if you want me to merge this" [puppet] - https://gerrit.wikimedia.org/r/948203 (owner: Amire80) [12:09:45] (CR) JMeybohm: [V: +1 C: +2] kubernetes::master: Switch staging to use PKI for SA signing [puppet] - https://gerrit.wikimedia.org/r/956414 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm) [12:11:15] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1054.eqiad.wmnet [12:11:28] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet [12:13:44] (CR) Muehlenhoff: [V: +2 C: +2] Remove python3-build-jessie (Jessie is EOL) [docker-images/production-images] - https://gerrit.wikimedia.org/r/941442 (owner: Hashar) [12:14:48] (PS1) Arturo Borrero Gonzalez: openstack: refresh cloudservices1006 ns address [puppet] - https://gerrit.wikimedia.org/r/956417 (https://phabricator.wikimedia.org/T342621) [12:15:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org [12:17:34] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1054.eqiad.wmnet [12:17:55] 
(CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/956411 (https://phabricator.wikimedia.org/T300955) (owner: Ayounsi) [12:18:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet [12:18:39] (Abandoned) Hashar: ci: enabling docker requires the docker-ce package [puppet] - https://gerrit.wikimedia.org/r/935471 (https://phabricator.wikimedia.org/T341051) (owner: Hashar) [12:18:43] (Abandoned) Hashar: ci: setup dockervolume before Docker daemon [puppet] - https://gerrit.wikimedia.org/r/935405 (https://phabricator.wikimedia.org/T341051) (owner: Hashar) [12:18:45] !log installing libssh2 security updates [12:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:21:51] !log restarting apache/FPM on mediawiki canaries [12:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org [12:23:36] (PS1) Kevin Bazira: ml-services: increase the recommendation-api-ng memory usage [deployment-charts] - https://gerrit.wikimedia.org/r/956017 (https://phabricator.wikimedia.org/T339890) [12:25:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1137.eqiad.wmnet with reason: Maintenance [12:25:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:25:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1137.eqiad.wmnet with reason: Maintenance [12:25:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1137 (T337310)', diff saved to https://phabricator.wikimedia.org/P52408 and previous config saved to /var/cache/conftool/dbconfig/20230911-122535-ladsgroup.json [12:25:39] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [12:25:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:26:55] (PS1) Arturo Borrero Gonzalez: wmcs: refresh DNS addresses [puppet] - https://gerrit.wikimedia.org/r/956419 (https://phabricator.wikimedia.org/T342621) [12:27:19] jouncebot: nowandnext [12:27:19] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [12:27:19] In 0 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1300) [12:29:44] (CR) David Caro: [C: +1] openstack: refresh cloudservices1006 ns address [puppet] - https://gerrit.wikimedia.org/r/956417 (https://phabricator.wikimedia.org/T342621) (owner: Arturo Borrero Gonzalez) [12:30:32] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [12:31:57] (CR) David Caro: [C: +1] wmcs: refresh DNS addresses [puppet] - https://gerrit.wikimedia.org/r/956419 (https://phabricator.wikimedia.org/T342621) (owner: Arturo Borrero Gonzalez) [12:31:59] (PS13) Slyngshede: Allow packing as a .deb [software/bitu] - https://gerrit.wikimedia.org/r/955160 [12:32:08] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: refresh cloudservices1006 ns address [puppet] - https://gerrit.wikimedia.org/r/956417 
(https://phabricator.wikimedia.org/T342621) (owner: Arturo Borrero Gonzalez) [12:32:12] (CR) Cathal Mooney: [C: +1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/956415 (https://phabricator.wikimedia.org/T342621) (owner: Arturo Borrero Gonzalez) [12:32:16] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: refresh DNS addresses [puppet] - https://gerrit.wikimedia.org/r/956419 (https://phabricator.wikimedia.org/T342621) (owner: Arturo Borrero Gonzalez) [12:33:32] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: drop cloudservices1004 addresses [dns] - https://gerrit.wikimedia.org/r/956415 (https://phabricator.wikimedia.org/T342621) (owner: Arturo Borrero Gonzalez) [12:35:52] (CR) Ladsgroup: [C: +2] Add drop_notification_seen_T337310.py [software/schema-changes] - https://gerrit.wikimedia.org/r/956412 (https://phabricator.wikimedia.org/T337310) (owner: Ladsgroup) [12:36:30] (PS7) AOkoth: vrts: apply role and setup hiera values [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [12:36:48] (Merged) jenkins-bot: Add drop_notification_seen_T337310.py [software/schema-changes] - https://gerrit.wikimedia.org/r/956412 (https://phabricator.wikimedia.org/T337310) (owner: Ladsgroup) [12:37:03] (CR) AOkoth: vrts: apply role and setup hiera values (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth) [12:37:07] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1006 - aborrero@cumin1001" [12:37:57] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1006 - aborrero@cumin1001" [12:37:58] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:39:45] (PS1) Jelto: gitlab: use UUID 
in provision filesystem script [puppet] - https://gerrit.wikimedia.org/r/956422 [12:39:48] (PS1) Arturo Borrero Gonzalez: wmcs: eqiad1: drop ns1-next and use ns1 [puppet] - https://gerrit.wikimedia.org/r/956423 (https://phabricator.wikimedia.org/T345240) [12:40:28] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: eqiad1: drop ns1-next and use ns1 [puppet] - https://gerrit.wikimedia.org/r/956423 (https://phabricator.wikimedia.org/T345240) (owner: Arturo Borrero Gonzalez) [12:52:12] (PS1) Majavah: cr-cloud: add ns-recursor.openstack.eqiad1 [homer/public] - https://gerrit.wikimedia.org/r/956429 (https://phabricator.wikimedia.org/T342621) [12:53:30] (CR) Arturo Borrero Gonzalez: [C: +2] cr-cloud: add ns-recursor.openstack.eqiad1 [homer/public] - https://gerrit.wikimedia.org/r/956429 (https://phabricator.wikimedia.org/T342621) (owner: Majavah) [12:59:27] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2055.codfw.wmnet [12:59:33] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [12:59:49] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1037.eqiad.wmnet [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1300). [13:00:04] Func, kart_, and abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:18] \0 [13:01:00] I will also deploy abijeet's patch. [13:01:14] ok!
[13:01:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:01:53] Puppet: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (jbond) Open→In progress p:Triage→Medium [13:01:59] (CR) Giuseppe Lavagetto: [C: +2] noc: remove from role::maintenance [puppet] - https://gerrit.wikimedia.org/r/944920 (owner: Giuseppe Lavagetto) [13:02:06] Func: You can start with your patch. [13:02:25] I am not a deployer ;) [13:04:20] i am semi-around but busy with a wmcs issue, sorry [13:04:24] ah. I have no idea about patch. Anyone else can deploy it? [13:04:37] Lucas_WMDE: ^^ [13:05:01] I’d prefer not to deploy, but let me take a look [13:05:57] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2055.codfw.wmnet [13:06:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:06:25] ok let’s try it I guess [13:06:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1037.eqiad.wmnet [13:06:35] (PS1) Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - https://gerrit.wikimedia.org/r/956432 (https://phabricator.wikimedia.org/T342361) [13:06:38] Func: if the logspam recurs I assume it’ll be noticeable quickly and we can safely revert again?
[13:06:50] yeah [13:07:13] ok, good enough for me [13:07:14] (fwiw the internal DNS issues are at the very least causing beta cluster deploys to fail, so ymmv) [13:07:18] Lucas_WMDE: as an fyi, beta CI is broken [13:07:23] ok [13:07:29] but it’s not expected to affect production right? [13:07:29] I'm not sure how normal CI works [13:07:42] let’s try it out [13:07:47] if it fails I’ll know why, thanks [13:07:56] Lucas_WMDE: no production impact but if CI throws weird errors, it's known [13:08:02] (PS3) Lucas Werkmeister (WMDE): Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - https://gerrit.wikimedia.org/r/956050 (https://phabricator.wikimedia.org/T340697) (owner: Func) [13:08:06] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/956050 (https://phabricator.wikimedia.org/T340697) (owner: Func) [13:08:51] (Merged) jenkins-bot: Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - https://gerrit.wikimedia.org/r/956050 (https://phabricator.wikimedia.org/T340697) (owner: Func) [13:09:10] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:956050|Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" (T340697)]] [13:09:18] T340697: Remove $wgNamespacesWithSubpages overrides for the MediaWiki namespace in production - https://phabricator.wikimedia.org/T340697 [13:09:18] (PS3) KartikMistry: Enable MinT translation service in more wikis - rollout #3 [mediawiki-config] - https://gerrit.wikimedia.org/r/956051 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro) [13:10:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T337310)', diff saved to https://phabricator.wikimedia.org/P52409 and previous config saved to 
/var/cache/conftool/dbconfig/20230911-131001-ladsgroup.json [13:10:08] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [13:10:38] Func: I’m confused by some of the diffConfig output https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/4260/console [13:10:43] there seem to be a lot of "14"s affected [13:11:01] and also e.g. conf-production-zh_yuewiki.json has "16" removed too [13:11:14] !log lucaswerkmeister-wmde@deploy1002 func and lucaswerkmeister-wmde: Backport for [[gerrit:956050|Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" (T340697)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:11:35] is this really correct? [13:11:44] (I should’ve checked this before +2ing, really) [13:11:46] eh let me check [13:12:21] (CR) Ayounsi: [C: +2] makevm: handle sandbox vlan [cookbooks] - https://gerrit.wikimedia.org/r/955730 (https://phabricator.wikimedia.org/T307021) (owner: Ayounsi) [13:13:20] in `mwscript shell zuwikibooks`, `$namespaceInfo->hasSubpages(14)` returns false on mwdebug1002 [13:13:39] but true on mwmaint1002 [13:13:58] (14 being NS_CATEGORY) [13:14:44] (Merged) jenkins-bot: makevm: handle sandbox vlan [cookbooks] - https://gerrit.wikimedia.org/r/955730 (https://phabricator.wikimedia.org/T307021) (owner: Ayounsi) [13:15:27] Lucas_WMDE: eh, I don't know how that is possible, maybe we don't deploy this time [13:15:31] (PS2) Giuseppe Lavagetto: noc: remove profile, module [puppet] - https://gerrit.wikimedia.org/r/944921 [13:16:06] Func: yeah, I don’t understand it either [13:16:14] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host atlas3001.wikimedia.org [13:16:15] I’ll say `n` to scap backport [13:16:15] !log 
ayounsi@cumin1001 START - Cookbook sre.dns.netbox [13:16:22] and find out if it reverts itself or if I need to upload a revert manually ^^ [13:16:26] !log lucaswerkmeister-wmde@deploy1002 Sync cancelled. [13:16:33] ok, it just cancels the sync [13:16:35] * Lucas_WMDE reverts [13:16:41] (CR) Giuseppe Lavagetto: [C: +2] noc: remove profile, module [puppet] - https://gerrit.wikimedia.org/r/944921 (owner: Giuseppe Lavagetto) [13:17:05] (PS1) Lucas Werkmeister (WMDE): Revert "Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace"" [mediawiki-config] - https://gerrit.wikimedia.org/r/956057 [13:17:20] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/956057 (owner: Lucas Werkmeister (WMDE)) [13:18:00] (CR) Lucas Werkmeister (WMDE): "(I should have checked the diffConfig before merging that other change, my bad. It shouldn’t have been merged at all, then this revert wou" [mediawiki-config] - https://gerrit.wikimedia.org/r/956057 (owner: Lucas Werkmeister (WMDE)) [13:18:02] (Merged) jenkins-bot: Revert "Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace"" [mediawiki-config] - https://gerrit.wikimedia.org/r/956057 (owner: Lucas Werkmeister (WMDE)) [13:18:18] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:956057|Revert "Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace""]] [13:18:25] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1001" [13:19:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas3001.wikimedia.org - ayounsi@cumin1001" [13:19:12] !log 
ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:19:12] !log ayounsi@cumin1001 START - Cookbook sre.dns.wipe-cache atlas3001.wikimedia.org on all recursors [13:19:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas3001.wikimedia.org on all recursors [13:19:33] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:19:42] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas3001.wikimedia.org - ayounsi@cumin1001" [13:19:42] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:956057|Revert "Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace""]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:20:07] I’ll let this sync go through so I’m sure everything is on the same page [13:20:08] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [13:20:15] (though it shouldn’t be necessary, strictly speaking) [13:20:22] I wish scap backport will always say Y :) [13:20:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas3001.wikimedia.org - ayounsi@cumin1001" [13:20:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas3001.wikimedia.org [13:20:42] PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100% [13:20:56] Lucas_WMDE: let me know when scap is done. 
[13:21:21] will do [13:22:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119', diff saved to https://phabricator.wikimedia.org/P52411 and previous config saved to /var/cache/conftool/dbconfig/20230911-132210-root.json [13:24:02] SRE, ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (Jhancock.wm) Open→Resolved @MoritzMuehlenhoff it's enabled now. [13:24:22] RECOVERY - Host ganeti2014 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [13:25:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P52412 and previous config saved to /var/cache/conftool/dbconfig/20230911-132507-ladsgroup.json [13:26:17] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-wf2001.codfw.wmnet [13:26:23] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:956057|Revert "Reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace""]] (duration: 08m 04s) [13:26:45] OK. It seems done. [13:27:24] I'll deploy abijeet's patch. Skipping my patch.
[13:27:41] (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43201/console" [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth) [13:27:49] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/956051 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro) [13:28:16] * Lucas_WMDE done [13:28:18] kart_: go ahead [13:28:20] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/956410 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [13:28:28] (sorry, I got distracted for a minute) [13:28:30] (Merged) jenkins-bot: Enable MinT translation service in more wikis - rollout #3 [mediawiki-config] - https://gerrit.wikimedia.org/r/956051 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro) [13:28:33] kart_, thanks! [13:28:48] !log kartik@deploy1002 Started scap: Backport for [[gerrit:956051|Enable MinT translation service in more wikis - rollout #3 (T341445)]] [13:28:51] T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445 [13:29:35] Lucas_WMDE: No problem! [13:30:13] !log kartik@deploy1002 kartik and abi: Backport for [[gerrit:956051|Enable MinT translation service in more wikis - rollout #3 (T341445)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:30:34] abijeet: can you test the patch using mwdebug? [13:30:46] kart_, sure. [13:31:29] Let me know if everything is OK [13:32:14] (CR) LSobanski: [C: +1] gitlab: use UUID in provision filesystem script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956422 (owner: Jelto) [13:32:14] kart_, looks good.
[13:32:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2001.codfw.wmnet [13:33:03] cool. Going ahead. [13:33:42] (PS1) Func: [WIP] Re-reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - https://gerrit.wikimedia.org/r/956058 [13:33:44] !log kartik@deploy1002 kartik and abi: Continuing with sync [13:33:50] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder) [13:34:11] (PS2) Func: [WIP] Re-reapply "Remove $wgNamespacesWithSubpages overrides on the MediaWiki namespace" [mediawiki-config] - https://gerrit.wikimedia.org/r/956058 [13:36:13] (CR) Jelto: [V: +1 C: +1] "looks mostly good to me!" [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth) [13:36:46] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet [13:40:07] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:956051|Enable MinT translation service in more wikis - rollout #3 (T341445)]] (duration: 11m 18s) [13:40:10] T341445: Enable MinT for translatable pages - https://phabricator.wikimedia.org/T341445 [13:40:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P52413 and previous config saved to /var/cache/conftool/dbconfig/20230911-134013-ladsgroup.json [13:40:55] We are done now, abijeet :) [13:41:44] kart_, thanks [13:43:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet [13:43:18] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-wf1001.eqiad.wmnet [13:43:20] RECOVERY - Host mw2444 is UP: PING OK - Packet loss = 0%, RTA = 33.58 ms [13:43:24] PROBLEM - Check systemd state on mw2444 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:26] PROBLEM - puppet last run on mw2444 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:44:32] SRE, ops-codfw, serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (Jhancock.wm) I opened a Dell support ticket to get a replacement. I've rebooted it for now but expect it to go down again. SR: 175669963 [13:44:52] RECOVERY - Check systemd state on mw2444 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:37] (PS2) Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T339890) [13:48:52] RECOVERY - puppet last run on mw2444 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:49:44] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1001.eqiad.wmnet [13:49:47] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-wf1002.eqiad.wmnet [13:51:10] (PS1) Ayounsi: Add atlas_group (VMs) to RIPE atlas policy [homer/public] - https://gerrit.wikimedia.org/r/956435 (https://phabricator.wikimedia.org/T307021) [13:51:26] (PS2) Jelto: gitlab: use UUID in provision filesystem script [puppet] - https://gerrit.wikimedia.org/r/956422 [13:52:38] (CR) Jelto: gitlab: use UUID in provision filesystem script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956422 (owner: Jelto) [13:54:46] (CR) LSobanski: gitlab: use UUID in provision filesystem script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956422 (owner: Jelto) [13:55:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1137 (T337310)', diff saved to https://phabricator.wikimedia.org/P52414 and previous config saved to /var/cache/conftool/dbconfig/20230911-135520-ladsgroup.json [13:55:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance [13:55:24] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [13:55:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance [13:55:36] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1002.eqiad.wmnet [13:55:39] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-wf2002.codfw.wmnet [13:56:04] (CR) Jelto: [C: +2] gitlab: use UUID in provision filesystem script [puppet] - https://gerrit.wikimedia.org/r/956422 (owner: Jelto) [13:56:20] (PS1) Majavah: Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - https://gerrit.wikimedia.org/r/956439 [13:57:07] (CR) David Caro: [C: +1] "LGTM, did not test it though, let me know if you want a thorough test" [puppet] - https://gerrit.wikimedia.org/r/956088 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott) [13:59:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2002.codfw.wmnet [13:59:56] (CR) Herron: [C: +1] superset: Move superset metrics to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse) [14:02:33] (CR) Btullis: [C: +1] "Fab! Thanks."
[puppet] - https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: Andrea Denisse) [14:05:23] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott) [14:06:12] (PS1) Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - https://gerrit.wikimedia.org/r/956440 [14:06:14] (PS1) Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T339890) [14:07:22] (CR) AOkoth: vrts: apply role and setup hiera values (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth) [14:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:40] (PS2) Elukey: modules: copy configuration 1.4.1 to 1.5.0 for mesh [deployment-charts] - https://gerrit.wikimedia.org/r/956440 [14:07:42] (PS2) Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T339890) [14:07:44] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43202/console" [puppet] - https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: Hashar) [14:08:31] (PS1) Arturo Borrero Gonzalez: Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - https://gerrit.wikimedia.org/r/956059 [14:09:04] (CR) Majavah: [C: +1] Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - https://gerrit.wikimedia.org/r/956059 (owner: Arturo Borrero 
Gonzalez) [14:11:24] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans) >>! In T344259#9152935, @Eevans wrote: >>>! In T344259#9152542, @Jclark-ctr wrote: >> Replaced optic and cable again @cmooney @Eevans > > Thanks @Jclark-ctr. U... [14:11:26] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Add a dependency on the opensearch-py client - https://phabricator.wikimedia.org/T345900 (10colewhite) Linking my comment here for visibility: T345337#9150551 [14:11:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - 10https://gerrit.wikimedia.org/r/956059 (owner: 10Arturo Borrero Gonzalez) [14:12:08] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm. Keep in mind that GitLab::Projects contains issues, wiki and snippets currently. So if we want to disable more, we have expand the t" [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [14:15:32] (03PS3) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T339890) [14:15:46] Func: aaaaaAAAAAHHH! 
[14:15:47] (re https://phabricator.wikimedia.org/T340697#9156521) [14:15:53] that’s terrifying [14:16:32] yeah, it even affects $wgNamespaceProtection [14:17:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:43] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: add default_project_features parameter to profile [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [14:17:52] * Lucas_WMDE sprays mediawiki-config with holy water [14:18:31] (03PS1) 10Alexandros Kosiaris: machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491) [14:19:02] (03CR) 10CI reject: [V: 04-1] Add: roll restart/reboot command for opensearch cluster nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [14:19:04] (03CR) 10CI reject: [V: 04-1] machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491) (owner: 10Alexandros Kosiaris) [14:19:34] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS bookworm [14:23:10] (03CR) 10Milimetric: "@Daniel - just added you since I didn't see this setting/content handler mapping used anywhere, and wondered what you thought about this i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric) [14:24:20] (03PS8) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [14:25:20] 10SRE, 10ops-eqiad, 10Goal, 
10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [14:28:19] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T345802 (10Jhancock.wm) 05Open→03Resolved @Clement_Goubert the defective DIMM has been replaced and booted up. Error hasn't repeated yet. `The self-heal operation suc... [14:28:54] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/953631/43203/" [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [14:30:08] (03PS1) 10Mhorsey: Enable Campaign Events email feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956447 (https://phabricator.wikimedia.org/T345704) [14:30:32] (03CR) 10Mhorsey: [C: 04-1] "Do not merge until deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956447 (https://phabricator.wikimedia.org/T345704) (owner: 10Mhorsey) [14:30:41] (03PS1) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) [14:30:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1220.eqiad.wmnet with reason: Maintenance [14:30:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1220.eqiad.wmnet with reason: Maintenance [14:31:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1220 (T337310)', diff saved to https://phabricator.wikimedia.org/P52416 and previous config saved to /var/cache/conftool/dbconfig/20230911-143102-ladsgroup.json [14:31:06] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [14:33:11] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet - 
https://phabricator.wikimedia.org/T345802 (10Clement_Goubert) Thanks @Jhancock.wm ! [14:33:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10aborrero) [14:34:56] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43204/console" [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [14:35:58] (03PS9) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [14:37:47] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) [14:38:38] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [14:39:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220 (T337310)', diff saved to https://phabricator.wikimedia.org/P52417 and previous config saved to /var/cache/conftool/dbconfig/20230911-143937-ladsgroup.json [14:39:41] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [14:40:23] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) 05Open→03In progress p:05Triage→03Medium [14:40:29] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [14:42:23] (03PS2) 10Alexandros Kosiaris: machinetranslation: Add egress mesh template [deployment-charts] - 
10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491) [14:42:59] (03CR) 10CI reject: [V: 04-1] machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491) (owner: 10Alexandros Kosiaris) [14:48:29] (03PS1) 10Andrea Denisse: netmon: Failover from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/956452 (https://phabricator.wikimedia.org/T344136) [14:51:27] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10fnegri) I'm running into a similar issue while reimaging `cloudnet2005-dev.codfw.wmnet` to Bookworm. ` fnegri@cumin1001:~$ sudo cookbook sre.hosts.re... [14:51:49] (03CR) 10Ayounsi: [C: 03+1] Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - 10https://gerrit.wikimedia.org/r/956439 (owner: 10Majavah) [14:52:05] (03CR) 10Ayounsi: [C: 03+2] Add atlas_group (VMs) to RIPE atlas policy [homer/public] - 10https://gerrit.wikimedia.org/r/956435 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi) [14:52:33] (03PS3) 10Alexandros Kosiaris: machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491) [14:53:19] (03CR) 10CI reject: [V: 04-1] machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491) (owner: 10Alexandros Kosiaris) [14:53:40] (03PS1) 10Ayounsi: Add esams sandbox network prefixes [puppet] - 10https://gerrit.wikimedia.org/r/956454 (https://phabricator.wikimedia.org/T307021) [14:54:11] !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1149.eqiad.wmnet [14:54:14] (03CR) 10Jelto: vrts: apply role and setup hiera values (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953631 
(https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [14:54:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220', diff saved to https://phabricator.wikimedia.org/P52418 and previous config saved to /var/cache/conftool/dbconfig/20230911-145443-ladsgroup.json [14:55:26] (03CR) 10Effie Mouzeli: [C: 03+1] mw-on-k8s: Raise traffic to 5% [puppet] - 10https://gerrit.wikimedia.org/r/956390 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [14:55:38] PROBLEM - Check systemd state on kubestagemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:12] !log brouberol@cumin1001 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1149.eqiad.wmnet [14:57:00] RECOVERY - Check systemd state on kubestagemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:14] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, we'll need the DNS change too to go with this, like https://gerrit.wikimedia.org/r/c/operations/dns/+/616709 (no smokeping anymore t" [puppet] - 10https://gerrit.wikimedia.org/r/956452 (https://phabricator.wikimedia.org/T344136) (owner: 10Andrea Denisse) [14:57:28] (03CR) 10Effie Mouzeli: mw-api-ext, mw-web: Raise total replicas to 14 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [14:58:25] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) hey @Jclark-ctr or @Jhancock.wm it would be good for us to know when this reracking can be done in advance, to have the less downtime in... 
[14:59:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:00:31] * volans looking [15:00:58] (03PS3) 10Clément Goubert: mw-api-ext, mw-web: Raise total replicas to 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) [15:01:02] (03CR) 10Clément Goubert: mw-api-ext, mw-web: Raise total replicas to 14 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [15:01:20] (03PS1) 10Andrea Denisse: wikimedia: Failover LibreNMS from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/956455 (https://phabricator.wikimedia.org/T344136) [15:02:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1031.mgmt.eqiad.wmnet with reboot policy FORCED [15:03:17] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, we're not lowering the TTL but I think that's good enough, we can force-refresh as needed" [dns] - 10https://gerrit.wikimedia.org/r/956455 (https://phabricator.wikimedia.org/T344136) (owner: 10Andrea Denisse) [15:03:31] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:04:01] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:04:08] (03PS2) 10Andrea Denisse: wikimedia: Failover LibreNMS from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/956455 (https://phabricator.wikimedia.org/T344136) [15:04:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - 
https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:05:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [15:06:13] 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) After working with Dell, we determined that the drive is bad and they will be sending a replacement. [15:06:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) [15:06:20] PROBLEM - Check systemd state on kubestagemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:33] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1037.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:33] 10SRE, 10ops-codfw, 10Data-Platform-SRE: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Jhancock.wm) p:05Triage→03Medium [15:07:35] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:36] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1038.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:38] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1039.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:39] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1040.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:40] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1041.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:42] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host 
kubernetes1042.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:43] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1043.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:45] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1044.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:52] 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) p:05Triage→03Medium [15:09:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220', diff saved to https://phabricator.wikimedia.org/P52419 and previous config saved to /var/cache/conftool/dbconfig/20230911-150950-ladsgroup.json [15:15:06] RECOVERY - Check systemd state on kubestagemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:09] !log jnuche@deploy1002 Installing scap version "4.60.0" for 595 hosts [15:18:34] (03Abandoned) 10Arturo Borrero Gonzalez: Revert "cr-cloud: add ns-recursor.openstack.eqiad1" [homer/public] - 10https://gerrit.wikimedia.org/r/956439 (owner: 10Majavah) [15:19:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:57] !log jnuche@deploy1002 Installing scap version "4.60.0" for 595 hosts [15:21:35] (03CR) 10Muehlenhoff: [C: 03+2] ferm: Move more files under the service check conditional [puppet] - 10https://gerrit.wikimedia.org/r/956410 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:21:59] !log jnuche@deploy1002 Installation of scap version "4.60.0" completed for 595 hosts [15:23:17] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet2005-dev.codfw.wmnet with OS bookworm [15:24:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1220 (T337310)', diff saved to 
https://phabricator.wikimedia.org/P52420 and previous config saved to /var/cache/conftool/dbconfig/20230911-152456-ladsgroup.json [15:24:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:25:01] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [15:25:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:25:40] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [15:25:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1038.mgmt.eqiad.wmnet with reboot policy FORCED [15:25:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1042.mgmt.eqiad.wmnet with reboot policy FORCED [15:25:59] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1041.mgmt.eqiad.wmnet with reboot policy FORCED [15:26:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1039.mgmt.eqiad.wmnet with reboot policy FORCED [15:27:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1044.mgmt.eqiad.wmnet with reboot policy FORCED [15:28:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1037.mgmt.eqiad.wmnet with reboot policy FORCED [15:28:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1043.mgmt.eqiad.wmnet with reboot policy FORCED [15:30:05] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1530). [15:31:35] hello! Please who can I reach out to add an apple verification file to the .well-known directory in donate.wiki [15:32:10] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [15:33:52] (03PS1) 10Muehlenhoff: Pass ensure->present to the nftables class if selected [puppet] - 10https://gerrit.wikimedia.org/r/956461 [15:34:05] damilare: probably best asked in the less noisy #wikimedia-sre [15:34:21] Unless fundraising own it [15:35:06] thanks RhinosF1, looks like it's more on the side of prod ops. Thanks for the link. [15:36:04] (03CR) 10Elukey: [C: 03+1] Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [15:36:33] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1040.mgmt.eqiad.wmnet with reboot policy FORCED [15:37:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/956461 (owner: 10Muehlenhoff) [15:40:51] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1052.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:01] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1051.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:17] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1053.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:20] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1049.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:23] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1050.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:25] (03CR) 10Ssingh: 
"https://puppet-compiler.wmflabs.org/output/955961/43205/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/955961 (https://phabricator.wikimedia.org/T329219) (owner: 10Ssingh) [15:41:27] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1045.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:29] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1046.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:42] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1048.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:46] !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1149.eqiad.wmnet [15:43:08] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:43:22] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:43:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T343198)', diff saved to https://phabricator.wikimedia.org/P52421 and previous config saved to /var/cache/conftool/dbconfig/20230911-154327-arnaudb.json [15:43:31] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [15:43:48] !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bookworm [15:44:12] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1149.eqiad.wmnet [15:45:07] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:45:37] (03CR) 10Brennen Bearnes: [C: 03+1] "Post hoc, but this seems reasonable." 
[puppet] - 10https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [15:47:18] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kubernetes1047 - jclark@cumin1001" [15:48:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt kubernetes1047 - jclark@cumin1001" [15:48:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:33] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1047.mgmt.eqiad.wmnet with reboot policy FORCED [15:48:51] (03CR) 10Effie Mouzeli: [C: 03+1] "excellent!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/956388 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [15:49:47] (03PS1) 10Func: composer: Install symfony/polyfill-php8x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 [15:49:59] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10MoritzMuehlenhoff) Quick status update; this has seen agreement in the IF SRE meeting, the next step is to sort out which SRE would take care the day-... 
[15:51:33] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:55:09] (03CR) 10Herron: [C: 03+2] profile::prometheus::statsd_exporter: add support for empty mappings [puppet] - 10https://gerrit.wikimedia.org/r/955838 (https://phabricator.wikimedia.org/T345377) (owner: 10Herron) [15:55:18] (03PS1) 10Arturo Borrero Gonzalez: wmcs: remove ns-recursorX FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/956463 (https://phabricator.wikimedia.org/T342621) [15:55:56] (03CR) 10Andrea Denisse: [C: 03+2] superset: Move superset metrics to statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/955776 (https://phabricator.wikimedia.org/T345790) (owner: 10Andrea Denisse) [15:56:10] (03CR) 10Func: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052) (owner: 10Func) [15:57:17] (03CR) 10Jelto: [C: 03+1] "looks good to me for the gitlab firewall config, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/956463 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez) [15:57:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: remove ns-recursorX FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/956463 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez) [15:59:35] !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1150.eqiad.wmnet [16:00:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:00:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:01:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:01:33] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1150.eqiad.wmnet [16:03:03] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage [16:04:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1045.mgmt.eqiad.wmnet with reboot policy FORCED [16:04:16] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1048.mgmt.eqiad.wmnet with reboot policy FORCED [16:04:18] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (10Eevans) >>! In T344259#9156545, @Eevans wrote: >>>! In T344259#9152935, @Eevans wrote: >>>>! In T344259#9152542, @Jclark-ctr wrote: >>> Replaced optic and cable again @cm... 
[16:04:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1053.mgmt.eqiad.wmnet with reboot policy FORCED [16:04:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1046.mgmt.eqiad.wmnet with reboot policy FORCED [16:04:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1049.mgmt.eqiad.wmnet with reboot policy FORCED [16:04:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1051.mgmt.eqiad.wmnet with reboot policy FORCED [16:04:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1052.mgmt.eqiad.wmnet with reboot policy FORCED [16:04:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1050.mgmt.eqiad.wmnet with reboot policy FORCED [16:04:41] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 (owner: 10Slyngshede) [16:05:47] !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1151.eqiad.wmnet [16:06:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:06:10] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage [16:06:55] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1047.mgmt.eqiad.wmnet with reboot policy FORCED [16:07:56] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1151.eqiad.wmnet [16:08:35] !log fnegri@cumin1001 START - Cookbook 
sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS bookworm [16:10:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) [16:10:31] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye [16:10:39] !log brouberol@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1152.eqiad.wmnet [16:11:02] (03PS1) 10Tchanders: Enable partial action blocks on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956465 (https://phabricator.wikimedia.org/T339878) [16:11:02] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye [16:12:35] (03PS1) 10Ssingh: dnsdist: update dnsdist conf version comment [puppet] - 10https://gerrit.wikimedia.org/r/956466 [16:12:49] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1152.eqiad.wmnet [16:13:15] (03PS1) 10Tchanders: Enable partial action blocks on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956467 (https://phabricator.wikimedia.org/T332733) [16:13:33] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43207/console" [puppet] - 10https://gerrit.wikimedia.org/r/956466 (owner: 10Ssingh) [16:14:16] (03PS2) 10Ssingh: dnsdist: update configuration file for version comment [puppet] - 10https://gerrit.wikimedia.org/r/956466 [16:14:46] (03CR) 10Jforrester: composer: Install symfony/polyfill-php8x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func) [16:16:32] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye [16:16:33] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:17:50] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1055.mgmt.eqiad.wmnet with reboot policy FORCED [16:17:53] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1054.mgmt.eqiad.wmnet with reboot policy FORCED [16:18:04] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1056.mgmt.eqiad.wmnet with reboot policy FORCED [16:19:00] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -... [16:19:25] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:08] (03PS1) 10Majavah: icinga: add myself to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/956470 [16:21:33] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:22:44] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (10fnegri) Ignore my previous comment, this turned out to be a one-off issue with the reimage cookbook. Restarting the cookbook a second time, it worked... 
[16:25:39] (03PS2) 10Func: composer: Install symfony/polyfill-php8x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 [16:25:41] (03PS5) 10Func: SiteConfiguration: Make sure the array is a list before appending [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052) [16:25:43] (03PS1) 10Eevans: install: Use from-scratch partman recipe for restbase1030 [puppet] - 10https://gerrit.wikimedia.org/r/956471 (https://phabricator.wikimedia.org/T331713) [16:26:07] (03CR) 10Eevans: [C: 03+1] install: Use from-scratch partman recipe for restbase1030 [puppet] - 10https://gerrit.wikimedia.org/r/956471 (https://phabricator.wikimedia.org/T331713) (owner: 10Eevans) [16:26:34] (03CR) 10Func: composer: Install symfony/polyfill-php8x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func) [16:28:30] (03CR) 10Jforrester: [C: 04-1] "No, you can't use composer for this repo to change what code is available to run with, that's what I'm saying. 
This is now definitely-wron" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func) [16:28:39] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [16:31:18] !log denisse@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host netmon2002.wikimedia.org with OS bookworm [16:31:33] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:31:47] (03CR) 10Func: composer: Install symfony/polyfill-php8x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func) [16:32:29] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [16:32:44] (03CR) 10Hnowlan: [C: 03+1] install: Use from-scratch partman recipe for restbase1030 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956471 (https://phabricator.wikimedia.org/T331713) (owner: 10Eevans) [16:32:52] (03PS4) 10Alexandros Kosiaris: machinetranslation: Add egress mesh template [deployment-charts] - 10https://gerrit.wikimedia.org/r/956444 (https://phabricator.wikimedia.org/T335491) [16:33:00] (03CR) 10Jforrester: "This is fine for test code, but won't work if someone copies to production code (which doesn't have any composer auto-loaded stuff, it run" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052) (owner: 10Func) [16:33:50] (03CR) 10Jforrester: [C: 04-1] composer: Install symfony/polyfill-php8x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func) [16:34:12] (03CR) 10Eevans: [C: 03+2] install: Use from-scratch partman recipe for restbase1030 [puppet] - 
10https://gerrit.wikimedia.org/r/956471 (https://phabricator.wikimedia.org/T331713) (owner: 10Eevans) [16:41:02] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye [16:41:16] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye [16:42:07] (03CR) 10David Caro: [C: 03+1] icinga: add myself to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/956470 (owner: 10Majavah) [16:42:14] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: add myself to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/956470 (owner: 10Majavah) [16:42:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2096.codfw.wmnet with reason: Maintenance [16:42:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2096.codfw.wmnet with reason: Maintenance [16:42:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2096 (T337310)', diff saved to https://phabricator.wikimedia.org/P52423 and previous config saved to /var/cache/conftool/dbconfig/20230911-164249-ladsgroup.json [16:43:57] (03CR) 10Majavah: [C: 03+2] icinga: add myself to ACLs [puppet] - 10https://gerrit.wikimedia.org/r/956470 (owner: 10Majavah) [16:45:16] (03PS2) 10Ebernhardson: flink-app: Allow declaring zookeeper clusters by name [deployment-charts] - 10https://gerrit.wikimedia.org/r/955033 [16:46:17] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [16:47:17] (03PS9) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 [16:48:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
kubernetes1054.mgmt.eqiad.wmnet with reboot policy FORCED [16:48:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1056.mgmt.eqiad.wmnet with reboot policy FORCED [16:48:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1055.mgmt.eqiad.wmnet with reboot policy FORCED [16:49:21] (03PS1) 10Bking: rdf-streaming-updater-k8s: Add egress rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) [16:50:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) [16:52:28] (03Abandoned) 10Func: composer: Install symfony/polyfill-php8x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956022 (owner: 10Func) [16:57:14] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1027.mgmt.eqiad.wmnet with reboot policy FORCED [16:58:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096 (T337310)', diff saved to https://phabricator.wikimedia.org/P52424 and previous config saved to /var/cache/conftool/dbconfig/20230911-165802-ladsgroup.json [16:58:07] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [16:59:34] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [16:59:43] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1027.mgmt.eqiad.wmnet with reboot policy FORCED [16:59:45] (03PS2) 10Bking: rdf-streaming-updater-k8s: Add egress rules to values [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) [16:59:57] (03PS3) 10Bking: rdf-streaming-updater-k8s: Add egress rules to values [deployment-charts] - 
10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1700) [17:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T1700). [17:00:48] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [17:02:56] (03PS6) 10Func: SiteConfiguration: Make sure the array is a list before appending [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052) [17:04:33] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:06:41] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1030.eqiad.wmnet with reason: host reimage [17:09:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1030.eqiad.wmnet with reason: host reimage [17:11:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/956454 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi) [17:12:26] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [17:13:02] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/956461 (owner: 10Muehlenhoff) [17:13:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096', diff saved to https://phabricator.wikimedia.org/P52425 and previous config saved to /var/cache/conftool/dbconfig/20230911-171309-ladsgroup.json [17:15:49] (03CR) 10Jforrester: [C: 03+1] SiteConfiguration: Make sure the array is a list before appending [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052) (owner: 10Func) [17:24:33] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:24:46] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for VRiley - https://phabricator.wikimedia.org/T346077 (10VRiley-WMF) [17:25:28] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for VRiley - https://phabricator.wikimedia.org/T346077 (10RobH) [17:25:45] 10SRE, 10SRE-Access-Requests: Requesting access to sehll/dcops for VRiley - https://phabricator.wikimedia.org/T346077 (10RobH) p:05Triage→03Medium [17:28:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096', diff saved to https://phabricator.wikimedia.org/P52426 and previous config saved to /var/cache/conftool/dbconfig/20230911-172815-ladsgroup.json [17:28:57] (03PS1) 10RobH: adding valarie to dc ops shell group [puppet] - 10https://gerrit.wikimedia.org/r/956479 (https://phabricator.wikimedia.org/T346077) [17:29:11] (03CR) 10Daniel Kinzler: Map Jade content handler to UnknownContentHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric) [17:31:35] (03CR) 10RobH: [C: 03+2] adding valarie to dc ops shell group [puppet] - 
10https://gerrit.wikimedia.org/r/956479 (https://phabricator.wikimedia.org/T346077) (owner: 10RobH) [17:31:46] (03PS7) 10Ebernhardson: Add a networkpolicy template for zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/955032 [17:31:48] (03PS3) 10Ebernhardson: flink-app: Allow declaring zookeeper clusters by name [deployment-charts] - 10https://gerrit.wikimedia.org/r/955033 [17:31:50] (03PS10) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 [17:31:54] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to shell/dcops for VRiley - https://phabricator.wikimedia.org/T346077 (10Aklapper) [17:37:03] (03PS3) 10Ladsgroup: Map Jade content handler to UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric) [17:37:10] (03CR) 10Ladsgroup: [C: 03+2] Map Jade content handler to UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric) [17:38:00] (03Merged) 10jenkins-bot: Map Jade content handler to UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric) [17:41:01] (03CR) 10Daniel Kinzler: "Uh, hold on... it was renamed to FallbackContentHandler in 1.34 I think?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric) [17:43:01] (03PS9) 10Herron: profile::mediawiki::common: include prometheus statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) [17:43:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:43:09] (03PS10) 10Herron: profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) [17:43:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2096 (T337310)', diff saved to https://phabricator.wikimedia.org/P52427 and previous config saved to /var/cache/conftool/dbconfig/20230911-174321-ladsgroup.json [17:43:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [17:43:25] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [17:43:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [17:45:15] (03CR) 10Jbond: "Thanks both for the review on this and sorry its taken me so long to pick it up. 
however would be good to try and get something this week" [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [17:45:24] (03PS4) 10Jbond: rsyslog: update to use pki certificates [puppet] - 10https://gerrit.wikimedia.org/r/936763 (https://phabricator.wikimedia.org/T340741) [17:45:26] (03PS1) 10Jbond: rsyslog: switch the endpoints to use the PKI system [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) [17:46:22] (03CR) 10Jforrester: "This should probably go in the WikimediaMessages extension, which is what we do for undeployed extensions' messages (see https://gerrit.wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric) [17:47:12] (03PS1) 10Ladsgroup: Use FallbackContentHandler instead of FakeContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956482 [17:48:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:48:52] (03CR) 10Ladsgroup: [C: 03+2] Map Jade content handler to UnknownContentHandler (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric) [17:50:15] (03PS2) 10Ladsgroup: Use FallbackContentHandler instead of UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956482 (https://phabricator.wikimedia.org/T345874) [17:51:50] (03CR) 10Jforrester: Map Jade content handler to UnknownContentHandler (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric) [17:53:08] !log denisse@cumin1001 START - Cookbook 
sre.hosts.reimage for host netmon2002.wikimedia.org with OS bullseye [17:58:00] (03CR) 10Xcollazo: "If it helps lower the burden here, I think we could drop the Spark 3.2 build, here or elsewhere (Gitlab?). No one depends on it, and 3.3.X" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [17:58:36] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [17:59:59] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [18:00:00] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1030.eqiad.wmnet with OS bullseye [18:00:07] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye completed: - restbase10... 
[18:00:33] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:08:52] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage [18:11:08] (03PS1) 10Ssingh: Release 9.2.1-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) [18:11:54] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage [18:13:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [18:13:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [18:15:00] (03CR) 10Ssingh: "In the logging patch, note that we are using the Warning and Error macros, but 9.2.1 is using SiteThrottledWarning and SiteThrottledError " [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/956484 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [18:20:10] (03CR) 10Ebernhardson: rdf-streaming-updater-k8s: Add egress rules to values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/956474 (https://phabricator.wikimedia.org/T346048) (owner: 10Bking) [18:25:44] (03CR) 10Bartosz Dziewoński: "Caused T346080?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/955819 (https://phabricator.wikimedia.org/T345874) (owner: 10Milimetric) [18:27:17] (03PS3) 10Bartosz Dziewoński: Use FallbackContentHandler instead of UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956482 (https://phabricator.wikimedia.org/T345874) (owner: 10Ladsgroup) [18:27:31] (03CR) 10STran: [C: 03+1] Enable partial action blocks on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956465 (https://phabricator.wikimedia.org/T339878) (owner: 10Tchanders) [18:27:37] (03CR) 10Krinkle: [C: 04-1] "I suggest enabling in beta cluster on its own first, maybe for a week or two, and instruct enwiktionary community to test (and demonstrate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/668156 (https://phabricator.wikimedia.org/T61245) (owner: 10TTO) [18:27:48] (03CR) 10STran: [C: 03+1] Enable partial action blocks on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956467 (https://phabricator.wikimedia.org/T332733) (owner: 10Tchanders) [18:28:31] (03CR) 10Ladsgroup: [C: 03+2] Use FallbackContentHandler instead of UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956482 (https://phabricator.wikimedia.org/T345874) (owner: 10Ladsgroup) [18:29:03] (03Merged) 10jenkins-bot: Use FallbackContentHandler instead of UnknownContentHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956482 (https://phabricator.wikimedia.org/T345874) (owner: 10Ladsgroup) [18:33:18] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon2002.wikimedia.org with OS bullseye [18:35:33] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:37:51] (03PS1) 10Brennen Bearnes: 
phabricator deployment: restart php when finalizing deploy [puppet] - 10https://gerrit.wikimedia.org/r/956486 (https://phabricator.wikimedia.org/T314460) [18:42:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2131.codfw.wmnet with reason: Maintenance [18:42:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2131.codfw.wmnet with reason: Maintenance [18:42:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2131 (T337310)', diff saved to https://phabricator.wikimedia.org/P52428 and previous config saved to /var/cache/conftool/dbconfig/20230911-184231-ladsgroup.json [18:42:35] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [18:43:19] (03PS2) 10Jbond: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 [18:47:02] (03CR) 10CI reject: [V: 04-1] puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [18:57:38] (03CR) 10Lucas Werkmeister: [C: 03+1] Add lucaswerkmeister.de to Planet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948203 (owner: 10Amire80) [18:58:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T337310)', diff saved to https://phabricator.wikimedia.org/P52429 and previous config saved to /var/cache/conftool/dbconfig/20230911-185813-ladsgroup.json [18:58:18] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [19:03:02] (03PS5) 10Srishakatux: Add Akan language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) [19:09:32] PROBLEM - cassandra-b service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[19:09:34] PROBLEM - cassandra-c service on restbase1030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:09:38] PROBLEM - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.234 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [19:09:54] PROBLEM - cassandra-c SSL 10.64.48.236:7000 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [19:10:12] PROBLEM - cassandra-c CQL 10.64.48.236:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.236 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [19:10:16] PROBLEM - cassandra-b SSL 10.64.48.235:7000 on restbase1030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [19:10:26] PROBLEM - cassandra-b CQL 10.64.48.235:9042 on restbase1030 is CRITICAL: connect to address 10.64.48.235 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [19:13:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P52430 and previous config saved to /var/cache/conftool/dbconfig/20230911-191320-ladsgroup.json [19:14:11] Got those ^^^ [19:18:20] (03PS3) 10Jbond: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 [19:28:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P52431 and previous config saved to /var/cache/conftool/dbconfig/20230911-192826-ladsgroup.json [19:31:25] (03PS1) 10Jforrester: [mathoid] Switch image to GitLab-published one [deployment-charts] - 
10https://gerrit.wikimedia.org/r/956492 (https://phabricator.wikimedia.org/T344747) [19:38:49] (03CR) 10Jbond: "thanks updated" [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [19:43:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T337310)', diff saved to https://phabricator.wikimedia.org/P52432 and previous config saved to /var/cache/conftool/dbconfig/20230911-194332-ladsgroup.json [19:43:37] T337310: Remove notification_seen column from echo_notifications database table - https://phabricator.wikimedia.org/T337310 [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T2000). [20:00:04] No Gerrit patches in the queue for this window AFAICS. [20:00:13] (03PS1) 10Andrew Bogott: nova-fullstack: check dns via auth server rather than recursor [puppet] - 10https://gerrit.wikimedia.org/r/956497 (https://phabricator.wikimedia.org/T346092) [20:01:25] (03CR) 10Andrew Bogott: [C: 03+2] wmf_sink: don't assume project_name == project_id [puppet] - 10https://gerrit.wikimedia.org/r/955973 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [20:01:35] (03CR) 10Andrew Bogott: [C: 03+2] nova_fullstack_test: don't assume project_name == project_id [puppet] - 10https://gerrit.wikimedia.org/r/956088 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [20:02:27] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: check dns via auth server rather than recursor [puppet] - 10https://gerrit.wikimedia.org/r/956497 (https://phabricator.wikimedia.org/T346092) (owner: 10Andrew Bogott) [20:03:11] good, nothing to do! 
:) [20:05:06] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [20:09:11] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt titan1001 - jclark@cumin1001" [20:09:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt titan1001 - jclark@cumin1001" [20:09:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:10:03] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [20:12:13] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt titan1001 - jclark@cumin1001" [20:12:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:13:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt titan1001 - jclark@cumin1001" [20:13:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:13:25] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host titan1001 [20:13:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host titan1001 [20:13:35] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host titan1002 [20:14:11] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes1029.mgmt.eqiad.wmnet with reboot policy FORCED [20:17:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host titan1002 [20:17:10] !log jclark@cumin1001 START - Cookbook 
sre.network.configure-switch-interfaces for host titan1001 [20:17:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host titan1001 [20:18:48] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host titan1001.mgmt.eqiad.wmnet with reboot policy FORCED [20:18:50] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host titan1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:26:51] (03CR) 10Jeena Huneidi: "Well, it is used by this repository: https://gerrit.wikimedia.org/g/releng/local-charts, but I don't think that repo is used by anyone. We" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953633 (owner: 10Jforrester) [20:37:23] (03PS1) 10Jdlrobson: WIP: Logos for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 (https://phabricator.wikimedia.org/T341242) [20:39:15] (03CR) 10Jeena Huneidi: [C: 03+2] update_version: tox.ini: whitelist_externals -> allowlist_externals [deployment-charts] - 10https://gerrit.wikimedia.org/r/955875 (https://phabricator.wikimedia.org/T345695) (owner: 10Hashar) [20:39:46] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host titan1001.mgmt.eqiad.wmnet with reboot policy FORCED [20:39:58] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan1001.eqiad.wmnet'] [20:41:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['titan1001.eqiad.wmnet'] [20:41:25] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan1001.eqiad.wmnet'] [20:41:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['titan1001.eqiad.wmnet'] [20:41:47] (03PS2) 10Jdlrobson: WIP: Logos for special projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956502 
(https://phabricator.wikimedia.org/T341242) [20:41:53] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan1001.eqiad.wmnet'] [20:41:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['titan1001.eqiad.wmnet'] [20:43:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [20:48:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host titan1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:49:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10Jclark-ctr) [20:51:56] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor I ♥ Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230911T2100). 
[21:01:08] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.007e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[21:04:33] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:04:58] Hey all - I’ve got one, maybe two patches for today’s security window.
[21:10:02] (PS1) Jclark-ctr: Add titan100[1-2} site.pp [puppet] - https://gerrit.wikimedia.org/r/956506 (https://phabricator.wikimedia.org/T342179)
[21:11:27] (CR) Jclark-ctr: [C: +2] Add titan100[1-2} site.pp [puppet] - https://gerrit.wikimedia.org/r/956506 (https://phabricator.wikimedia.org/T342179) (owner: Jclark-ctr)
[21:19:40] !log Deployed security fix for T345693
[21:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:10] (CR) RobH: [C: +2] Add titan100[1-2} site.pp [puppet] - https://gerrit.wikimedia.org/r/956506 (https://phabricator.wikimedia.org/T342179) (owner: Jclark-ctr)
[21:24:13] (CR) Cwhite: [C: +2] aptrepo: amend pin to allow grafana 9.4.x [puppet] - https://gerrit.wikimedia.org/r/955014 (https://phabricator.wikimedia.org/T345362) (owner: Cwhite)
[21:24:33] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:32:38] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host titan1001.eqiad.wmnet with OS bookworm
[21:32:45] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host titan1001.eqiad.wmnet with OS bookworm
[21:32:46] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host titan1002.eqiad.wmnet with OS bookworm
[21:32:52] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host titan1002.eqiad.wmnet with OS bookworm
[21:33:19] !log update grafana to 9.4.14 on grafana1002 T345362
[21:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:22] T345362: DatasourceError grafana alerting error message database is locked - https://phabricator.wikimedia.org/T345362
[21:36:52] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (Jclark-ctr)
[21:37:48] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting Creation of a new POSIX group and system user for the Analytics WMDE team. - https://phabricator.wikimedia.org/T345726 (RLazarus) a:joanna_borun Hi @joanna_borun -- does this need Infrastructure Foundations approval?
[21:43:08] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:47:40] (PS1) Eevans: Revert "install: Use from-scratch partman recipe for restbase1030" [puppet] - https://gerrit.wikimedia.org/r/956063
[21:51:44] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:55:07] SRE, ops-eqiad, DC-Ops: hw troubleshooting: Failing SSD for restbase1030.eqiad.wmnet - https://phabricator.wikimedia.org/T344259 (Eevans) Open→Resolved >>! In T344259#9157174, @Eevans wrote: >>>! In T344259#9156545, @Eevans wrote: >> [ ... ] >> @Jclark-ctr could we try connecting something el...
[22:03:28] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (RLazarus) Hi @Ahoelzl, welcome to the Foundation! SRE here, I'll be able to set you up with production access. The SSH key you provided is the same one you're already using...
[22:03:33] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (RLazarus)
[22:25:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343198)', diff saved to https://phabricator.wikimedia.org/P52434 and previous config saved to /var/cache/conftool/dbconfig/20230911-222536-arnaudb.json
[22:25:40] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[22:25:44] (CR) RLazarus: [C: +1] envoyproxy: tox.ini: whitelist_externals -> allowlist_externals [puppet] - https://gerrit.wikimedia.org/r/955876 (https://phabricator.wikimedia.org/T345695) (owner: Hashar)
[22:40:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P52435 and previous config saved to /var/cache/conftool/dbconfig/20230911-224042-arnaudb.json
[22:42:26] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[22:52:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host titan1001.eqiad.wmnet with OS bookworm
[22:52:59] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host titan1001.eqiad.wmnet with OS bookworm executed with errors:...
[22:53:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host titan1002.eqiad.wmnet with OS bookworm
[22:53:06] SRE, ops-eqiad, DC-Ops, SRE Observability, observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host titan1002.eqiad.wmnet with OS bookworm executed with errors:...
[22:55:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P52436 and previous config saved to /var/cache/conftool/dbconfig/20230911-225548-arnaudb.json
[23:02:26] (CR) Cwhite: [C: +1] netmon: Failover from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/956452 (https://phabricator.wikimedia.org/T344136) (owner: Andrea Denisse)
[23:07:43] (CR) Cwhite: [V: +1] "Tests ok on grafana2001. Ready for deploy." [puppet] - https://gerrit.wikimedia.org/r/951882 (https://phabricator.wikimedia.org/T288196) (owner: Cwhite)
[23:10:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343198)', diff saved to https://phabricator.wikimedia.org/P52437 and previous config saved to /var/cache/conftool/dbconfig/20230911-231054-arnaudb.json
[23:10:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[23:11:00] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:11:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[23:11:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[23:11:25] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[23:11:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T343198)', diff saved to https://phabricator.wikimedia.org/P52438 and previous config saved to /var/cache/conftool/dbconfig/20230911-231131-arnaudb.json
[23:11:48] (PS1) Dduvall: gitlab: Fix conditional end in gitlab.rb template [puppet] - https://gerrit.wikimedia.org/r/956515
[23:14:05] (CR) Dduvall: "FYI I noticed this bug while trying to test T337570 in devtools." [puppet] - https://gerrit.wikimedia.org/r/956515 (owner: Dduvall)
[23:31:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T343198)', diff saved to https://phabricator.wikimedia.org/P52439 and previous config saved to /var/cache/conftool/dbconfig/20230911-233135-arnaudb.json
[23:31:39] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:46:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P52440 and previous config saved to /var/cache/conftool/dbconfig/20230911-234641-arnaudb.json